Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Help with removing dupes from a string with perl

by CG_man (Novice)
on Jul 09, 2009 at 04:35 UTC ( #778479=perlquestion: print w/replies, xml ) Need Help??

CG_man has asked for the wisdom of the Perl Monks concerning the following question:

I've tried many different things on this question that I found on different forums and nothing seems to work. My question is how can you remove duplicate entries from a SQL query in a perl script, where it keeps the newest entry? Right now the results are as follows from the query;

Jobname status Comment Reason
------- ------ ------- ------
job A HOLD 12:35 per test
job B HOLD 1:45 per user
job C HOLD 3:00 maint
job C HOLD 3:30 per Failure

I need the results to only give me the newest job C and not both.
Here is my perl script

#!/users/contrib/bin/perl

use Sybase::Sybperl;
use Env qw(BATCH_USER_NAME BATCH_PASSWORD);

##############
############## Subroutines
##############

# Send input string to the DB server and get the results
# $_[0] - sql string
sub DBEXEC {
&dbcmd($dbproc, $_[0]);
&dbsqlexec($dbproc);
&dbresults($dbproc);
}
# Turn UTC in to people time
sub GETTIME {
if ($_[0] eq "999999999") {
$time = ("******** **********");
} else {
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($_[0]);
$mon++;
$time = sprintf ("%.2d:%.2d:%.2d %.2d/%.2d/%.2d", $hour, $min, $sec, $mon, $mday, $year + 1900);
}
}
# set job status
sub JOBSTATUS {
if ($_[0] == 9) {$jstatus1 = ("ACTIVATED");}
elsif ($_[0] == 5) {$jstatus2 = ("FAILURE");}
elsif ($_[0] == 8) {$jstatus3 = ("INACTIVE");}
elsif ($_[0] == 11) {$jstatus4 = ("ON_HOLD");}
elsif ($_[0] == 7) {$jstatus5 = ("ON_ICE");}
elsif ($_[0] == 12) {$jstatus5 = ("QUE_WAIT");}
elsif ($_[0] == 10) {$jstatus6 = ("RESTART");}
elsif ($_[0] == 1) {$jstatus7 = ("RUNNING");}
elsif ($_[0] == 3) {$jstatus8 = ("STARTING");}
elsif ($_[0] == 4) {$jstatus9 = ("SUCCESS");}
elsif ($_[0] == 6) {$jstatus10 = ("TERMINATED");}
}

# set event status
sub EVENTSTCODE {
if ($_[0] == 110) {$jstatus1 = ("JOB_ON_ICE");}
elsif ($_[0] == 112) {$jstatus2 = ("JOB_ON_ICE");}
elsif ($_[0] == 106) {$jstatus3 = ("ALARM");}
elsif ($_[0] == 108) {$jstatus4 = ("FORCE_START");}
elsif ($_[0] == 105) {$jstatus5 = ("KILLJOB");}
elsif ($_[0] == 111) {$jstatus5 = ("JOB_OFF_ICE");}
elsif ($_[0] == 113) {$jstatus6 = ("JOB_OFF_HOLD");}
elsif ($_[0] == 117) {$jstatus7 = ("COMMENTS");}

}

format STDOUT6 =
@<<<< @<<<< @<<<< @<<<<
$name, $jstatus, $estamp, $etext
.

$reporttime = &GETTIME(time);

# login to the database
$dbproc = &dblogin(USER, PSSWD, 'SQL_DEV_SERVER');

############
############ Loop for job status check on HOLD.
############
foreach $statusnumber (11) {
foreach $eventnumber (117) {
$statusname = &JOBSTATUS($statusnumber);
$eventstatus = &EVENTSTCODE($eventnumber);
# Submit SQL Statement
$~ = STDOUT6;
print "Jobs on HOLD w/comments last Issued for job \n";
print " \n";
print "Jobname Status Comment reason\n";
print "------- ------ ------- ---------\n";
&DBEXEC("select proc_eventvu.job_name, jobst.status, proc_eventvu.stamp, proc_eventvu.text from proc_eventvu, jobst where proc_eventvu.joid = jobst.joid and proc_eventvu.event in (112,$eventnumber) and jobst.status = $statusnumber order by job_name");
while(@status = &dbnextrow($dbproc)){
$name = ($status[0]);
$jstatus = &JOBSTATUS($status1);
$estamp = ($status2);
$etext = ($status3);

write;}
print " \n";
print " \n";
}
}

I'd appreciate any help, code, and explination so I can understand what I am missing
  • Comment on Help with removing dupes from a string with perl

Replies are listed 'Best First'.
Re: Help with removing dupes from a string with perl
by 1Nf3 (Pilgrim) on Jul 09, 2009 at 05:57 UTC

    Please, use <code> </code> tags when posting your code.

    As to the question, a quick fix solution you could try is getting the data from the database sorted descending by time, so that the newest times come first. Then, you can push the first occurence of a job name onto a list of data rows, and when the while loop ends, you can print the rows.

    # I'm not sure if it's the correct statement # for getting the data sorted newest first, test it first &DBEXEC("select proc_eventvu.job_name, jobst.status, proc_eventvu.stam +p, proc_eventvu.text from proc_eventvu, jobst where proc_eventvu.joid = jobst.joid and proc_eventvu.event in (112,$eventnumber) and jobst.status = $statusnumber order by proc_eventvu.stamp desc"); my %already_processed; my @results_lol; while(@status = &dbnextrow($dbproc)) { $name = ($status[0]); $jstatus = &JOBSTATUS($status1); $estamp = ($status2); $etext = ($status3); # only the job data for a jobname # that wasn't processed before # make it to the result list unless (defined $already_processed{$name}) { push @results_lol, [($name, $jstatus, $estamp, $etext)]; $already_processed{$name} = 1; } } # here, if you want the rows sorted by name, you could # write some necessary code for my $row_ref (@results_lol) { my ($name, $jstatus, $estamp, $etext) = @$row_ref; write; }

    It's untested, but try if it works. And I wholeheartedly second the suggestion about  use strict; use warnings;, as well as the advice about the JOBSTATUS subroutine.

    Regards,
    Luke

Re: Help with removing dupes from a string with perl
by jrsimmon (Hermit) on Jul 09, 2009 at 12:00 UTC
    JavaFan has the right idea. This is more easily done by a properly formatted SQL select statement.

    use strict; use warnings; use DBI; my @alljobData = (); my $dbh = DBI->connect(...); my $stmt = "SELECT JOBNAME, STATUS, TIMESTAMP, COMMENT FROM YOUR_TABLE +S WHERE JOBNAME = ? AND STATUS = ? ORDER BY TIMESTAMP DESC"; my $sth = $dbh->prepare($stmt); foreach my $job (@jobname){ my $rc = $sth->execute($job, "HOLD"); #either check for an error or let failing on fetch do that for you if(my @thisjobData = $sth->fetchrow_array){ push(@allJobData, @thisjobData); } }

    You could actually write the sql in a way that it would pull only the most current row for every job name in a single statement, but I don't know the syntax off the top of my head.
      There are several ways to do it, depending on your database. The most portable is:
      SELECT t.jobname , t.status , t.timestamp , t.comment FROM your_table t JOIN ( SELECT jobname, MAX(timestamp) as timestamp FROM your_table GROUP BY jobname ) max_timestamp ON max_timestamp.jobname = t.jobname AND max_timestamp.timestamp = t.timestamp
      Of course this assumes that you don't have 2 entries with the same name and the same timestamp. It is usually less efficient than selecting back and filtering outside of the database.

      Google tells me that Sybase doesn't yet support the better SQL 2003 solution of analytic queries.

      Also a note about your solution. I would strongly recommend ordering on jobname and then timestamp. This will put all of the records for each jobname together, making it easy to figure out which is first and which you should throw. Otherwise you need to keep a potentially large hash of which jobs you've seen. (You skipped that step in your code.)

        Heh, I just rolled my own first() and last() aggregators in Postgresql yesterday because I wanted something like this (googling for analytic queries found a somewhat related link that mentioned first() and last() aggregators as I was curious if my solution was in-line with SQL 2003's solution).

        The process was a bit more interesting and difficult than I expected. It convinced me that Postgresql doesn't have a general solution for user-defined polymorphic aggregators because it doesn't allow polymorphic user-defined data types (you can't use "anyelement" as part of a "create type" construct -- at least in the versions I was dealing with, 8.2 and 8.3).

        So I was able to create some polymorphic aggregators but can see cases where these approaches would not work (and the first route I tried failed).

        The following test query:

        select first(v), first_nonnull(v), last(v), last_nonnull(v) from ( select NULL::text as v union all select 'one' as v union all select 'two' as v ) as q union all select first(v), first_nonnull(v), last(v), last_nonnull(v) from ( select 'one'::text as v union all select 'two' as v union all select NULL as v ) as q;

        produces the following results:

        first | first_nonnull | last | last_nonnull -------+---------------+------+-------------- | one | two | two one | one | | two (2 rows)

        - tye        

        In Sybase you can use a a group by and a having clause, even without using any aggregates in the main query.

        So something like

        select jobname, status, timestamp, comment from the_table group by jobname having timestamp = max(timestamp)
        should return something reasonable without the need for a sub-query.

        Michael

      The SQL for looking up a unique column will roughly look like this(depending upon your database),

      SELECT DISTINCT(JOBNAME), STATUS, TIMESTAMP, COMMENT FROM YOUR_TABLES +WHERE STATUS = ? ORDER BY TIMESTAMP DESC

      Now you remove that loop for looking up individual job names.

        With DB2, using DISTINCT in that manner won't guarantee that you get the most recent result. DISTINCT will be evaluated before the ORDER BY clause and may not find the most recent record first. I'm not sure whether this behavior is standard among databases or not.

        From the db2 doc:
        When values match in the specified columns of two or more rows, the distinct operator retains only one of the duplicate rows in the result data set and discards the others. When duplicate rows are found, only the first row that is read during processing is retained. Because the order of rows is not guaranteed during processing, choose carefully when you select columns for the distinct operation.
        To ensure you get the most recent occurrence, you'd probably need to use a temp table in the statement (at least that's how I've seen it done).
Re: Help with removing dupes from a string with perl
by Anonymous Monk on Jul 09, 2009 at 05:13 UTC
Re: Help with removing dupes from a string with perl
by JavaFan (Canon) on Jul 09, 2009 at 08:56 UTC
    I would let the database do the work. Write your SQL so that it doesn't return rows you don't want.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://778479]
Approved by AnomalousMonk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2020-12-01 00:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?