Re: I sense there is a simpler way...

by calin (Deacon)
on Aug 19, 2004 at 17:06 UTC (#384378)

in reply to I sense there is a simpler way...

I think your two-pass approach is fine in principle. Because you can't know in advance whether a record has duplicates, a single-pass approach has to keep the IDs of all records in memory just in case. Whether this is feasible depends on the expected size of the file.

A suggestion for a single-pass approach would be to make $dup{key} an array-ref:

    while (<STDIN>) {
        chomp;
        if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
            ($id, $key)= ($1, $2);
            push @{$dup{$key}}, $id; #this is modified!
            # check if any key has init caps (not allowed)
            if ($key =~ m/^[A-Z]\w*/) {
                print "Id: $id - $key\n";
            }
        }
    }

    print "Duplicated keys:\n";
    for my $key (keys %dup) {
        my ($ids, $count) = map {$_, scalar @$_} $dup{$key};
        next unless $count > 1;
        print "$key ($count)\n";
        print "Id: $_\n" for @$ids;
    }

Update: I failed to see that you read the whole file into @lines to begin with. Code modified to avoid this. My comment about single-pass versus two-pass becomes a bit irrelevant in that light.


"This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:"

This confused me at first, because I didn't read your code carefully. In your original code you don't actually go through the file twice in I/O terms: you read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also reads the file only once (in a while loop), populating a deep data structure (a hash of arrays); a second loop then goes over the elements of that hash and prints the keys that have more than one ID.
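
The hash-of-arrays idea can be sketched in isolation roughly as follows. The sample records and the helper names group_ids_by_key and duplicated_keys are made up for illustration; only the regex is taken from the code in this thread, and the real file may well look different.

```perl
use strict;
use warnings;

# Hypothetical sample records; the real file format is only known
# through the regex used in this thread.
my @sample = (
    'PROBABLECAUSE(101, x, engine_fault)',
    'PROBABLECAUSE(102, x, low_fuel)',
    'PROBABLECAUSE(103, x, engine_fault)',
    'PROBABLECAUSE(104, x, engine_fault)',
);

# First loop: group every ID under its textual key (hash of arrays).
sub group_ids_by_key {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        next unless $line =~ /PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/;
        push @{ $ids_for{$2} }, $1;
    }
    return \%ids_for;
}

# Second loop: keep only the keys that collected more than one ID.
sub duplicated_keys {
    my ($ids_for) = @_;
    my @dups = grep { @{ $ids_for->{$_} } > 1 } keys %$ids_for;
    return sort @dups;
}

my $ids_for = group_ids_by_key(@sample);
for my $key ( duplicated_keys($ids_for) ) {
    my @ids = @{ $ids_for->{$key} };
    print "$key (", scalar @ids, "): @ids\n";
}
```

Because every ID is pushed onto the array for its key, a key that occurs three or more times is handled the same way as one that occurs twice.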

As for writing the whole program in a single loop, that isn't possible, because you are essentially doing a group-by: you can't tell that a key is duplicated until you have seen all of its records. Random_Walk above cheats by assuming there can be at most one duplicate for any given textual key.

Re^2: I sense there is a simpler way...
by HelgeG (Scribe) on Aug 20, 2004 at 00:24 UTC
    Random Walk and calin, thank you both for your replies. Creating a deep data structure (hash of arrays) was the step that eluded me.

    Is gobbling an entire file into an array considered bad form? My datafile is roughly half a megabyte in size, so I figured memory was not an issue. I can see however that reading the file line by line makes for more scalable code.

    What was bugging me about my own code was that I in fact had an N+1 pass approach, where N was the number of duplicated keys. I was reading the file once, and then cycling several times over the array.

    calin, you are right about the fact that there can be more than a single duplicate for any textual field, so the code needs to account for this.

    Again, thanks for the time the both of you spent looking at my code, it is much appreciated!

      Is gobbling an entire file into an array considered bad form? . . .

      One should always be aware of the efficiency concern. If you're sure the file will never be "too big", slurping (as it's called) shouldn't be a problem. Otherwise, you'd do well to read and process the file record by record, where practical.
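
      To make the difference concrete, here is a minimal sketch of both reading styles. The data and the names count_slurped and count_streamed are invented for the example, and an in-memory filehandle stands in for the real file only so that the snippet is self-contained.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the real file.
my $data = join '', map { "record $_\n" } 1 .. 5;

# Slurping: every line is held in memory at once.
sub count_slurped {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @lines = <$fh>;
    close $fh;
    return scalar @lines;
}

# Per-record reading: only the current line is held in memory.
sub count_streamed {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++;
    }
    close $fh;
    return $count;
}

print "slurped:  ", count_slurped($data), "\n";
print "streamed: ", count_streamed($data), "\n";
```

      Both versions see the same records; the difference is only in how many of them are in memory at any one time, which matters little for a file of half a megabyte.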

      Calin's solution is good. If you want a little extra efficiency, you can buy it with memory, i.e. data structures. In the solution below, we maintain a separate hash for those keys which are known to be duplicates. Then, at the end, we iterate only over that hash. This has a pay-off if the number of duplicate keys is significantly smaller than the total number of keys.

      my( %keys, %dup );

      while (<STDIN>) {
          chomp;
          if ( /PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/ ) {
              my( $id, $key ) = ( $1, $2 );

              if ( exists $dup{$key} ) # already found to be a dup
              {
                  push @{ $dup{$key} }, $id;
              }
              elsif ( exists $keys{$key} ) # only seen once before
              {
                  push @{ $dup{$key} }, delete($keys{$key}), $id;
              }
              else # first time seen
              {
                  $keys{$key} = $id;
              }

              # check if any key has init caps (not allowed)
              if ( $key =~ /^[A-Z]\w*/ ) {
                  print "Id: $id - $key\n";
              }
          }
      }

      print "\nDuplicated keys:\n\n";
      for my $key ( keys %dup ) {
          print "Key: $key\n";
          print "\tId: $_\n" for @{$dup{$key}};
      }
      (Not tested)
        jdporter, thanks. I knew there was a reason for entering the monastery, and the replies I have received to my query have been interesting and educational.

        I like this last solution where instead of going through the entire data on the second pass, we only look at known duplicates.

        Working with Perl for the last few weeks has been somewhat of a revelation to me. It is amazing how much real work can be accomplished with a few lines of carefully chosen code.
