http://qs321.pair.com?node_id=384503


in reply to Re: I sense there is a simpler way...
in thread I sense there is a simpler way...

Random Walk and calin, thank you both for your replies. Creating a deep data structure (hash of arrays) was the step that eluded me.
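For anyone else who gets stuck at the same point, a hash of arrays is just a hash whose values are array references that you push onto; here is a minimal sketch (the hash name and keys below are invented for illustration, not taken from my real data):

    # minimal hash-of-arrays sketch
    my %ids_for;
    push @{ $ids_for{'someKey'} },    101;   # autovivifies the array ref
    push @{ $ids_for{'someKey'} },    102;   # second id under the same key
    push @{ $ids_for{'anotherKey'} }, 103;

    for my $key ( keys %ids_for ) {
        print "$key: @{ $ids_for{$key} }\n";
    }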

Is gobbling an entire file into an array considered bad form? My datafile is roughly half a megabyte in size, so I figured memory was not an issue. I can see, however, that reading the file line by line makes for more scalable code.
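To make the trade-off concrete, here is a rough sketch of both idioms, slurping into an array versus reading line by line (the filename is only a placeholder):

    # slurping: the whole file lives in memory at once
    open my $fh, '<', 'causes.txt' or die "open failed: $!";
    my @lines = <$fh>;
    close $fh;

    # line by line: only one record in memory at a time
    open $fh, '<', 'causes.txt' or die "open failed: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        # process $line here
    }
    close $fh;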

What was bugging me about my own code was that I in fact had an N+1 pass approach, where N was the number of duplicated keys. I was reading the file once, and then cycling several times over the array.

calin, you are right that there can be more than one duplicate for any textual field, so the code needs to account for this.

Again, thanks for the time both of you spent looking at my code; it is much appreciated!

Re^3: I sense there is a simpler way...
by jdporter (Paladin) on Aug 22, 2004 at 20:30 UTC
    Is gobbling an entire file into an array considered bad form? . . .

    One should always be aware of the efficiency concern. If you're sure the file will never be "too big", slurping (as it's called) shouldn't be a problem. Otherwise, you'd do well to do per-record reading/processing, where practical.

    Calin's solution is good. If you want a little extra efficiency, you can buy it with memory, i.e. data structures. In the solution below, we maintain a separate hash for those keys which are known to be duplicates. Then, at the end, we iterate only over that hash. This has a pay-off if the number of duplicate keys is significantly smaller than the total number of keys.

    my( %keys, %dup );
    while (<STDIN>) {
        chomp;
        if ( /PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/ ) {
            my( $id, $key ) = ( $1, $2 );
            if ( exists $dup{$key} ) {          # already found to be a dup
                push @{ $dup{$key} }, $id;
            }
            elsif ( exists $keys{$key} ) {      # only seen once before
                push @{ $dup{$key} }, delete($keys{$key}), $id;
            }
            else {                              # first time seen
                $keys{$key} = $id;
            }
            # check if any key has init caps (not allowed)
            if ( $key =~ /^[A-Z]\w*/ ) {
                print "Id: $id - $key\n";
            }
        }
    }
    print "\nDuplicated keys:\n\n";
    for my $key ( keys %dup ) {
        print "Key: $key\n";
        print "\tId: $_\n" for @{$dup{$key}};
    }
    (Not tested)
      jdporter, thanks. I knew there was a reason for entering the monastery, and the replies I have received to my query have been interesting and educational.

      I like this last solution, where instead of going through the entire data on the second pass, we only look at the known duplicates.

      Working with Perl for the last few weeks has been something of a revelation to me. It is amazing how much real work can be accomplished with a few lines of carefully chosen code.