
I think your two-pass approach is fine in principle. Because you can't know in advance whether a record has duplicates, a single-pass approach has to keep the IDs of all records in memory just in case. Whether this is feasible depends on the expected size of the file.

A suggestion for a single-pass approach would be to make $dup{$key} an array-ref:

my %dup;   # key => array-ref of all IDs seen with that key

while (<STDIN>) {
   chomp;
   if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
      my ($id, $key) = ($1, $2);
      push @{$dup{$key}}, $id;  #this is modified!
      # check if any key has init caps (not allowed)
      if ($key =~ m/^[A-Z]/) {
         print "Id: $id - $key\n";
      }
   }
}

print "Duplicated keys:\n";

for my $key (keys %dup) {
   my $ids   = $dup{$key};   # array-ref of IDs for this key
   my $count = @$ids;        # number of IDs (array in scalar context)
   next unless $count > 1;
   print "$key ($count)\n";
   print "Id: $_\n" for @$ids;
}

Update: I failed to see that you read the whole file into @lines to begin with. Code modified to avoid this. My comment about single-pass vs. two-pass is a bit less relevant in light of that.

More

This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:

This confused me at first, because I didn't read your code carefully. In your original code you don't actually go through the file twice (in I/O terms). You read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also reads the file only once (in a while loop), populating a deeper data structure (a hash of arrays); then, in a second loop, it goes over the elements of that hash and prints those with more than one ID.

As for writing the whole program in a single loop, that's not possible, because you basically have to do a group-by: the complete list of IDs for a key is only known once all the input has been read. Random_Walk above cheats by assuming there can be at most one duplicate for any given textual key.
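
To make the group-by point concrete, here is a minimal sketch on in-memory data. The records, the sub name duplicated_keys and the key names are made up for illustration and are not taken from your file. It shows a key that occurs three times, which is the case where remembering only one earlier ID per key would lose information:

```perl
use strict;
use warnings;

# Hypothetical (id, key) pairs standing in for the parsed records.
my @records = (
    [ 101, 'alpha' ],
    [ 102, 'beta'  ],
    [ 103, 'alpha' ],
    [ 104, 'gamma' ],
    [ 105, 'alpha' ],
);

# Group IDs by key, then return only the keys seen more than once.
# Takes a list of [id, key] array-refs; returns key => array-ref of IDs.
sub duplicated_keys {
    my @recs = @_;

    # First loop: collect every ID under its key (hash of arrays).
    my %ids_for;
    push @{ $ids_for{ $_->[1] } }, $_->[0] for @recs;

    # Second loop: keep only the groups with more than one ID.
    my %dups;
    for my $key (keys %ids_for) {
        $dups{$key} = $ids_for{$key} if @{ $ids_for{$key} } > 1;
    }
    return %dups;
}

my %dups = duplicated_keys(@records);
for my $key (sort keys %dups) {
    my @ids = @{ $dups{$key} };
    printf "%s (%d): %s\n", $key, scalar @ids, join(', ', @ids);
}
# prints: alpha (3): 101, 103, 105
```

The count for alpha is not known until the last record has been seen, which is why the reporting has to happen in a separate loop after the grouping.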

In reply to Re: I sense there is a simpler way... by calin
in thread I sense there is a simpler way... by HelgeG
