Helge,

Here is a beginning; it parses the input data only once. It could be more efficient: the regexp could possibly be turned into a split on /,\s*|\(/, followed by a test on the first field of the split. I'm not sure what sort of data you are trying not to match, but possibly something like this:

my ($test, $id, $key)=split /,\s*|\(/, $_, 4;
next unless $test =~ /PROBABLECAUSE\w*/;

You may want to fix keys that start with a capital so they match duplicates that do not. I put the processing in the loop that reads STDIN so I could test it easily by bashing in a few lines by hand. It may also be a little more efficient, as the script starts working as soon as it has its first line rather than waiting until all the input is in.

#!/use/your/bin/perl -w
use strict;
my $line_count = 0;      # counts matching lines only
my (%dup, @capkeys);     # keys seen so far, and keys with initial caps
while (<STDIN>) {
   if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
      my ($id, $key)= ($1, $2);
      # check if any key has init caps (not allowed)
      if ($key =~ m/^[A-Z]\w*/) {
         push @capkeys, "$line_count: $id - $key\n";
         # you may want to fix up caps here before you store
         # the key so Keysomething matches keysomething
      }
      if (exists $dup{$key}) {
         print "Duplicates found\n";
         # the stored line and $_ still carry their own newline
         print "$dup{$key}->[0]: $dup{$key}->[1]";
         print "$line_count: $_";
      } else {
         $dup{$key}=[$line_count, $_]; # store line number and line
      }
      $line_count++;
   }
}
print "keys with initial caps\n" if @capkeys;
foreach (@capkeys) {print}

update

Got my split suggestion a little wrong: I miscounted the fields. It should be

    my ($test, $id, undef, $key)=split /,\s*|\(/, $_, 5;
    next unless $test =~ /PROBABLECAUSE\w*/;
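
To show where the fields land, here is a small sketch of that split wrapped in a sub. The sample line and the name parse_line are made up for illustration; I am only guessing at the data format from the regexp, so adjust to taste.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($id, $key) for a PROBABLECAUSE line,
# or an empty list for anything else.
sub parse_line {
    my ($line) = @_;
    # Fields: name before '(', id, an ignored field, key, then the rest.
    my ($test, $id, undef, $key) = split /,\s*|\(/, $line, 5;
    return unless defined $test && $test =~ /PROBABLECAUSE\w*/;
    return ($id, $key);
}

# Invented sample line, not taken from the real data.
my ($id, $key) = parse_line('PROBABLECAUSE_X(42, foo, someKey, "text")');
print "id=$id key=$key\n";
```

The limit of 5 keeps everything after the key together in a fifth field, which is simply thrown away here.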

In the initial caps test, is the \w* really required, or would if ($key =~ m/^[A-Z]/) be OK? What if the entire key is in upper case, or will that never happen?
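
A quick way to check the question above: as a yes/no test the two patterns should agree, since \w* is happy to match nothing. The sub names and sample keys below are invented for the comparison.

```perl
use strict;
use warnings;

# Short form: only looks at the first character.
sub has_initial_cap      { return $_[0] =~ /^[A-Z]/     ? 1 : 0 }
# Long form from the first script: \w* can match zero characters.
sub has_initial_cap_long { return $_[0] =~ /^[A-Z]\w*/  ? 1 : 0 }

# Made-up keys: lower first letter, initial cap, all caps, single cap.
for my $key (qw(keyName KeyName KEYNAME K)) {
    printf "%-8s short=%d long=%d\n",
        $key, has_initial_cap($key), has_initial_cap_long($key);
}
```

Note that an all upper case key is flagged by both forms, so if those are legal the test needs to be stricter.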

cleaner code, now taking liberties !

Let's just fix those upper-cased keys and not trouble the poor users...

#!/your/perl -w
use strict;
my($line_count, %dup)=0;
while (<STDIN>) {
    my ($test, $id, undef, $key)=split /,\s*|\(/, $_, 5;
    next unless $test =~ /PROBABLECAUSE\w*/;
    # If I may be so bold...
    $key = lc $key;
    if (exists $dup{$key}) {
        print "Duplicates found\n";
        print "$dup{$key}->[0]: $dup{$key}->[1]";
        print "$line_count: $_";
    } else {
        $dup{$key}=[$line_count, $_]; # store No and line
    }
    $line_count++;
}

In reply to Re: I sense there is a simpler way... by Random_Walk
in thread I sense there is a simpler way... by HelgeG
