HelgeG has asked for the wisdom of the Perl Monks concerning the following question:
I have a large text file that contains several fields in a fixed format. Among the fields are a unique numerical ID and a text ID that ideally should be unique. The text ID should also not start with a capital letter, but sometimes does.
My script reads the file, detects IDs that start with a capital letter, and also detects if any of the textual IDs are duplicated. The numerical IDs are always unique.
I find duplicates by storing a count in a hash where the text ID is the key. After I have filled the hash, I traverse it to find counts greater than one. If such a count is found, I run through the entire file again to find the numerical IDs of the duplicates.
This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:
```perl
#!perl -w
use strict;

my(@lines,%dup, $id, $key);
chomp(@lines=<STDIN>);

# read file, put count of key in hash
print "Keys with initial caps:\n";
foreach(@lines) {
    if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
        ($id, $key)= ($1, $2);
        $dup{$key} += 1;
        # check if any key has init caps (not allowed)
        if ($key =~ m/^[A-Z]\w*/) {
            print "Id: $id - $key\n";
        }
    }
}

# check hash for duplicates, if found, display positions in file
print "Duplicated keys:\n";
foreach $key (keys %dup) {
    if ($dup{$key}>1) {
        print "$key ($dup{$key})\n";
        foreach(@lines) {
            if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+($key),/) {
                print "Id: $1\n";
            }
        }
    }
}
```

A typical line in the data file looks like this:

```
PROBABLECAUSE(0, probablecauseUndefined, undefined, Unknown, indeterminate, prim, false, "", UNIDENTIFIED, Y)
```

The values I look at are the first and the third value in the argument list.
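To make the "first and third value" extraction concrete, here is a minimal sketch that applies the same regex from the script to the sample line. The helper name `extract_ids` is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# numerical ID (first argument) and the text ID (third argument) of a
# PROBABLECAUSE line, or an empty list when the line does not match.
sub extract_ids {
    my ($line) = @_;
    return $line =~ m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/
        ? ($1, $2)
        : ();
}

# The sample line from the post.
my $sample = 'PROBABLECAUSE(0, probablecauseUndefined, undefined, '
           . 'Unknown, indeterminate, prim, false, "", UNIDENTIFIED, Y)';

my ($id, $key) = extract_ids($sample);
print "Id: $id - $key\n";
```

The second argument is matched by the uncaptured `\w*`, so only the first and third arguments end up in `$1` and `$2`.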
Update:
Using the tips I received, the solution is now cleaner and more elegant, and as an added bonus, I have learned about Perl references. Thank you, monks!
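For readers wondering what a single-pass version using references might look like, here is a minimal sketch. It is an assumption about the general approach, not the code from the replies: each text ID maps to an array reference holding every numerical ID seen for it, so duplicates and their positions fall out of one scan. The names `scan_lines` and `report` and the shortened demo lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the lines once and return
#   \%ids_for - text ID => array ref of the numerical IDs that use it
#   \@capped  - [numerical ID, text ID] pairs whose text ID starts
#               with a capital letter
sub scan_lines {
    my @lines = @_;
    my (%ids_for, @capped);
    for my $line (@lines) {
        next unless $line =~ m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/;
        my ($id, $key) = ($1, $2);
        push @{ $ids_for{$key} }, $id;    # autovivifies the array ref
        push @capped, [$id, $key] if $key =~ /^[A-Z]/;
    }
    return (\%ids_for, \@capped);
}

# Hypothetical helper: build the same two-part report as the original
# script, without rereading the input for each duplicate.
sub report {
    my ($ids_for, $capped) = scan_lines(@_);
    my $out = "Keys with initial caps:\n";
    $out .= "Id: $_->[0] - $_->[1]\n" for @$capped;
    $out .= "Duplicated keys:\n";
    for my $key (sort keys %$ids_for) {
        my @ids = @{ $ids_for->{$key} };
        next unless @ids > 1;             # more than one ID means a duplicate
        $out .= "$key (" . scalar(@ids) . ")\n";
        $out .= "Id: $_\n" for @ids;
    }
    return $out;
}

# Invented demo lines, shortened from the format shown in the post.
my @demo = (
    'PROBABLECAUSE(0, probablecauseUndefined, undefined, Unknown)',
    'PROBABLECAUSE(1, probablecauseFoo, Undefined, Unknown)',
    'PROBABLECAUSE(2, probablecauseBar, undefined, Unknown)',
);
print report(@demo);
```

The keys are sorted here only to make the output order predictable; the original script iterates over `keys %dup` in hash order. Unlike the original second pass, this sketch does not require a comma after the text ID.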
Replies are listed 'Best First'.
Re: I sense there is a simpler way...
by Random_Walk (Prior) on Aug 19, 2004 at 16:54 UTC
Re: I sense there is a simpler way...
by calin (Deacon) on Aug 19, 2004 at 17:06 UTC
by HelgeG (Scribe) on Aug 20, 2004 at 00:24 UTC
by jdporter (Paladin) on Aug 22, 2004 at 20:30 UTC
by HelgeG (Scribe) on Aug 23, 2004 at 09:43 UTC