I think your two-pass approach is fine in principle. In a single-pass approach you can't know in advance whether a record will turn out to have duplicates, so you have to keep the IDs of all records in memory just in case. Whether that is feasible depends on the expected size of the file.

For a single-pass approach, I suggest making $dup{$key} an array ref that collects every ID seen for that key:

    while (<STDIN>) {
        chomp;
        if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
            ($id, $key) = ($1, $2);
            push @{$dup{$key}}, $id;    # this is modified!

            # check if any key has init caps (not allowed)
            if ($key =~ m/^[A-Z]\w*/) {
                print "Id: $id - $key\n";
            }
        }
    }

    print "Duplicated keys:\n";
    for my $key (keys %dup) {
        my ($ids, $count) = map {$_, scalar @$_} $dup{$key};
        next unless $count > 1;
        print "$key ($count)\n";
        print "Id: $_\n" for @$ids;
    }

Update: I failed to see that you read the whole file into @lines to begin with. Code modified to avoid this. In that light, my comment about single-pass versus two-pass becomes a bit irrelevant.

More

"This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:"

This confused me at first, because I didn't read your code carefully. In your original code you don't actually go through the file twice (in I/O terms): you read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also reads the file only once (in a while loop), populating a deeper data structure (a hash of arrays); then a second loop goes over the elements of that hash and prints those with more than one ID.
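To make the hash-of-arrays idea concrete, here is a minimal, self-contained sketch. The helper names group_ids_by_key and duplicated_keys and the sample records are my own inventions for illustration; the sample lines are shaped only to match the regex and may not look like your real data.

```perl
use strict;
use warnings;

# Hypothetical helper: group record IDs by key in one pass over the lines.
sub group_ids_by_key {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        # Same pattern as in the reply: capture the numeric ID and the key.
        next unless $line =~ m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/;
        push @{ $ids_for{$2} }, $1;
    }
    return \%ids_for;
}

# Hypothetical helper: keep only the keys that have more than one ID.
sub duplicated_keys {
    my ($ids_for) = @_;
    my %dups;
    for my $key ( keys %$ids_for ) {
        $dups{$key} = $ids_for->{$key} if @{ $ids_for->{$key} } > 1;
    }
    return \%dups;
}

# Made-up sample records (an assumption, not taken from the original file).
my @sample = (
    'PROBABLECAUSE(1, a, alpha)',
    'PROBABLECAUSE(2, b, beta)',
    'PROBABLECAUSE(3, c, alpha)',
    'PROBABLECAUSE(4, d, alpha)',
);

my $dups = duplicated_keys( group_ids_by_key(@sample) );
for my $key ( sort keys %$dups ) {
    my $ids = $dups->{$key};
    printf "%s (%d): %s\n", $key, scalar @$ids, join ', ', @$ids;
}
```

With the sample above this prints "alpha (3): 1, 3, 4". Memory use grows with the number of matching records, which is the trade-off mentioned at the top of this reply.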

As for writing the whole program in a single loop, that isn't possible, because you are essentially doing a group-by: the full set of IDs for a key is only known once all the input has been read. Random_Walk above cheats by assuming there can be at most one duplicate for any given textual key.
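A rough sketch of why remembering only one earlier ID per key falls short. This is a hypothetical single-loop variant written for illustration, not Random_Walk's actual code; report_pairs and the sample records are made up.

```perl
use strict;
use warnings;

# Hypothetical single-loop reporter: remembers only the most recent ID per key
# and reports a pair as soon as the key is seen again.
sub report_pairs {
    my @records = @_;    # each record is [ $id, $key ]
    my ( %last_id, @report );
    for my $rec (@records) {
        my ( $id, $key ) = @$rec;
        push @report, "$key: $last_id{$key} and $id" if exists $last_id{$key};
        $last_id{$key} = $id;
    }
    return @report;
}

# 'alpha' occurs three times in this made-up input.
print "$_\n"
    for report_pairs( [ 1, 'alpha' ], [ 2, 'beta' ], [ 3, 'alpha' ], [ 4, 'alpha' ] );
```

This prints "alpha: 1 and 3" and then "alpha: 3 and 4": the three records are reported as two separate pairs, and the loop never knows the total count for a key while it is running. Getting one grouped report per key needs the collect-then-print structure.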


In reply to Re: I sense there is a simpler way... by calin
in thread I sense there is a simpler way... by HelgeG
