You can use a hash to help you decide what to do on later lines, something like this:

$ cat foo.pl
use strict;
use warnings;

# Read the input file.  Trim trailing whitespace
# and preserve the line number.
my $cnt = 0;
my @inp = map { s/\s+$//; [ ++$cnt, $_ ] } <DATA>;

print "INPUT LINES:\n";
print join(": ", @$_), "\n" for @inp;

# Process the file.  We'll keep the first record for
# each key we find and ignore all successive values,
# with two exceptions:  First, we won't process a
# 'foo' record until we've handled a 'baz'.  Second,
# we won't handle a 'baz' record in the first five
# lines.
my %seen;
my @out;

for my $rLine (@inp) {

    # Grab the line number we saved when reading the input
    my $line_num = $rLine->[0];

    # Parse out the interesting fields
    my ($key, $val) = split /\s+/, $rLine->[1];

    # ignore keys we've already processed
    next if $seen{$key};

    # don't process 'foo' until we've handled 'baz'
    next if $key eq 'foo' and ! exists $seen{baz};

    # don't process 'baz' in the first five lines
    next if $key eq 'baz' and $line_num <= 5;

    # process the line and remember the key
    push @out, $rLine->[1];
    ++$seen{$key};
}

print "\n\nOUTPUT LINES:\n";
print $_, "\n" for @out;
__DATA__
foo the
bar quick
baz red
bar fox
foo jumped
biz over
bar the
bim lazy
baz red
foo dog

As you process your file, you record the important decisions you've made in the hash to help guide future decisions.

In the example I cobbled together, I used three rules:

Only process a 'foo' record if we've already processed a 'baz' record.
Ignore 'baz' records occurring in the first five lines of the file.
Otherwise, keep the first record of each type we find.

Using these rules, when we run the program we get:

$ perl foo.pl
INPUT LINES:
1: foo the
2: bar quick
3: baz red
4: bar fox
5: foo jumped
6: biz over
7: bar the
8: bim lazy
9: baz red
10: foo dog


OUTPUT LINES:
bar quick
biz over
bim lazy
baz red
foo dog

As you can see, we're able to handle all the rules with a single pass over the file with the help of a little bookkeeping.

As you guessed in your original post, a nested loop can consume quite a bit of time on a large file. So it's worthwhile to think of ways you can do your processing without having to scan the file repeatedly.
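
For the plain "same first word, keep only one" case, the single-pass idea boils down to very little code. Here's a minimal sketch, assuming the key is the first whitespace-separated word on each line. The sub name dedupe_by_first_word is just something I made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first line seen for each key,
# where the key is assumed to be the first word on the line.
sub dedupe_by_first_word {
    my %seen;
    return grep {
        my ($key) = split ' ', $_;      # first word, leading whitespace ignored
        defined $key && !$seen{$key}++; # true only the first time we meet $key
    } @_;
}

print "$_\n" for dedupe_by_first_word(
    'foo the', 'bar quick', 'foo jumped', 'bar fox', 'biz over',
);
```

Each line is looked at exactly once, and the hash lookup replaces the inner loop.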

What if you wanted to keep the *last* line starting with each key? One way would be to leave the logic the same, but to process the records in reverse order. Another way would be to change the way you handle the "seen" hash: Instead of checking whether you've processed the key or not, you could store the data you want to keep in it. That way, you can simply overwrite each record with a later record if you want, and then output them at the end. If you're keeping your data in memory, you can even come up with a method to process the data in *one* order and output the data in a *different* order to make your task simpler.
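
Both approaches to keeping the *last* line might look something like the sketch below. It again assumes the key is the first word on the line, and the sub names keep_last and keep_last_reversed are made up for this example. Note that the two differ in output order: the first reports keys in the order they first appeared, the second in the order of their last appearance.

```perl
use strict;
use warnings;

# Way 1 (hypothetical helper): store the record itself in the hash,
# so a later record simply overwrites an earlier one.
sub keep_last {
    my (%last, @order);
    for my $line (@_) {
        my ($key) = split ' ', $line;
        next unless defined $key;                    # skip blank lines
        push @order, $key unless exists $last{$key}; # remember first-seen order
        $last{$key} = $line;                         # later record wins
    }
    return map { $last{$_} } @order;
}

# Way 2 (hypothetical helper): same "keep the first" logic as before,
# but applied to the records in reverse order, then flipped back.
sub keep_last_reversed {
    my %seen;
    return reverse grep {
        my ($key) = split ' ', $_;
        defined $key && !$seen{$key}++;
    } reverse @_;
}

my @lines = ('foo the', 'bar quick', 'baz red', 'bar fox', 'foo jumped');
print "keep_last:          $_\n" for keep_last(@lines);
print "keep_last_reversed: $_\n" for keep_last_reversed(@lines);
```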

It's often a mistake to jump in and start solving a problem before you've thought about how to simplify it. Sometimes you'll find that a problem could easily be solved if the data came in a more convenient form or order. In those cases, it may be profitable to simply reshape or reorder the data to suit and then solve the simpler problem.
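
As one illustration of reshaping, you could group the values under their keys first, after which "what did we see for this key?" becomes a simple lookup. This is only a sketch under the same first-word-is-the-key assumption, and group_by_key is a name I invented for it:

```perl
use strict;
use warnings;

# Hypothetical helper: reshape a flat list of "key value" lines into
# a hash of arrays, one array of values per key, in input order.
sub group_by_key {
    my %group;
    for my $line (@_) {
        my ($key, $val) = split ' ', $line, 2;
        next unless defined $key;                          # skip blank lines
        push @{ $group{$key} }, defined $val ? $val : '';  # key-only line => ''
    }
    return \%group;
}

my $group = group_by_key('foo the', 'bar quick', 'bar fox', 'foo jumped');
for my $key (sort keys %$group) {
    print "$key: ", join(', ', @{ $group->{$key} }), "\n";
}
```

With the data in this shape, the first record for a key is element 0 of its array and the last is element -1, so several of the earlier questions reduce to indexing.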

...roboticus

When your only tool is a hammer, all problems look like your thumb.

In reply to Re^3: How to check lines that start with the same word then delete one of them by roboticus
in thread How to check lines that start with the same word then delete one of them by agnes00
