http://qs321.pair.com?node_id=11115310

agnes00 has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys, I have a file with many lines of this format: .*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?(?:$). I want to find the lines that start with the same first word (the field before the first semicolon), then delete one of them from the file (the one that has a particular format). I wrote this code:
#!/usr/bin/perl
use warnings;

my $input = 'lines.txt';
open (FILE, "<", $input) or die "Can not open $input $!";
my @lines = <FILE>;
close FILE;

my @forlines = ();
foreach (@lines) {
    if ( (defined $_) and !($_ =~ /^\s*$/) ) {
        push(@forlines, $_);
    }
}

my @buffer = ();
foreach $line (@forlines) {
    $line =~ /(.*?);.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?(?:$)/;
    my $var = $1;
    foreach $line2 (@forlines) {
        $line2 =~ /(.*?);.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?;.*?(?:$)/;
        if ($line2 eq $line) { next; }
        if ($var eq $1 and $line2 =~ /(.*?);.*?;SU LI IR ST;.*?;SU LI IR ST;.*?;.*?;.*?;.*?;.*?(?:$)/) {
            push @buffer, $line2;
        }
    }
}

foreach $line (@buffer) {
    @forlines = grep {!/$line/} @forlines;
}

open(my $file, '>', $input) or die "Could not open file '$input' $!";
truncate $file, 0;
print $file @forlines;
It works, but it is very slow (20 minutes to process the file), I think because of the two nested loops. Is there a better solution for this? BR

Replies are listed 'Best First'.
Re: How to check lines that start with the same word then delete one of them
by LanX (Saint) on Apr 10, 2020 at 10:44 UTC
    Yes, just one loop and a %seen hash.

    You put $var into %seen with $seen{$var}++, and whenever it's already set you'll know that line needs to be checked.

    I'm not sure though how you want the first line to be handled, and you didn't provide test data.
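
    A minimal sketch of that idea (the sample data is taken from further down the thread; the 'SU LI IR ST' test and the assumption that the plain record comes first are guesses at your rule):

use strict;
use warnings;

my @keep;
my %seen;
while ( my $line = <DATA> ) {
    next if $line =~ /^\s*$/;            # skip blank lines
    my ($var) = split /;/, $line;        # first field, i.e. the word before the first semicolon
    # assumed rule: once a key has been seen, drop a later duplicate
    # that carries the 'SU LI IR ST' format
    next if $seen{$var}++ and $line =~ /SU LI IR ST/;
    push @keep, $line;
}
print @keep;

__DATA__
S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0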

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I don't see how a hash can do it. About the check: I test whether the first word (before the first semicolon) matches the first word of another line in the same file. Example:
      S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
      S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
      here S_FER_SCAM1_ARRESTO matches.
use strict;
use warnings;
use Test::More tests => 1;

my @in = (
    'S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0',
    'S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
);
my @want = (
    'S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0',
);

my @have;
my %seen;
for (@in) {
    /^(\w+)/;
    if (exists $seen{$1}) {
        next if (/SU LI IR ST/);
        # More code here if it doesn't match - this section not described.
    }
    $seen{$1} //= $_;
    push @have, $_;
}

is_deeply \@have, \@want, 'Arrays match';

        See also How to ask better questions using Test::More and sample data.

        agnes00:

        You can use a hash to help you decide what to do on later lines, something like this:

$ cat foo.pl
use strict;
use warnings;

# Read the input file.  Trim trailing whitespace
# and preserve the line number.
my $cnt = 0;
my @inp = map { s/\s+$//; [ ++$cnt, $_ ] } <DATA>;

print "INPUT LINES:\n";
print join(": ", @$_), "\n" for @inp;

# Process the file.  We'll keep the first record for
# each key we find and ignore all successive values
# with two exceptions:  First, we won't process a
# 'foo' record until we've handled a 'baz'.  Second,
# we won't handle a 'baz' record in the first five
# lines.
my %seen;
my @out;
for my $rLine (@inp) {
    # parse out the interesting fields
    my $line_num = $rLine->[0];
    my ($key, $val) = split /\s+/, $rLine->[1];

    # ignore keys we've already processed
    next if $seen{$key};

    # don't process 'foo' until we've handled 'baz'
    next if $key eq 'foo' and ! exists $seen{baz};

    # don't process 'baz' in the first five lines
    next if $key eq 'baz' and $line_num < 5;

    # process the line and remember the key
    push @out, $rLine->[1];
    ++$seen{$key};
}

print "\n\nOUTPUT LINES:\n";
print $_, "\n" for @out;

__DATA__
foo the
bar quick
baz red
bar fox
foo jumped
biz over
bar the
bim lazy
baz red
foo dog

        As you process your file, you record the important decisions you've made in the hash to help guide future decisions.

        In the example I cobbled together, I used three rules:

        1. Only process a 'foo' record if we've already processed a 'baz' record.
        2. Ignore 'baz' records occurring in the first five lines of the file.
        3. Otherwise, keep the first record of each type we find.

        Using these rules, when we run the program we get:

$ perl foo.pl
INPUT LINES:
1: foo the
2: bar quick
3: baz red
4: bar fox
5: foo jumped
6: biz over
7: bar the
8: bim lazy
9: baz red
10: foo dog

OUTPUT LINES:
bar quick
biz over
bim lazy
baz red
foo dog

        As you can see, we're able to handle all the rules with a single pass over the file with the help of a little bookkeeping.

        As you've guessed in your original post, the nested loop can consume quite a bit of time for a large file. So it's worthwhile to think of ways you can do your processing without having to repeatedly scan the file.

        What if you wanted to keep the *last* line starting with each key? One way would be to leave the logic the same, but to process the records in reverse order. Another way would be to change the way you handle the "seen" hash: Instead of checking whether you've processed the key or not, you could store the data you want to keep in it. That way, you can simply overwrite each record with a later record if you want, and then output them at the end. If you're keeping your data in memory, you can even come up with a method to process the data in *one* order and output the data in a *different* order to make your task simpler.
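
        For instance, a minimal sketch of the "store the data in the hash" variant (the key field and the sample data are assumptions based on your question):

use strict;
use warnings;

# Keep the *last* record for each key, but print the results in the
# order in which each key first appeared.
my (%last, @order);
while (my $line = <DATA>) {
    chomp $line;
    my ($key) = split /;/, $line;                 # first field is the key
    push @order, $key unless exists $last{$key};  # remember first-appearance order
    $last{$key} = $line;                          # a later record overwrites an earlier one
}
print "$last{$_}\n" for @order;

__DATA__
S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0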

        It's often a mistake to jump straight in and solve the problem before thinking about how to simplify things. Sometimes you'll find that a problem could easily be solved if the data came in a more convenient form or order. In those cases, it may be profitable to simply reshape or reorder the data to suit and then solve the simpler problem.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: How to check lines that start with the same word then delete one of them
by kcott (Archbishop) on Apr 10, 2020 at 18:31 UTC

    G'day agnes00,

    The following reads through the input file twice: once to gather information; a second time to generate the output. The data that's collected is small; there's no reproduction of the data already in the input file. There are no nested loops. The only regex is the /;/ in the split statement.

    You only showed two lines of sample data: this code will exclude the record with the two instances of 'SU LI IR ST' regardless of the order in which it appears. I've added other test data: I've no idea whether that's representative of your real data.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie ':all';

use constant {
    MATCH   => 0,
    CHECK1  => 2,
    CHECK2  => 4,
    FLAG    => 'SU LI IR ST',
    RECORD  => 0,
    EXCLUDE => 1,
};

my $backup = 'pm_11115310_lines_BU.txt';
my $temp   = 'pm_11115310_lines_TMP.txt';
my $input  = 'pm_11115310_lines.txt';

`cp $backup $input`;
_print_data($input);

my (%seen, %exclude);

{
    open my $fh, '<', $input;

    while (<$fh>) {
        chomp;
        ++$exclude{$.} and next unless length;
        my @parts = split /;/, $_, 6;
        my $possible_exclude = $parts[CHECK1] eq FLAG && $parts[CHECK2] eq FLAG;
        if (exists $seen{$parts[MATCH]}) {
            if ($possible_exclude) {
                ++$exclude{$.};
            }
            else {
                if ($seen{$parts[MATCH]}[EXCLUDE]) {
                    ++$exclude{$seen{$parts[MATCH]}[RECORD]};
                    $seen{$parts[MATCH]} = [$., 0];
                }
            }
        }
        else {
            $seen{$parts[MATCH]} = [$., $possible_exclude];
        }
    }

    seek $fh, 0, 0;
    $. = 0;

    open my $tmp, '>', $temp;

    while (<$fh>) {
        next if $exclude{$.};
        print $tmp $_;
    }
}

`cp $temp $input`;
_print_data($input);

sub _print_data {
    my ($file) = @_;
    print '-' x 20, " $file ", '-' x 20, "\n";
    system cat => $file;
    print '-' x (42 + length $file), "\n";
}

    Output:

-------------------- pm_11115310_lines.txt --------------------
S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;XU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;SU LI IR ST;0;XU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;XU LI IR ST;0;XU LI IR ST;1;0;TS;0;0
S_FER_SCAM3_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM4_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM5_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM5_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
---------------------------------------------------------------
-------------------- pm_11115310_lines.txt --------------------
S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;XU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;SU LI IR ST;0;XU LI IR ST;1;0;TS;0;0
S_FER_SCAM2_ARRESTO;ARRESTO;XU LI IR ST;0;XU LI IR ST;1;0;TS;0;0
S_FER_SCAM3_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM4_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
S_FER_SCAM5_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
---------------------------------------------------------------

    Do note my use of a backup file. Simply overwriting your input data is not a good move at all!
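
    If you do decide to rewrite the input file in place, a small sketch of taking a safety copy first (the file names are only placeholders):

use strict;
use warnings;
use File::Copy 'copy';    # core module

my $input = 'lines.txt';                # placeholder name
copy($input, "$input.bak")              # keep a backup before overwriting $input
    or die "Cannot back up $input: $!";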

    — Ken