to delete a set of specific lines in a file

sugar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, How to delete a set of specific lines in a file which occurs twice(the only difference is .f or .r). A sample input and output is given below. sample input:

input is:
A0000.f BG_c22
A000X.f BG_c5
A000X.r BG_c5
A002B.f BG_c38
A002B.r BG_c38
A003A.r BG_c38
A0082.r BG_c12
A00AS.f BG_c52
A00B9.f BG_c45
A00B9.r BG_c45
A00DK.f BG_c5
A00F0.f BG_c22
A00F0.r BG_c22
A00F3.f BG_c14
A00FX.f BG_c7
A00FX.r BG_c7

result shud be:
A0000.f BG_c22
A003A.r BG_c38
A0082.r BG_c12
A00AS.f BG_c52
A00DK.f BG_c5
A00F3.f BG_c14
[download]

The code i have written is:

#!/usr/bin/perl
use strict;
use warnings;
open(FILE,"pe_real_sample.txt") or die "cannot open";
open(OUTFILE,">last.output") or die "cannot open";
my @arr=<FILE>;
close(FILE);
my $prev1=0;my $prev2=0;my @dels;
foreach(@arr){
    my @spl=split(".(f|r)",$_);
    if($prev1 eq $spl[0] && $prev2 eq $spl[2]){
        push(@dels,$spl[0]);
    }
    $prev1=$spl[0];$prev2=$spl[2];
}
my $del= join '|', map quotemeta, @dels;
my @arr1=grep !/$del/,@arr;
print OUTFILE "@arr1\n";
[download]

Well, everything is perfect as far as the file is small. the sample input which i have given works. but then, wen i feed a 10MB data(which contains 3.5 million lines), the program doesnt work, i mean it takes awfully a lot of time but never stops either with a bug or correct result. What do i do? plz help. P.S: This is also with reference to the post "deleting a specific element from an array".

Comment on to delete a set of specific lines in a file Select or Download Code

Replies are listed 'Best First'.
Re: to delete a set of specific lines in a file by moritz (Cardinal) on Dec 05, 2008 at 08:49 UTC
First of all sugar++ for giving us sample input and reference output as well as a clear description of what your problem is. In addition to what the others have said: You build a fairly large regex from `@dels`, which might slow down things (if the number of items is in the millions). I'd suggest to use a hash instead of `@dels`: #!/usr/bin/perl use strict; use warnings; my $prev1=0;my $prev2=0;my @dels; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my %dels; while (<$file>) { # the \. prevents . from matching any character my @spl = split(/\.[fr]/, $_); if($prev1 eq $spl[0] && $prev2 eq $spl[1]){ $dels{$spl[0]} = 1; } $prev1=$spl[0];$prev2=$spl[1]; } # reset the file cursor, read from the beginning again seek $file, 0, 0; while (<$file>) { my @spl = split(/\.[fr]/, $_); print unless $dels{$spl[0]}; } close $file; [download] This version gets rid of `@arr` entirely, and replaced the regex and `@dels` with `%dels`.	[reply] [d/l] [select]
Re^2: to delete a set of specific lines in a file by eye (Chaplain) on Dec 05, 2008 at 09:34 UTC
Going a step further, we can do this in a single pass: `#!/usr/bin/perl use strict; use warnings; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my @prev = (); while (my $line = <$file>) { my @spl = split(/(\.[fr])/, $line); if (@prev) { if ($prev[0] ne $spl[0] \|\| $prev[2] ne $spl[2]){ print @prev; } else { @prev = (); next; } } @prev = @spl; } print @prev; close $file;` [download]	[reply] [d/l]
Re^2: to delete a set of specific lines in a file by sugar (Sexton) on Dec 05, 2008 at 09:10 UTC
thank u :) Now the program takes only 2 seconds to output the desired results :)	[reply]
Re: to delete a set of specific lines in a file by dHarry (Abbot) on Dec 05, 2008 at 07:57 UTC
That's because you slurp the entire file into memory. It's better to step through the file line-by-line. It is more scalable and gives a better performance.	[reply]
Re: to delete a set of specific lines in a file by lakshmananindia (Chaplain) on Dec 05, 2008 at 08:32 UTC
Read the file line-by-line. Store each line in an array After getting the second line, check whether the line is already there in the array If so then remove the line Using grep to check, will be easier	[reply]
Re: to delete a set of specific lines in a file (GOLF) by ccn (Vicar) on Dec 05, 2008 at 10:57 UTC
#!/usr/bin/perl -- # filter.pl use strict; use warnings; die "Usage: perl filter.pl pe_real_sample.txt > last.output\n" unless @ARGV; my %seen; my @ary; # push into @ary every line which have not be seen yet while (my $line = <>) { $line =~ /[fr] / or die "Can't parse line: $line"; my $key = $` . $'; push(@ary, [$line, $key]) unless $seen{$key}++; } # print all lines from @ary having $seen only once print $_->[0] foreach grep {$seen{$_->[1]} < 2} @ary; [download] This script can be written more shortly: #!/usr/bin/perl -n /[fr] /, $s{$`.$'}++ \|\| push @a, [$_,$`.$'] }{ $s{$_->[1]}<2 && print +$_->[0] for @a [download] or even: #!/usr/bin/perl -n /f\|r/,$s{$`.$'}++,push@a,[$_,$`.$']}{$s{$$_[1]}<2&&print$$_[0]for@a [download]	[reply] [d/l] [select]


Welcome to the Monastery
	PerlMonks