Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

to delete a set of specific lines in a file

by sugar (Sexton)
on Dec 05, 2008 at 07:52 UTC ( [id://728215]=perlquestion: print w/replies, xml ) Need Help??

sugar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, How to delete a set of specific lines in a file which occurs twice(the only difference is .f or .r). A sample input and output is given below. sample input:
input is: A0000.f BG_c22 A000X.f BG_c5 A000X.r BG_c5 A002B.f BG_c38 A002B.r BG_c38 A003A.r BG_c38 A0082.r BG_c12 A00AS.f BG_c52 A00B9.f BG_c45 A00B9.r BG_c45 A00DK.f BG_c5 A00F0.f BG_c22 A00F0.r BG_c22 A00F3.f BG_c14 A00FX.f BG_c7 A00FX.r BG_c7 result shud be: A0000.f BG_c22 A003A.r BG_c38 A0082.r BG_c12 A00AS.f BG_c52 A00DK.f BG_c5 A00F3.f BG_c14
The code i have written is:
#!/usr/bin/perl use strict; use warnings; open(FILE,"pe_real_sample.txt") or die "cannot open"; open(OUTFILE,">last.output") or die "cannot open"; my @arr=<FILE>; close(FILE); my $prev1=0;my $prev2=0;my @dels; foreach(@arr){ my @spl=split(".(f|r)",$_); if($prev1 eq $spl[0] && $prev2 eq $spl[2]){ push(@dels,$spl[0]); } $prev1=$spl[0];$prev2=$spl[2]; } my $del= join '|', map quotemeta, @dels; my @arr1=grep !/$del/,@arr; print OUTFILE "@arr1\n";
Well, everything is perfect as far as the file is small. the sample input which i have given works. but then, wen i feed a 10MB data(which contains 3.5 million lines), the program doesnt work, i mean it takes awfully a lot of time but never stops either with a bug or correct result. What do i do? plz help. P.S: This is also with reference to the post "deleting a specific element from an array".

Replies are listed 'Best First'.
Re: to delete a set of specific lines in a file
by moritz (Cardinal) on Dec 05, 2008 at 08:49 UTC

    First of all sugar++ for giving us sample input and reference output as well as a clear description of what your problem is.

    In addition to what the others have said: You build a fairly large regex from @dels, which might slow down things (if the number of items is in the millions).

    I'd suggest to use a hash instead of @dels:

    #!/usr/bin/perl use strict; use warnings; my $prev1=0;my $prev2=0;my @dels; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my %dels; while (<$file>) { # the \. prevents . from matching any character my @spl = split(/\.[fr]/, $_); if($prev1 eq $spl[0] && $prev2 eq $spl[1]){ $dels{$spl[0]} = 1; } $prev1=$spl[0];$prev2=$spl[1]; } # reset the file cursor, read from the beginning again seek $file, 0, 0; while (<$file>) { my @spl = split(/\.[fr]/, $_); print unless $dels{$spl[0]}; } close $file;

    This version gets rid of @arr entirely, and replaced the regex and @dels with %dels.

      Going a step further, we can do this in a single pass:
      #!/usr/bin/perl use strict; use warnings; open my $file, '<', 'data.txt' or die "Can't open file: $!@"; my @prev = (); while (my $line = <$file>) { my @spl = split(/(\.[fr])/, $line); if (@prev) { if ($prev[0] ne $spl[0] || $prev[2] ne $spl[2]){ print @prev; } else { @prev = (); next; } } @prev = @spl; } print @prev; close $file;
      thank u :) Now the program takes only 2 seconds to output the desired results :)
Re: to delete a set of specific lines in a file
by dHarry (Abbot) on Dec 05, 2008 at 07:57 UTC

    That's because you slurp the entire file into memory. It's better to step through the file line-by-line. It is more scalable and gives a better performance.

Re: to delete a set of specific lines in a file
by lakshmananindia (Chaplain) on Dec 05, 2008 at 08:32 UTC
  • Read the file line-by-line. Store each line in an array
  • After getting the second line, check whether the line is already there in the array
  • If so then remove the line
  • Using grep to check, will be easier
Re: to delete a set of specific lines in a file (GOLF)
by ccn (Vicar) on Dec 05, 2008 at 10:57 UTC
    #!/usr/bin/perl -- # filter.pl use strict; use warnings; die "Usage: perl filter.pl pe_real_sample.txt > last.output\n" unless @ARGV; my %seen; my @ary; # push into @ary every line which have not be seen yet while (my $line = <>) { $line =~ /[fr] / or die "Can't parse line: $line"; my $key = $` . $'; push(@ary, [$line, $key]) unless $seen{$key}++; } # print all lines from @ary having $seen only once print $_->[0] foreach grep {$seen{$_->[1]} < 2} @ary;

    This script can be written more shortly:

    #!/usr/bin/perl -n /[fr] /, $s{$`.$'}++ || push @a, [$_,$`.$'] }{ $s{$_->[1]}<2 && print +$_->[0] for @a

    or even:

    #!/usr/bin/perl -n /f|r/,$s{$`.$'}++,push@a,[$_,$`.$']}{$s{$$_[1]}<2&&print$$_[0]for@a

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://728215]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (2)
As of 2024-04-19 18:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found