some efficiency, please

by Anonymous Monk
on Apr 12, 2019 at 15:01 UTC ( #1232492=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have some text files (several hundred megabytes each) that I am processing. To simplify, I am going through the sections (think of them as paragraphs) and removing lines that are "ref n" (where n is an integer). There will be just a few hundred of these per file.

So I am just reading the whole file into memory and substituting out the offending lines. (I am actually removing the "ref n" lines, not the lines that begin with n, which are what I match in the first line of code below.)

while ($allfile =~ /^(\d+) /mg) {
    my $objectref = 'ref' . $1;
    push (@objects, $objectref);
}
for (@objects) {
    $allfile =~ s/^ +$_\n//m;
}
print $allfile;

This worked fine (although likely far from the best way to do it), taking about ten minutes or so on average. Then I found out that (rarely - maybe once every dozen files or so) some of the lines I need to remove are actually "foo ref n". No problem, I thought. I just changed the code to:

for (@objects) { $allfile =~ s/^ *(foo)? +$_\n//mn; }

Something is not working as I expected. :) I am ninety minutes into processing the first file after the code change, and there is no sign of any progress. Why is it taking so long, and how can I improve my algorithm / code? Thank you in advance.

Replies are listed 'Best First'.
Re: some efficiency, please
by haukex (Bishop) on Apr 12, 2019 at 15:08 UTC

    Is there a good reason why you need to read the entire file into memory at once? If you're doing the removal process when you read the files, then you might want to do so while reading the file line-by-line. By the way, I'm not sure how your regexes line up with the data you showed, for example you say "foo ref n", but the regex seems to say there might be spaces before the "foo"? Please show an SSCCE that includes short but representative sample input data and the expected output for that input.

    open my $fh, '<', $filename or die "$filename: $!";
    while (<$fh>) {
        next if /^(?:foo )?ref \d+\b/;
        chomp;
        # process the line, for example:
        push @lines, $_;
    }
    close $fh;
      This should better reflect what I am actually trying to do (assuming I didn't make any errors).

      #!/usr/bin/perl -w
      use strict;
      local $/=undef;
      my @objects;
      # check for basic syntax
      if ($#ARGV < 0) {
          die "Usage: program.pl file.text\n";
      }
      my $rgxpar = qr{(^begin\n(\d+)\n.*?^end$)}mos;
      open (FILNAM, '<', $ARGV[0]) or die "Can't open $ARGV[0] for reading.\n";
      my $allfile = <FILNAM>;
      close FILNAM or die "Can't close $ARGV[0] for reading.\n";
      while ($allfile =~ /$rgxpar/g) {
          my $objectref = 'ref' . $2;
          if ($1 =~ /bar/) {
              push (@objects, $objectref);
          }
      }
      for (@objects) {
          $allfile =~ s/^ *(foo)? +$_\n//mn;
      }
      open (OUTFIL, '>', "$ARGV[0].removed") or die "Can't open $ARGV[0].removed for writing.\n";
      print OUTFIL $allfile;
      close OUTFIL or die "Can't close $ARGV[0].removed for writing.\n";
        If the files are very large, you'll spend more time disk swapping than actually reading/writing.

        Make 2 passes: Record all of the "ref" numbers you want to delete in the first pass (use a hash), then reread the file, printing it out according to whether a ref value is in the hash.

        But to do this well, with multiline data, you'll have to tell us what a "paragraph" is, because it's not clear to me from your description.

        It might look something like this:

        my %ignore;
        # First pass
        while (<FH>) {
            $ignore{$1} = 1 if some_condition($_);
        }
        # Second pass
        # reset the file to the beginning
        seek FH, 0, 0;
        while (<FH>) {
            if (m/matches interesting string with (capture)/) {
                next if exists $ignore{$1};   # don't print this line
            }
            print;
        }

        The trick, of course, is some_condition;

        If it's hard to put a single paragraph into a regex, just note the signposts with flags. Something like this for the first pass:

        my $in_paragraph;
        my $bar;
        my %ignore;
        while (<FH>) {
            if (m/start of paragraph/) {
                $in_paragraph = 1;
                $bar = 0;
                next;
            }
            if (m/end of paragraph/) {
                $in_paragraph = 0;
                next;
            }
            if (m/line with bar/) {
                $bar = 1;
                next;
            }
            if (m/line with ref (\d+)/) {
                if ($in_paragraph and $bar) {
                    $ignore{$1} = 1;
                }
                next;
            }
        }

        And then something very similar to that in the 2nd pass, except printing or not printing based on your logic. (If you were very clever, you could reuse that code, with a tweak, passing a parameter for the pass number. But don't get clever until it works.)
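
        If you did eventually want to get clever, the reusable version might look roughly like the sketch below. It uses the same placeholder patterns as above, so it is only an illustration of the pass-number idea, not working code for your data:

        use strict;
        use warnings;

        # Pass 1 records refs to ignore; pass 2 prints every surviving line.
        sub scan {
            my ($fh, $pass, $ignore) = @_;
            my ($in_paragraph, $bar);
            while (<$fh>) {
                if    (m/start of paragraph/)  { $in_paragraph = 1; $bar = 0 }
                elsif (m/end of paragraph/)    { $in_paragraph = 0 }
                elsif (m/line with bar/)       { $bar = 1 }
                elsif (m/line with ref (\d+)/) {
                    $ignore->{$1} = 1 if $pass == 1 and $in_paragraph and $bar;
                    next             if $pass == 2 and $ignore->{$1};
                }
                print if $pass == 2;
            }
        }

        my ($file) = @ARGV or die "Usage: $0 file\n";
        open my $fh, '<', $file or die "$file: $!";
        my %ignore;
        scan($fh, 1, \%ignore);
        seek $fh, 0, 0 or die "seek $file: $!";
        scan($fh, 2, \%ignore);
        close $fh;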

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      Sorry, I am trying to simplify things, and likely making them more complicated. :(

      I think I can provide some sample data easier than I can change the code to make it work on the sample.

      This is what the input would look like:

      ref 1
      ref 2
      ref 3
      ref 4
      begin
      1
      end
      begin
      2
      bar
      end
      begin
      3
      end
      begin
      4
      bar
      end
      ref 5
      foo ref 6
      ref 7
      begin
      5
      end
      begin
      6
      bar
      begin
      7
      bar
      end

      So I am trying to remove only the "ref n" lines (not the n lines themselves), and only for paragraphs where "bar" appears in the paragraph. The output should look like this:

      ref 1
      ref 3
      begin
      1
      end
      begin
      2
      bar
      end
      begin
      3
      end
      begin
      4
      bar
      end
      ref 5
      begin
      5
      end
      begin
      6
      bar
      begin
      7
      bar
      end

      So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them.

        So I do (think I) need to pass through the file twice - once to find the references I want to remove, and once to actually remove them.

        I took that as a challenge ;-) This only needs a single pass by reversing both the input and output by piping it through tac, and produces your desired output:

        use warnings;
        use strict;
        die "Usage: $0 INFILE\n" unless @ARGV==1;
        my $INFILE = shift @ARGV;
        open my $ofh, '|-', 'tac' or die "tac (out): $!";
        open my $ifh, '-|', 'tac', $INFILE or die "tac $INFILE: $!";
        my ($aminblock,$prevnum,$foundstr);
        my %found;
        while (<$ifh>) {
            chomp;
            my $out=1;
            if (!$aminblock) {
                if (/^end$/) { undef $foundstr; $aminblock=1 }
                elsif (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) {
                    die "ref $1 without block?" unless exists $found{$1};
                    $out = !$found{$1};
                }
                else { die "unexpected outside of a block: $_" }
            }
            else {
                if (/^\s*(\d+)\s*$/) { $prevnum=$1 }
                elsif (/^begin$/) {
                    die "block ended without number?" unless defined $prevnum;
                    $found{$prevnum} = $foundstr;
                    undef $prevnum;
                    $aminblock=0;
                }
                else {
                    undef $prevnum;
                    if (/bar/) { $foundstr=1 }
                }
            }
            print {$ofh} $_, "\n" if $out;
        }
        close $ifh or die "tac $INFILE: ".($!||"\$?=$?");
        close $ofh or die "tac (out): ".($!||"\$?=$?");

        Although the two passes through tac might actually make that less efficient for large files. Here's a two-pass version:

        use warnings;
        use strict;
        die "Usage: $0 INFILE\n" unless @ARGV==1;
        my $INFILE = shift @ARGV;
        use constant { STATE_IDLE=>0, STATE_BEGIN=>1, STATE_INBLOCK=>2 };
        open my $fh, '<', $INFILE or die "$INFILE: $!";
        my %found;
        my $state = STATE_IDLE;
        my $curnum;
        for my $pass (1..2) {
            while (<$fh>) {
                chomp;
                my $out = 1;
                if ($state==STATE_IDLE) {
                    if (/^\s*(?:foo\s+)?ref\s+(\d+)\s*$/) { $out=!$found{$1} }
                    elsif (/^begin$/) { $state=STATE_BEGIN }
                    else { die "unexpected in state $state: $_" }
                }
                elsif ($state==STATE_BEGIN) {
                    if (/^\s*(\d+)\s*$/) { $curnum=$1; $state=STATE_INBLOCK }
                    else { die "unexpected in state $state: $_" }
                }
                elsif ($state==STATE_INBLOCK) {
                    if (/^end$/) { $state=STATE_IDLE }
                    elsif (/bar/) { $found{$curnum}=1 }
                }
                else { die "bad state $state" }
                print $_, "\n" if $pass==2 && $out;
            }
            die "unexpected state at eof: $state" unless $state==STATE_IDLE;
            seek $fh, 0, 0 or die "seek $INFILE: $!";
        }
        close $fh;

        Update: Note that these solutions don't remove ref N lines if they appear inside begin...end blocks; this was an assumption I made, but it's actually unclear what the desired behavior is in that case?

        Oops, left out an end line in the example data.

        begin
        6
        bar

        SHOULD BE:

        begin
        6
        bar
        end
Re: some efficiency, please (updated)
by AnomalousMonk (Bishop) on Apr 12, 2019 at 23:44 UTC

    Here's another way to approach the problem. It is, frankly, over-engineered, but I want to try to illustrate some general ideas I have found useful. Among them are:

    • The  MAIN block isolates all lexical variables; there are no global variables in the script.
    • Factoring regular expressions (to a rather extreme degree in this case!).
    • Dynamically building a regex from substrings extracted from the data; see haukex's article Building Regex Alternations Dynamically.

    This script runs correctly against the example input/output files posted here (and fixed here!) (update: but see Update below about use of  /o modifier in the  s/// substitution).
    File scrub_ref_1.pl:
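
    The script itself is not reproduced here. Purely as an untested sketch of the shape those points describe (a MAIN block, a dynamically built $rx_del_ref_n alternation, and the substitution discussed in the Update below, /o included), it might look something like this:

    #!/usr/bin/perl
    # Illustrative sketch only, not the original scrub_ref_1.pl.
    use strict;
    use warnings;

    MAIN: {
        die "Usage: $0 file.text\n" unless @ARGV == 1;

        my $allfile = do {
            open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
            local $/;
            <$fh>;
        };

        # Collect the numbers of all begin/end blocks that contain 'bar'.
        my @del_ref_n;
        while ($allfile =~ m{ ^ begin \n (\d+) \n (.*?) ^ end $ }xmsg) {
            push @del_ref_n, $1 if $2 =~ /bar/;
        }

        if (@del_ref_n) {
            # Build one alternation from the numbers extracted from the data.
            my $alt          = join '|', @del_ref_n;
            my $rx_del_ref_n = qr{ (?: $alt ) }xms;

            # Remove every matching "ref n" / "foo ref n" line in one pass.
            $allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
                         {}xmsgo;
        }

        open my $ofh, '>', "$ARGV[0].removed" or die "open $ARGV[0].removed: $!";
        print {$ofh} $allfile;
        close $ofh or die "close $ARGV[0].removed: $!";
    }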

    Update: For some inexplicable reason, I used the  /o modifier with the  s/// substitution in the code above. This modifier (see Regexp Quote-Like Operators in perlop) forces the regex to be compiled once and only once during execution of the script. The substitution should "properly" be
        $allfile =~ s{ ^ (?: [ ]* foo)? [ ]+ ref [ ] $rx_del_ref_n \n }
                     {}xmsg;

    As the script stands, processing only one file per invocation, the  /o modifier does no harm, but confers no benefit; the  s/// match regex is compiled and executed only once in any case. A problem arises if this code, which appears to work perfectly well, is recycled into another script that processes multiple files per invocation, a natural extension. In this case, the  $rx_del_ref_n dynamic regex compiled for the first file processed will be used for all subsequent files because the  s/// into which it is interpolated will never be re-compiled. Depending on the data being processed, this bug may be very difficult to spot!
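
    A tiny, self-contained illustration of that /o behavior (toy strings, nothing to do with the actual script):

    use strict;
    use warnings;

    # With /o the pattern is compiled from $word only once, so on the second
    # outer pass it is still "apple": "banana split" never matches, and
    # "apple pie" is (misleadingly) reported as matching 'banana'.
    # Drop the /o and each string matches its own word as expected.
    for my $word (qw(apple banana)) {
        for my $string ("apple pie", "banana split") {
            print "'$string' matches '$word'\n" if $string =~ /$word/o;
        }
    }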


    Give a man a fish:  <%-{-{-{-<

Re: some efficiency, please
by tybalt89 (Parson) on Apr 12, 2019 at 17:05 UTC

    Try this. It passes your test case :)

#!/usr/bin/perl
# https://perlmonks.org/?node_id=1232492
use strict;
use warnings;

$_ = do { local $/; <DATA> };

my @del = map /^(?=(\d+)\n)(?=.*^bar\n)/ms, /^begin\n(.*?)^end\n/gms;

my $pattern = do { local $" = '|'; qr/^\s+(foo )?ref (@del)\n/mn };
s/$pattern//gm;

print;

__DATA__
ref 1
ref 2
ref 3
ref 4
begin
1
end
begin
2
bar
end
begin
3
end
begin
4
bar
end
ref 5
foo ref 6
ref 7
begin
5
end
begin
6
bar
end
begin
7
bar
end

      Note that this also removes "ref N" when they appear inside one of the begin...end blocks; the spec is unclear whether that is desired or not. Update: The same goes for AnomalousMonk's solution. Some more test cases from the OP would be helpful here :-)

        <simplistic_answer> It passes *all* the test cases, therefore it's correct. </simplistic_answer>

        hehehe

      Thanks. There are just so many tricks I never thought of (like building the entire array into the qr statement). :)
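
      For my own notes, the qr trick turns out to be the list separator $" being localized to '|' so the whole array interpolates as an alternation; a toy version of it:

      use strict;
      use warnings;

      my @words = qw(cat dog fish);
      # $" is the separator used when an array is interpolated into a string or regex.
      my $re = do { local $" = '|'; qr/\b(?:@words)\b/ };   # same as /\b(?:cat|dog|fish)\b/
      print "matched\n" if "hot dog stand" =~ $re;          # prints "matched"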

      Would have liked to have used map, but the actual data is more like:

      ref 1
      ref 2
      1
      begin
      end
      2
      begin
      bar
      end
      ...

      with the numbers outside (right before) the beginning of each paragraph. I am sure that can be done with map, but it is a little too tricky for me at my level. :)
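
      My best guess at adapting your map line to that layout would be something like the sketch below. It is untested on the real files, and the __DATA__ is just the toy shape above, so it may well be off:

#!/usr/bin/perl
# Sketch of adapting the map line to the "number before begin" layout.
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# Each chunk is "N\nbegin\n...\nend"; keep N only if the chunk contains "bar".
my @del = map /^(?=(\d+)\n)(?=.*^bar$)/ms, /^(\d+\nbegin\n.*?^end\n)/gms;

my $pattern = do { local $" = '|'; qr/^[ ]*(?:foo )?ref (@del)\n/m };
s/$pattern//g if @del;

print;

__DATA__
ref 1
ref 2
1
begin
end
2
begin
bar
end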

      I killed what I had before after thirteen hours of CPU time, but I guess it (eventually) would have finished. I started it again with a print statement right before it removes each line from the file, and it starts out very quickly (just a few seconds per line removed), and then just keeps slowing down (after about a half-hour, it was well over a minute per line removed).

      Still don't understand why just changing the one line:

      for (@objects) { $allfile =~ s/^ +$_\n//m; }

      to:

      for (@objects) { $allfile =~ s/^ *(foo)? +$_\n//mn; }

      caused it to slow down SO much.

      Anyway, I was satisfied with the performance I had before (without the test for foo), but your method is an order of magnitude faster than that "without the test" method, and it (of course) catches the rare case when foo is there, so I am extremely grateful for that. Thanks, again.

        Still don't understand why just changing the one line... caused it to slow down SO much.
        Just off the top of my head, I suspect it's a bad interaction between the " *" and the " +". Because they're separated only by an optional element, a run of spaces can be split between them in multiple ways (no spaces/5 spaces, 1 space/4 spaces, etc.), which multiplies the number of potential matches the regex engine has to try for the full string; it keeps backtracking through them until it either finds one that's good enough or proves that no match exists. The capturing parens on foo may also be contributing.

        If you want to test this theory, you could try changing the regex to $allfile =~ s/^( *foo)? +$_\n//mn; (leaving the capturing parens intact) or $allfile =~ s/^(?: *foo)? +$_\n//mn; (non-capturing parens, since you're really only using them for grouping) and seeing if that restores the original performance.
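
        A quick way to compare the three variants in isolation, without waiting on the real files (purely synthetic data, so the numbers only show relative behavior):

        use strict;
        use warnings;
        use Benchmark qw(timethese);

        # Synthetic worst-ish case: long runs of leading spaces and no matching
        # "ref" line, so each variant has to fail after trying its ways of
        # carving up the spaces.
        my $text = ((' ' x 40) . "something else\n") x 5_000;
        my $ref  = 'ref42';

        timethese(-3, {
            'original'   => sub { (my $t = $text) =~ s/^ *(foo)? +$ref\n//mn },
            'grouped'    => sub { (my $t = $text) =~ s/^( *foo)? +$ref\n//mn },
            'noncapture' => sub { (my $t = $text) =~ s/^(?: *foo)? +$ref\n//mn },
        });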
