PerlMonks  

Efficiently Extracting a Range of Lines (was: Range This)

by skazat (Chaplain)
on Jul 10, 2001 at 23:21 UTC ( id://95436 )

skazat has asked for the wisdom of the Perl Monks concerning the following question:

yo yo yo! I'm finding and extracting a bunch of lines from a text file using the range operator, like so:

my $stuff = <<EOF
blah blah blah
yada yadda yadd
START
My wonderful
super duper
yummy booty bag
information
END
boring boring boring
EOF
;

my @lines = split("\n", $stuff);
my $good_stuff;
foreach (@lines) {
    if (m/START/ .. m/STOP/) {
        $good_stuff .= "$_\n";
    }
}

I'm working on optimizing my code, and was wondering if anyone has a super secret technique to extract lines between two delimiters that is faster than the technique above; I would love some insight. This kind of parsing is done many, many times in this application, so even a small optimization will pay off well,

take care,

-justin simoni
!skazat!

Replies are listed 'Best First'.
Re: Range This
by tadman (Prior) on Jul 10, 2001 at 23:56 UTC
    Stuff the splitting and gluing. Just yank the whole thing using a regex:
    my ($good_stuff) = $stuff =~ /^START$(.*?)^END$/ms;
    Don't forget that a better place for the semicolon after EOF is on the first "here" line:
    my $stuff = <<EOF;
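    Putting that one-liner into a complete, runnable sketch (the sample text is made up; note the capture keeps the newlines immediately around the body):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data; START and END each sit alone on a line.
my $stuff = <<EOF;
blah blah blah
START
My wonderful
information
END
boring boring boring
EOF

# /m lets ^ and $ match at internal line boundaries;
# /s lets .*? cross newlines.
my ($good_stuff) = $stuff =~ /^START$(.*?)^END$/ms;
print $good_stuff;
```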
      Ok, looks good,

      but how does this fare for a whole bunch o' text? I can see it for one line, or a few lines, but I may be extracting text from something that's like 100 lines; it really varies.

      I'll do some tests and find out some stuff and report.

       

      -justin simoni
      !skazat!

        You're using a "here"-style document, so it will fit in RAM, or your program won't run. If you were pulling stuff from a file external to your program, you have a few other options which might minimize memory usage, but only if the stuff you're removing is significantly larger than the stuff you're keeping.

        The regex, as shown there, can handle any string you throw at it, even if that string were several hundred meg of data. Not that I would recommend using a here document that was that large, but you get the idea.
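        To make the external-file point concrete, here is a sketch that streams line by line and only accumulates the wanted region (extract_range is a made-up helper name; the demo reads from an in-memory scalar instead of a real file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a filehandle line by line and keep only
# the region between /START/ and /END/ (inclusive), so memory use
# tracks the extracted region rather than the whole file.
sub extract_range {
    my ($fh) = @_;
    my $good_stuff = '';
    while (my $line = <$fh>) {
        if ($line =~ /START/ .. $line =~ /END/) {
            $good_stuff .= $line;
        }
    }
    return $good_stuff;
}

# Demonstrate on an in-memory "file" (a filehandle opened on a scalar ref).
my $data = "junk\nSTART\nkeep me\nEND\nmore junk\n";
open my $fh, '<', \$data or die $!;
print extract_range($fh);   # prints the START..END region only
```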
Re: Range This
by tachyon (Chancellor) on Jul 11, 2001 at 00:17 UTC

    How much do I get for speeding it up 600% ?

    <Update>

    Looks like tadman beat me to the draw :-( Drat, missed it by that much chief.

    </Update>

    When in doubt, use Benchmark. Here is a regex solution that is 6 times faster.

    use Benchmark;

    my $mine = <<'ME';
    my $stuff = <<EOF;
    boring boring boring boring boring
    boring boring boring boring boring
    boring boring boring boring boring
    START
    My wonderful
    super duper
    yummy booty bag
    information
    END
    boring boring boring boring boring
    boring boring boring boring boring
    boring boring boring boring boring
    EOF
    my ($good_stuff) = $stuff =~ m/(START.*END)/s;
    # print $good_stuff;
    ME

    my $yours = <<'YOU';
    my $stuff = <<EOF;
    boring boring boring boring boring
    boring boring boring boring boring
    boring boring boring boring boring
    START
    My wonderful
    super duper
    yummy booty bag
    information
    END
    boring boring boring boring boring
    boring boring boring boring boring
    boring boring boring boring boring
    EOF
    my @lines = split("\n", $stuff);
    my $good_stuff;
    foreach (@lines) {
        if (m/START/ .. m/END/) {
            $good_stuff .= "$_\n";
        }
    }
    # print $good_stuff;
    YOU

    # prove they both work, uncomment the prints
    # (we don't want to print when benchmarking)
    # uncomment these evals
    # eval $mine;
    # eval $yours;

    # Benchmark those suckers
    timethese(100000, {'Mine' => $mine, 'Yours' => $yours});

    C:\>perl test.pl
    Benchmark: timing 100000 iterations of Mine, Yours...
          Mine:  4 wallclock secs ( 3.90 usr +  0.00 sys =  3.90 CPU) @ 25641.03/s (n=100000)
         Yours: 24 wallclock secs (23.73 usr +  0.00 sys = 23.73 CPU) @ 4214.08/s (n=100000)
    C:\>

    So the regex solution is about 6 times faster, no doubt because it uses more C and less Perl. You can get a whole file into a scalar by setting the input record separator to undef ($/ = undef;) and then using $scalar = <FILE>;, so this is quite practical if you have the memory. This is the best way to undef the input record separator and avoid nasty surprises elsewhere in your code:

    {
        local $/;
        open FILE, "<$file";
        $everything = <FILE>;
        close FILE;
    }
    # out here the input record separator is still normal
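    The same idiom, wrapped in a small helper using three-arg open and a lexical filehandle (slurp is a made-up name; local $/ confines slurp mode to the sub):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: local $/ undefs the input record separator only
# for the duration of this sub, so line-by-line reading elsewhere in
# the caller is unaffected.
sub slurp {
    my ($file) = @_;
    local $/;                 # slurp mode, scoped to this block
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $everything = <$fh>;
    close $fh;
    return $everything;
}

print length(slurp($0)), " bytes in this script\n";
```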

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Range This
by particle (Vicar) on Jul 11, 2001 at 00:37 UTC
    1) forget split ~ why split, parse, then join?

    2) forget m// ~ when you have index

    3) generalize for more tags

    try this~

    sub find_between_tags($$$)
    {
        # pass me the string to parse, the start tag, and the end tag
        my $string_to_parse = shift;
        my $start_tag       = shift;
        my $start_pos = index $string_to_parse, $start_tag;  # get string position of start tag
        return unless $start_pos > -1;                       # return unless start tag found

        my $stop_tag = shift;
        my $stop_pos = index $string_to_parse, $stop_tag;    # get string position of end tag

        $stop_pos > -1 ?                                     # end tag found, return substring
            return substr $string_to_parse, $start_pos,      # return substring
                $stop_pos - $start_pos + length $stop_tag    # including matching tags
          : return substr $string_to_parse, $start_pos, -1;  # return substring to end
    }

    print find_between_tags($stuff, "START", "STOP");

    Update:

    thanks tachyon, for teaching me Benchmark!

    here's the lowdown on original method vs. mine:

    Benchmark: timing 100000 iterations of mine, original...
          mine:  4 wallclock secs ( 3.28 usr +  0.00 sys =  3.28 CPU) @ 30534.35/s (n=100000)
      original: 14 wallclock secs (13.12 usr +  0.00 sys = 13.12 CPU) @ 7622.53/s (n=100000)

    ~Particle

      As requested by /msg, here is a comparison of the three methods so far using Benchmark. Particle's method is only 50% slower than a regex, so it kicks a$$ too, at more than three times faster than the range method. TIMTOWTDI. For those who would like to learn a really easy cut'n'paste way to use Benchmark (like here) check out Benchmark made easy

      C:\>perl test.pl
      Benchmark: timing 100000 iterations of Mine, Particle, Yours...
            Mine:  5 wallclock secs ( 4.06 usr +  0.00 sys =  4.06 CPU) @ 24630.54/s (n=100000)
        Particle:  8 wallclock secs ( 7.20 usr +  0.00 sys =  7.20 CPU) @ 13888.89/s (n=100000)
           Yours: 25 wallclock secs (24.67 usr +  0.00 sys = 24.67 CPU) @ 4053.51/s (n=100000)
      C:\>
Re: Efficiently Extracting a Range of Lines (was: Range This)
by Hofmator (Curate) on Jul 11, 2001 at 18:34 UTC

    None of the answers so far correctly replicates the behaviour of the original code. I don't know if this is really an issue, but multiple START and END (sic! and not STOP; there's a typo in the original post) aren't matched. To do so I would suggest:

     my @good_stuff = $stuff =~ /^START\n(.*?)^END$/gms;
     # or alternatively with START and END around
     my @good_stuff = $stuff =~ /(^START$.*?^END$)/gms;
    This captures into the array @good_stuff and can then be further processed. This is quite similar to the solution from tadman. Note the /m modifier to let ^ and $ match inside the string just before and after a newline. The /s modifier ensures that .*? matches everything including newlines.
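    A quick demonstration of the /g behaviour on made-up input containing two delimited regions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $stuff = <<EOF;
junk
START
first block
END
junk
START
second block
END
junk
EOF

# In list context, /g returns every capture:
# one element per START/END region.
my @good_stuff = $stuff =~ /^START\n(.*?)^END$/gms;
print scalar(@good_stuff), " regions\n";   # prints "2 regions"
print for @good_stuff;
```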

    Another issue is the benchmarks done here in this thread. On their own they don't show much; it all depends on the structure of the real data.

    • The solution from tachyon - apart from breaking on multiple START/END combinations - is slow on something like
      $stuff = << EOF;
      blah
      START
      a few lines
      END
      lots of lots of lines
      EOF
      as it has to backtrack from the end of the string.
    • Something like tadman's solution works well on the example above but is slower than tachyon's on this
      $stuff = << EOF;
      blah
      START
      lots of lots of lines
      END
      blah
      EOF
      as the non-greedy .*? advances more slowly than the greedy .*
    So to sum up, you should always run benchmark tests on some real data to get an impression on how different methods compare. Try different test strings both for matching success and failure cases.
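    In that spirit, here is a cut-and-paste skeleton for comparing the three approaches on your own data (the sample $stuff is only a stand-in; swap in a representative chunk of real input):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in data; replace with something shaped like your real input.
my $stuff = "blah\nSTART\n" . ("line of text\n" x 100) . "END\nblah\n";

cmpthese(1000, {
    greedy     => sub { my ($g) = $stuff =~ /(START.*END)/s },
    non_greedy => sub { my ($g) = $stuff =~ /^START$(.*?)^END$/ms },
    range      => sub {
        my $g = '';
        for (split "\n", $stuff) {
            $g .= "$_\n" if m/START/ .. m/END/;
        }
    },
});
```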

    -- Hofmator

      i'm curious why you overlooked my solution. although it may not be the "best" way, my solution was created from lessons i learned from posts like Code Smarter, and Death to Dot Star!

      although my solution breaks on multiple START/STOP tags, this requirement was not specified in the question. i would add this functionality for a more general solution, but i'd also need to know if it should handle nested tags or not.

      my solution will, however, match START/STOP tags anywhere in the input stream, as was specified by the code in the original post. it will work if the STOP tag does not exist, as i got from the original data (granted this might be a typo). and it matches the behaviour of including the START/STOP tags in the results. it outputs a string instead of a list, but that is easily remedied with split, either in the return statements or outside the find_between_tags() function.

      ~Particle

        particle, I overlooked your solution on purpose ;-). That has absolutely nothing to do with the quality - all of them work fine on single START/END tags.

        I just was not sure how quick index in comparison to the regex solutions works. The other two are both regex and thus easy to compare. I thought index should be quicker than a regex on a fixed string:

        $i = index $stuff, 'START';
        # compared to
        $stuff =~ /START/;
        but the benchmark suggested otherwise. The difference might be because of the function overhead and the surrounding code in your solution. But I was too lazy to test this with some proper benchmarks ...
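        For reference, a minimal head-to-head of bare index against a bare fixed-string match (timings will vary with perl version and data shape):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);

# 'boring ' is 7 chars, so START sits at offset 7000.
my $stuff = ('boring ' x 1000) . 'START stuff END';

timethese(100_000, {
    'index' => sub { my $i  = index $stuff, 'START' },
    'regex' => sub { my $ok = $stuff =~ /START/ },
});
```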

        -- Hofmator

      hmm, I didn't specify multiple START/STOPs in the string that will be given, but in the application this is very much a possibility (and is really the norm). after all this, I actually split the string I get at the END match to make an array of these buggers.

      The test example was pretty simplified, but in the Real World use of this, different START and END patterns are used, and these START and END parts are needed later down the line.

      Thanks for all your help so far. I've found that this may not have been the crux of my script's slowness; even though this actual parsing of chunks o' text isn't very optimized, the subroutine it's housed in is never called more often than it needs to be.

      but still, converse among yourselves :)

       

      -justin simoni
      !skazat!
