skazat has asked for the wisdom of the Perl Monks concerning the following question:
yo yo yo
I'm finding and extracting a bunch of lines from a text file using the range operator like so:
my $stuff = <<EOF
blah blah blah
yada yadda yadd
START
My wonderful super duper
yummy booty bag
information
END
boring boring boring
EOF
;
my @lines = split("\n", $stuff);
my $good_stuff;
foreach(@lines){
if(m/START/ .. m/STOP/) {
$good_stuff .= "$_\n";
}
}
I'm working on optimizing my code, and was wondering if anyone has a super secret technique to extract lines between two delimiters that is faster than above technique, i would love some insight. This kind of parsing is being done many, many times in this application, so even a small optimization will pay off well,
take care,
-justin simoni
!skazat!
Re: Range This
by tadman (Prior) on Jul 10, 2001 at 23:56 UTC
|
Stuff the splitting and gluing. Just yank the whole thing
using a regex:
my ($good_stuff) = $stuff =~ /^START$(.*?)^END$/ms;
Don't forget that a better place for the semicolon after
EOF is on the first "here" line:
my $stuff = <<EOF; | [reply] [d/l] [select] |
|
Ok, looks good,
but how does this far for a whole bunch 'o text? I can see for one line, or a few lines, but I may be extracting text from something that's like 100, it really varies.
I'll do some tests and find out some stuff and report.
-justin simoni
!skazat!
| [reply] |
|
You're using a "here"-style document, so it will fit
in RAM, or your program won't run. If you were pulling
stuff from a file external to your program, you have a
few other options which might minimize memory usage, but
only if the stuff you're removing is significantly larger
than the stuff you're keeping.
The regex, as shown there, can handle any string you throw
at it, even if that string were several hundred meg of data.
Not that I would recommend using a here document that was
that large, but you get the idea.
| [reply] |
Re: Range This
by tachyon (Chancellor) on Jul 11, 2001 at 00:17 UTC
|
How much do I get for speeding it up 600% ?
<Update>
Looks like tadman beat me to the draw :-( Drat, missed it by that much chief.
</Update>
When in doubt use Benckmark. Here is a regex solution that is 6 times faster.
use Benchmark;
my $mine = <<'ME';
my $stuff = <<EOF;
boring boring boring
boring boring boring
boring boring boring
boring boring boring
boring boring boring
START
My wonderful super duper
yummy booty bag
information
END
boring boring boring
boring boring boring
boring boring boring
boring boring boring
boring boring boring
EOF
my ($good_stuff) = $stuff =~ m/(START.*END)/s;
# print $good_stuff;
ME
my $yours = <<'YOU';
my $stuff = <<EOF;
boring boring boring
boring boring boring
boring boring boring
boring boring boring
boring boring boring
START
My wonderful super duper
yummy booty bag
information
END
boring boring boring
boring boring boring
boring boring boring
boring boring boring
boring boring boring
EOF
my @lines = split("\n", $stuff);
my $good_stuff;
foreach(@lines){
if(m/START/ .. m/END/) {
$good_stuff .= "$_\n";
}
}
# print $good_stuff;
YOU
# prove they both work, uncomment the prints
# (we don't want to print when benchmarking)
# uncomment these evals
# eval $mine;
# eval $yours;
# Benchmark those suckers
timethese(100000,{'Mine' => $mine, 'Yours' => $yours});
C:\>perl test.pl
Benchmark: timing 100000 iterations of Mine, Yours...
Mine: 4 wallclock secs ( 3.90 usr + 0.00 sys = 3.90 CPU) @ 25
+641.03/s (
n=100000)
Yours: 24 wallclock secs (23.73 usr + 0.00 sys = 23.73 CPU) @ 42
+14.08/s (n
=100000)
C:\>
So the regex solution is about 6 times faster due no doubt to the fact that it uses more C and less Perl. You can get a whole file into a scalar by setting the input record separator to undef like so $/ = undef; then using $scalar = <FILE>; so this is quite practical if you have the memory. Oh this is the best way to undef the input record sepatator to avoid nasty suprises elsewhere in your code
{
local $/;
open FILE, "<$file";
$everything = <FILE>;
close FILE;
}
# out here the input record separator is still normal
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] [d/l] [select] |
Re: Range This
by particle (Vicar) on Jul 11, 2001 at 00:37 UTC
|
1) forget split ~ why split, parse, then join?
2) forget m// ~ when you have index
3) generalize for more tags
try this~
sub find_between_tags($$$) {
# pass me the string to parse, the start tag, and the end tag
my $string_to_parse = shift;
my $start_tag = shift;
my $start_pos = index $stuff, $start_tag; # get string position of s
+tart tag
return unless $start_pos > -1; # return unless start tag
+found
my $stop_tag = shift;
my $stop_pos = index $stuff, $stop_tag; # get string position of e
+nd tag
$stop_pos > -1 ? # end tag found, return su
+bstring
return substr $stuff, $start_pos, # return substring
$stop_pos - $start_pos + length $stop_tag : # including matching
+ tags
return substr $stuff, $start_pos, -1; # return substring to end
}
print find_between_tags($stuff, "START", "STOP");
Update: thanks tachyon, for teaching me Benchmark!
here's the lowdown on original method vs. mine:
Benchmark: timing 100000 iterations of mine, original...
mine: 4 wallclock secs ( 3.28 usr + 0.00 sys = 3.28 CPU) @ 30
+534.35/s (n=100000)
original: 14 wallclock secs (13.12 usr + 0.00 sys = 13.12 CPU) @ 76
+22.53/s (n=100000)
~Particle | [reply] [d/l] [select] |
|
C:\>perl test.pl
Benchmark: timing 100000 iterations of Mine, Particle, Yours...
Mine: 5 wallclock secs ( 4.06 usr + 0.00 sys = 4.06 CPU) @ 24
+630.54/s (
n=100000)
Particle: 8 wallclock secs ( 7.20 usr + 0.00 sys = 7.20 CPU) @ 13
+888.89/s (
n=100000)
Yours: 25 wallclock secs (24.67 usr + 0.00 sys = 24.67 CPU) @ 40
+53.51/s (n
=100000)
C:\>
| [reply] [d/l] [select] |
Re: Efficiently Extracting a Range of Lines (was: Range This)
by Hofmator (Curate) on Jul 11, 2001 at 18:34 UTC
|
All of the answers so far have not correctly replicated
the behaviour of the original code - I don't know if this
is really an issue but multiple START and END (sic! and not STOP - there's a typo in the original post) aren't
matched. To do so I would suggest:
my @good_stuff = $stuff =~ /^START\n(.*?)^END$/gms;
# or alternatively with START and END around
my @good_stuff = $stuff =~ /(^START$.*?^END$)/gms;
This captures into the array @good_stuff and can then be
further processed. This is quite similar to the solution
from tadman. Note the /m modifier to let ^ and $
match inside the string just before and after a newline. The
/s modifier ensures that .*? matches everything including
newlines.
Another issue are the benchmarks done here in this
thread. These don't show anything, it depends on the
structure of the real data.
So to sum up, you should always run benchmark tests on some
real data to get an impression on how different
methods compare. Try different test strings both for
matching success and failure cases.
-- Hofmator
| [reply] [d/l] [select] |
|
i'm curious why you overlooked my solution. although it may not be the "best" way, my solution was created from lessons i learned from posts like Code Smarter, and Death to Dot Star!
although my solution breaks on multiple START/STOP tags, this requirement was not specified in the question. i would add this functionality for a more general solution, but i'd also need to know if it should handle nested tags or not.
my solution will, however, match START/STOP tags anywhere in the input stream, as was specified by the code in the original post. it will work if the STOP tag does not exist, as i got from the original data (granted this might be a typo). and it matches the behaviour of including the START/STOP tags in the results. mine includes it outputs a string, instead of a list, but that is easily remedied with split either in the return statements, or to be done outside the find_between_tags() function.
~Particle
| [reply] |
|
particle, I overlooked your solution on purpose ;-). That has absolutely nothing to do with the quality - all of them work fine on single START/END tags.
I just was not sure how quick index in comparison to the regex solutions works. The other two are both regex and thus easy to compare. I thought index should be quicker than a regex on a fixed string:
$i = index $stuff, 'START';
# compared to
$stuff =~ /START/;
but the benchmark suggested otherwise. The difference might be because of the function overhead and the surrounding code in your solution. But I was too lazy to
test this with some proper benchmarks ...
-- Hofmator
| [reply] [d/l] |
|
hmm, I didn't specify multiple start/stops in the string that will be given, but in the application, this is very much a possibility (and is really the norm) of what you will find, I actually, after all this, split the string I get at the END match. to make an array of these buggers.
The test example was pretty simplified, but in the Real World use of this, different START and END patterns are used, and these START and END parts are needed later down the line.
Thanks for all your help so far, I've found that this may not have been the crux of my script slowness, even though this actual parsing of chunks 'o text isn't very optimized, the swubroutine that this is housed is never called more than it needs to.
but still, converse among yourselves :)
-justin simoni
!skazat!
| [reply] |
|
|