I think TedPride has the right idea. Setting $INPUT_RECORD_SEPARATOR to undef (undef $/) causes the while (<IN>) statement to slurp the entire three-plus-gigabyte file at once. From the docs:
"Entirely undefining $/ makes the next line input operation slurp in the remainder of the file as one scalar value"
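To make the quoted behavior concrete, here is a minimal sketch of slurping a whole file by locally undefining $/. The subroutine name slurp and the file path are my own illustration, not anything from the thread; local restores $/ when the sub returns, so line-oriented reads elsewhere are unaffected.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into one scalar.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;               # undef $/ => next read goes to EOF
    my $contents = <$fh>;   # one read returns the whole file
    close $fh;
    return $contents;
}
```

This is exactly why slurping a 3+ GB file is a problem: the whole thing lands in one scalar in memory.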
Set $/ to some reasonable size instead, one that satisfies your memory limitations. Then you can match blocks of the file in much the same way TedPride describes. Something like this:
$/ = \2048;                  # read 2K blocks instead of lines
my $blocksz = 2048;
open IN,  '<',  'tmp'      or die "Cannot open tmp: $!";
open OUT, '>>', $ARGV[1]   or die "Cannot open $ARGV[1]: $!"; # appending is more efficient when parsing more than one block

# Extract the necessary data bits
my $block01 = <IN>;
my $final;
while (my $block02 = <IN>) {
    # Join two blocks so a pattern spanning the boundary can still match
    my $block = $block01 . $block02;
    $block =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g;
    # $final = pack("B*", $block);                # this was wrong
    $final = pack("B*", substr($block, 0, $blocksz)); # convert the lower block back to binary
    print OUT $final;
    $block01 = substr($block, -$blocksz);         # move the upper block down
}
$final = pack("B*", $block01);  # pack the final carried block
print OUT $final;
close OUT;
close IN;
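The trick the listing depends on is that assigning a reference to an integer to $/ switches readline from line mode to fixed-size record mode. A minimal self-contained sketch (subroutine name and file path are mine, chosen for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: read a file as fixed-size records.
sub read_blocks {
    my ($path, $size) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = \$size;      # each <> now returns at most $size bytes
    my @blocks;
    while (my $chunk = <$fh>) {
        push @blocks, $chunk;
    }
    close $fh;
    return @blocks;
}
```

The last record may be shorter than $size when the file length is not an exact multiple, which is why the main loop above has to flush the carried block after the while.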
Update: corrected a couple of lines in the code.
PJ
use strict; use warnings; use diagnostics;