Parse string greater than 2GB

BigHoss has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parse string greater than 2GB by rjt (Curate) on Jun 30, 2013 at 01:57 UTC
`foreach my $l (split("\n", $map)) { print $l; }`
It seems to me your code is doing nothing more than printing out the input with newlines removed. You can achieve the same result by deleting all newlines in place with: `$map =~ y/\n//d;`
For me this took a few seconds on a 2GiB + 16 byte string (whereas creating the same string with the repetition operator took more than twice as long, and that was without any IO). Your approach with split runs out of memory on my 4GiB VM, because `split` generates a new list with new strings, more than doubling the memory requirement (depending on density of newlines). I strongly suspect, however, that even if it worked, the `split` would be much slower.
I also wonder if this may be an XY Problem: You say the `read` cannot be changed, and try as I might, I can't imagine why you'd want to read a huge binary file and print out everything but the newlines. If my advice doesn't hit the mark, can you give us a few more details on what it is you're doing?
`open (INFILE, "$FILE") || die "Not able to open the file: $FILE \n";`
Be careful with open. If you ever intend `$FILE` to be user-specified (and even if you don't), I'd recommend using the 3-argument open: `open INFILE, '<', $FILE or die "Not able...";` See Two-arg open() considered dangerous. I'd also use a lexical filehandle (`open my $infile, ...`) instead of `INFILE`.
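For what it's worth, here is a minimal sketch of how the two suggestions in this reply (deleting newlines with `tr`/`y` and avoiding one giant string) might fit together. It assumes the data can be read in chunks, which the OP has not confirmed. The function name `copy_without_newlines` and the 1 MiB default chunk size are made up for illustration.

```perl
use strict;
use warnings;

# Copy everything from $in to $out except newline bytes, reading
# $chunk_size bytes at a time so the whole file is never held in memory.
# Returns the number of bytes written.
sub copy_without_newlines {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # hypothetical default: 1 MiB

    my $written = 0;
    while (1) {
        my $n = read($in, my $chunk, $chunk_size);
        die "Read error: $!" unless defined $n;
        last if $n == 0;        # end of file

        $chunk =~ tr/\n//d;     # delete newlines in place, no copy
        print {$out} $chunk or die "Write error: $!";
        $written += length $chunk;
    }
    return $written;
}
```

With a lexical filehandle and 3-argument open, a call might look like `open my $infile, '<', $FILE or die "Not able to open $FILE: $!"; binmode $infile; copy_without_newlines($infile, \*STDOUT);`.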
Re: Parse string greater than 2GB by kcott (Archbishop) on Jun 30, 2013 at 02:15 UTC
G'day BigHoss,
Welcome to the monastery.
You've written 'the "read()" of the file cannot be changed.' and then, in your code, you've shown what happens when you do change it. So, please clarify what you mean; I can provide a few suggestions but, until that ambiguity is sorted out, I'm really just guessing.
The error you're getting is described in perldiag.
Reading the entire file and then looping through the output from split can be achieved more simply with code like this: `while (my $l = <INFILE>) { chomp $l; print $l; }`
You'll probably find that passing a lexical filehandle (see open) to `BigParse()` is easier than dealing with globrefs. Check the read documentation and `man wc` for discrepancies between what each considers a character and a byte to be. sysopen and sysread may be better options for dealing with your binary data.
-- Ken
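As a rough illustration of the sysopen/sysread suggestion, the sketch below counts bytes and newline bytes in a file using unbuffered reads. The function name `count_bytes_and_newlines` and the default chunk size are assumptions made for the example. Nothing here is taken from the OP's code.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);

# Count the total bytes and the bytes equal to "\n" in a file,
# using unbuffered sysread calls of at most $chunk_size bytes each.
sub count_bytes_and_newlines {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # hypothetical default: 1 MiB

    sysopen(my $fh, $path, O_RDONLY) or die "Cannot open $path: $!";

    my ($bytes, $newlines) = (0, 0);
    while (1) {
        my $n = sysread($fh, my $buf, $chunk_size);
        die "Read error on $path: $!" unless defined $n;
        last if $n == 0;                    # end of file

        $bytes    += $n;
        $newlines += ($buf =~ tr/\n//);     # tr with no replacement only counts
    }
    close $fh;
    return ($bytes, $newlines);
}
```

Because sysread works on raw bytes, the byte count here should agree with `wc -c`, which sidesteps the character-versus-byte question raised above.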
Re: Parse string greater than 2GB by thomas895 (Deacon) on Jun 30, 2013 at 02:44 UTC
In a one-liner: `$ perl -pe 's/\n//' /path/to/data`
Of course this does mean you don't get the length shown. But that is an easy fix, simply pipe it into `wc -`.
~Thomas~ "Excuse me for butting in, but I'm interrupt-driven..."
Re^2: Parse string greater than 2GB by rjt (Curate) on Jun 30, 2013 at 09:32 UTC
You wrote: `$ perl -pe 's/\n//' /path/to/data`
Your approach reads the data file named on the command line one line at a time (delimited by newlines), and searches every line in its entirety to replace one newline character before the implicit `-p` loop prints them out. One can accomplish the same thing in about half the CPU time (depending on average line length) with: `$ perl -pe chomp /path/to/data`
The OP also indicated that they have to stick with the `read()` loop, so it's worth noting that solutions like these that read line by line don't fit the problem description. (Not that I don't have some significant doubts about the problem description...)
Re: Parse string greater than 2GB by Laurent_R (Canon) on Jun 30, 2013 at 08:22 UTC
You wrote: 'Data File is binary file with embedded newline characters "\n".'
This sounds a little bit bad. If your data is really binary, then it is quite likely that some of the bytes will by accident have the value of the newline character on your system. How can you tell the difference between actual newlines and binary bytes that happen to have the value of a newline character? Reading the file line by line is probably not an option in this case. It probably does not matter too much if all you want to do is print the data, but it does if you want to do any more subtle processing.
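To make the concern concrete, here is a small made-up example (not from the OP's data): the integer 10 packed as a 32-bit big-endian value contains a byte that is indistinguishable from a newline on ASCII platforms.

```perl
use strict;
use warnings;

# Pack the integer 10 as a 32-bit big-endian value: bytes 00 00 00 0A.
my $packed = pack('N', 10);

# Count bytes equal to "\n" (0x0A on ASCII platforms); tr with an
# empty replacement list and no /d only counts, it changes nothing.
my $newline_bytes = ($packed =~ tr/\n//);

printf "%d bytes, %d of them look like a newline\n",
    length($packed), $newline_bytes;
```

Any line-oriented read or newline-stripping pass would treat that last byte as a line ending and damage the value.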
Re: Parse string greater than 2GB by swampyankee (Parson) on Jun 30, 2013 at 11:01 UTC
While posting fragments of code is nice, it's even nicer to explain what you're trying to do. From your explanation and your code fragment, it seems that you need not do anything except `open(my $input,"<",$input_file) or die "Could not open $input_file because $!\n"; while(<$input>) { print; }`
So, what's the point? Is this some way of writing od in Perl?
Information about American English usage here and here. Floating point issues? Please read this before posting. -- emc
Re: Parse string greater than 2GB by kcott (Archbishop) on Jul 04, 2013 at 01:31 UTC
I've just come across this in perl-5.19.1 > perldelta: Selected Bug Fixes which may be related to your problem. "Fixed a small number of regexp constructions that could either fail to match or crash perl when the string being matched against was allocated above the 2GB line on 32-bit systems. [RT #118175]" Note: I haven't investigated further. It may be completely unrelated.
-- Ken
Re: Parse string greater than 2GB by sundialsvc4 (Abbot) on Jul 01, 2013 at 11:23 UTC
Unless it is reasonably possible that “the single thing that you are looking for” is actually ≥ 2GB in size by itself, then you will be, one way or the other, reading it in some more conveniently-sized sections and in some suitable way dealing with the “fragments” that are left-over at the end of each read. (You move this unused portion to the start of your buffer, read more data to fill it up again, and keep going.) If you can identify a record separator to Perl (it doesn’t have to be `\n`), Perl will even do a lot of the leg-work for you, using its own buffering scheme.
One way that is sometimes useful to deal with very large static files is to memory-map them, e.g. PerlIO::mmap (or any of 64-or-so other packages I found in http://search.cpan.org using the key, “mmap.”) This technique uses the operating system’s virtual memory subsystem to do some of the dirty-work for you, by mapping a portion of the file (a movable “window” into it, of some selected-but-not-2GB size) into the process’s virtual memory address space ... this avoids copying. But you still can’t map “all of” a very large file.
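The read-a-section, carry-over-the-fragment approach described above might look something like the following sketch. The function name `for_each_record`, its parameters, and the callback interface are all invented for illustration. It assumes each individual record fits comfortably in memory.

```perl
use strict;
use warnings;

# Read $fh in chunks of $chunk_size bytes and call $callback once per
# record, where records are delimited by the string $sep. Whatever is
# left after the last complete record stays at the start of the buffer
# and is completed by the next read.
sub for_each_record {
    my ($fh, $sep, $chunk_size, $callback) = @_;

    my $buf = '';
    while (1) {
        # The 4th argument appends the new data after the leftover fragment.
        my $n = read($fh, $buf, $chunk_size, length $buf);
        die "Read error: $!" unless defined $n;
        last if $n == 0;    # end of file

        # Hand off every complete record currently in the buffer.
        while ((my $pos = index($buf, $sep)) >= 0) {
            $callback->(substr($buf, 0, $pos));
            substr($buf, 0, $pos + length($sep), '');   # drop record + separator
        }
    }

    # A final fragment with no trailing separator is still a record.
    $callback->($buf) if length $buf;
    return;
}
```

If the separator is a fixed string, setting `$/` to it and reading with `<$fh>` gets much the same effect with Perl's own buffering, as noted above. The hand-rolled loop is mainly of interest when the `read()` call itself has to stay.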

