parsing of large files

by Anonymous Monk
on Mar 19, 2003 at 17:21 UTC ( [id://244391] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All.
I have a Perl performance problem:
I need to cut huge files (up to 100 MB, about 2 million lines; the average file is about 600k lines) into sections. A special tag appears at the start and end of each section, while the section's size is, of course, unknown.

One option is to read the file line by line, finding the start and end of each section (using regexes, saved state and so on...).
This option seems too slow considering the files' size.

Are you familiar with a Perl module that handles large files, or do you have any other ideas?

tnx,
Keren.

Replies are listed 'Best First'.
Re: parsing of large files
by TStanley (Canon) on Mar 19, 2003 at 17:32 UTC
    By setting the input record separator $/ to whatever tag marks your sections, you can slurp an entire section and treat it as a single line.
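
    A minimal sketch of that idea, assuming each section begins with a literal "<SECTION>" tag (the tag name here is only an illustration; use whatever marker the files actually contain):

    {
        local $/ = "<SECTION>";           # read one whole section per <> call
        open my $in, '<', $input_file or die "Can't open $input_file: $!";
        while ( my $section = <$in> ) {
            chomp $section;               # strips the trailing "<SECTION>" tag
            next unless $section =~ /\S/; # skip the empty record before the first tag
            # process one whole section here
        }
        close $in;
    }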

    TStanley
    --------
    It is God's job to forgive Osama Bin Laden. It is our job to arrange the meeting -- General Norman Schwartzkopf
Re: parsing of large files
by arturo (Vicar) on Mar 19, 2003 at 17:40 UTC

    You don't need to have your script "remember" all the lines in the current section (which I take it is the issue you want to solve), as long as you are judicious in your use of filehandles. My initial thought is along the following lines:

    my $outfilename = "outfilename";
    open INFILE, "$input_file" or die "Can't open $input_file: $!\n";
    while (<INFILE>) {
        if ( /section-end-marker/ ) {
            close OUTFILE;
            next;
        }
        if ( /section-start-marker/ ) {
            # generate the new $outfilename however
            open OUTFILE, "> $outfilename" or die "$!\n";
            next;
        }
        print OUTFILE;
    }

    That's very basic, but the idea is that you print to the currently open filehandle, unless you've found the section start marker, in which case you open a new output file (on that filehandle), or the section end marker, in which case you close the currently open filehandle.

    HTH

    update OK, two people have failed to notice that the code is not to be used "as is": it is a skeleton upon which to build a functioning script. I left this implicit by putting comments where there would, in an actual script, be functioning code. I now make that implicit warning explicit.

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

      Hi arturo,

      The solution you presented will overwrite the output file on each occurrence of the section start. Furthermore, writing to closed filehandles isn't a very clean solution, IMHO.

      You could try to use the magic '..' operator:

      open OUT, '>&STDOUT' or die;
      while ( <DATA> ) {
          print OUT if /start-marker/../end-marker/
                   and !/(start-marker|end-marker)/;
      }
      close OUT or die;

      __DATA__
      a
      b
      c
      start-marker
      d
      e
      end-marker
      f
      g
      start-marker
      h
      i
      end-marker
      j
      k
      This will print the lines within the markers (thus d, e, h and i in my example) but ignore the markers themselves.

      HTH,

      (update: fixed some layout issues and used the actual '..' operator instead of the '...' one!).

      -- JaWi

      "A chicken is an egg's way of producing more eggs."

        First, thanks!

        And: there is no "end-section". The end is the start of the next section, and a "start-section" is one of 5-6 different tags that implement some kind of hierarchy between sections.
        I can still parse the big file into smaller files, but in any case I need to save a data structure with the file names, section headers, etc., so it seems like double work.

        I finally solved it with the Tie::File module: I read "line by line" from the tied array and save, for each section, its name, its place in the hierarchy, and its start and end indices.
        I ended up with one read of the whole file, and then direct access to each section by its name.
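
        Roughly, the sketch below shows the shape of what I did; the header pattern and the index layout are only illustrative, not the real code:

        use Tie::File;
        use Fcntl 'O_RDONLY';

        tie my @lines, 'Tie::File', $big_file, mode => O_RDONLY
            or die "Can't tie $big_file: $!";

        my %index;                       # section name => [ start line, end line ]
        my $current;
        for my $i ( 0 .. $#lines ) {
            if ( $lines[$i] =~ /^SECTION\s+(\S+)/ ) {   # assumed header format
                $index{$current}[1] = $i - 1 if defined $current;
                $current = $1;
                $index{$current} = [ $i, $#lines ];
            }
        }

        # later: direct access to one section without re-reading the file
        my ( $start, $end ) = @{ $index{$section_name} };
        my @section = @lines[ $start .. $end ];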

        However, I was worried about the size of the tied array, but I guess it won't be bigger than 4 or 8 bytes multiplied by the number of lines. I can live with that (and correct me if I'm wrong :-)).

        thanks again!
        Keren.
