parsing of large files

by Anonymous Monk
on Mar 19, 2003 at 17:21 UTC ( [id://244391] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All.
I have a Perl performance problem:
I need to cut huge files (up to 100 MB, about 2 million lines; the average file is about 600k lines) into sections. A special tag appears at the start and end of each section, while the section's size is, of course, unknown.

One option is to read the file line by line, finding the start and end of each section (using regexes, saved state and so on...).
This option seems too slow considering the files' size.

Are you familiar with a Perl module that handles large files, or do you have any other ideas?

tnx,
Keren.

Replies are listed 'Best First'.
Re: parsing of large files
by TStanley (Canon) on Mar 19, 2003 at 17:32 UTC
    By setting the input record separator $/ to whatever tag marks your sections, you can slurp an entire section and treat it as a single line.
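
    A minimal sketch of that idea, assuming each section begins with a literal "<SECTION>" tag (the tag name here is only an illustration; use whatever marker the files actually contain):

    {
        local $/ = "<SECTION>";           # read one whole section per <> call
        open my $in, '<', $input_file or die "Can't open $input_file: $!";
        while ( my $section = <$in> ) {
            chomp $section;               # strips the trailing "<SECTION>" tag
            next unless $section =~ /\S/; # skip the empty record before the first tag
            # process one whole section here
        }
        close $in;
    }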

    TStanley
    --------
    It is God's job to forgive Osama Bin Laden. It is our job to arrange the meeting -- General Norman Schwartzkopf
Re: parsing of large files
by arturo (Vicar) on Mar 19, 2003 at 17:40 UTC

    You don't need to have your script "remember" all the lines in the current section (which I take it is the issue you want to solve), as long as you are judicious in your use of filehandles. My initial thought is along the following lines:

    my $outfilename = "outfilename";
    open INFILE, "$input_file" or die "Can't open $input_file: $!\n";
    while (<INFILE>) {
        if ( /section-end-marker/ ) {
            close OUTFILE;
            next;
        }
        if ( /section-start-marker/ ) {
            # generate the new $outfilename however
            open OUTFILE, "> $outfilename" or die "$!\n";
            next;
        }
        print OUTFILE;
    }

    That's very basic, but the idea is that you print to the currently open filehandle, unless you've found the section start marker, in which case you open a new output file (on that filehandle), or the section end marker, in which case you close the currently open filehandle.

    HTH

    update OK, two people have failed to notice that the code is not to be used "as is": it is a skeleton upon which to build a functioning script. I left this implicit by putting comments where there would, in an actual script, be functioning code. I now make that implicit warning explicit.

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

      Hi arturo,

      The solution you presented will overwrite the output file on each occurrence of the section start. Furthermore, writing to closed filehandles isn't a very clean solution, IMHO.

      You could try to use the magic '..' operator:

      open OUT, '>&STDOUT' or die;
      while ( <DATA> ) {
          print OUT if /start-marker/../end-marker/
                   and !/(start-marker|end-marker)/;
      }
      close OUT or die;

      __DATA__
      a
      b
      c
      start-marker
      d
      e
      end-marker
      f
      g
      start-marker
      h
      i
      end-marker
      j
      k
      This will print the lines within the markers (thus d, e, h and i in my example) but ignore the markers themselves.

      HTH,

      (update: fixed some layout issues and used the actual '..' operator instead of the '...' one!).

      -- JaWi

      "A chicken is an egg's way of producing more eggs."

        First, thanks!

        And: there is no "end-section". The end is the start of the next section, and a "start-section" is one of 5-6 different tags that implement some kind of hierarchy between sections.
        I can still parse the big file into smaller files, but in any case I need to save a data structure with the file names, section headers, etc., so it seems like double work.

        I finally solved it with the Tie::File module: I read "line by line" from the tied array and save, for each section, its name, its place in the hierarchy, and its start and end indices.
        I ended up with one read of the whole file, and then direct access to each section by its name.
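
        Roughly, the sketch below shows the shape of what I did; the header pattern and the index layout are only illustrative, not the real code:

        use Tie::File;
        use Fcntl 'O_RDONLY';

        tie my @lines, 'Tie::File', $big_file, mode => O_RDONLY
            or die "Can't tie $big_file: $!";

        my %index;                       # section name => [ start line, end line ]
        my $current;
        for my $i ( 0 .. $#lines ) {
            if ( $lines[$i] =~ /^SECTION\s+(\S+)/ ) {   # assumed header format
                $index{$current}[1] = $i - 1 if defined $current;
                $current = $1;
                $index{$current} = [ $i, $#lines ];
            }
        }

        # later: direct access to one section without re-reading the file
        my ( $start, $end ) = @{ $index{$section_name} };
        my @section = @lines[ $start .. $end ];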

        However, I was worried about the size of the tied array, but I guess it won't be bigger than 4 or 8 bytes multiplied by the number of lines. I can live with that (and correct me if I'm wrong :-)).

        thanks again!
        Keren.
