PerlMonks  

Re: File splitting help

by monarch (Priest)
on Jan 20, 2009 at 21:47 UTC


in reply to File splitting help

You possibly want to investigate the use of the seek or sysseek functions.
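As a rough illustration of the seek idea, here is a minimal sketch that jumps near a target offset and then finds the next line boundary. The helper name next_line_boundary is hypothetical, and the filehandle is assumed to be opened in raw mode so that offsets are byte offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of the first line boundary
# at or after $target, or the file size if $target is at or past the end.
sub next_line_boundary {
    my ( $fh, $target ) = @_;
    return 0 if $target <= 0;

    my $size = -s $fh;
    return $size if $target >= $size;

    # Start one byte early so a newline sitting just before $target counts.
    seek( $fh, $target - 1, 0 ) or die "seek failed: $!";
    my $rest = <$fh>;    # consume through the next newline (or to end of file)
    return tell($fh);
}
```

Called with a target of 400MB, this would give an offset near that mark where a chunk could end without cutting a line in half. The data between boundaries still has to be copied with read and print.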

Either way you're going to have to read chunks of data and write chunks of data.

Suggestion, then: use read or sysread to pull in large chunks of the file at a time, say 64 kilobytes, into a buffer. Keep a counter of how much you have written to the current chunk file. Normally you just write the buffer to the chunk file. When the counter would exceed your chunk length (e.g. 400MB), scan backwards for the last newline character using rindex. Write the portion of the buffer up to that newline, close the chunk file, carry the remainder of the buffer over to a new chunk file, reset your chunk length counter, and continue.
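The newline step on its own might look like the sketch below. The helper name split_at_last_newline is made up for illustration. Note that rindex returns -1 (not 0) when nothing is found, so the check has to be against a negative value.

```perl
use strict;
use warnings;

# Hypothetical helper: split a buffer at its last newline.
# Returns (head, tail). head ends with the newline, tail is what follows it.
# If the buffer holds no newline, head is empty and tail is the whole buffer.
sub split_at_last_newline {
    my ($buffer) = @_;
    my $pos = rindex( $buffer, "\n" );    # -1 when no newline is present
    return ( '', $buffer ) if $pos < 0;
    return ( substr( $buffer, 0, $pos + 1 ), substr( $buffer, $pos + 1 ) );
}

my ( $head, $tail ) = split_at_last_newline("line one\nline two\npartial");
print "head: [$head]\ntail: [$tail]\n";
```

The head goes to the chunk file being closed and the tail becomes the start of the next one.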

Some pseudo-code (this is _not_ Perl):

    chunknum = 0;
    buffer = "";
    while ( ! eof( FIN ) ) {
        chunklen = 0;
        open( FOUT, ">chunk" . chunknum++ );
        # read into buffer, but at end of buffer in case of leftovers
        while ( len = read( FIN, buffer, 64000, length(buffer) ) ) {
            if ( chunklen + length(buffer) > 400MB ) {
                # got to end of chunk, deal with newline
                lastnewline = rindex( buffer, "\n" );
                if ( lastnewline >= 0 ) {
                    # flush up to and including the last found newline
                    write( FOUT, substr( buffer, 0, lastnewline + 1 ) );
                    substr( buffer, 0, lastnewline + 1 ) = "";
                    last; # skip to next file
                }
                else {
                    # flush entire buffer (no newline found)
                    write( FOUT, buffer );
                    chunklen += length(buffer);
                    buffer = "";
                }
            }
            else {
                # not at end of chunk, just write buffer
                write( FOUT, buffer );
                chunklen += length(buffer);
                buffer = "";
            }
        } # while we've got something to read
        if ( eof( FIN ) ) {
            # end of input: flush any leftovers into the current chunk
            write( FOUT, buffer );
            buffer = "";
        }
        close( FOUT );
    } # while not at eof of input

Update: changed the read to append at the end of the buffer, and made sure the chunk file is closed when done.
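For anyone who wants something runnable, here is one way the pseudo-code above could be turned into Perl. It is a sketch, not tested on large files. The sub name split_file, its parameters, and the output naming (prefix followed by the chunk number) are my own choices for illustration. Chunks can run over the limit by up to one buffer's worth, or further if no newline turns up.

```perl
use strict;
use warnings;

# Hypothetical sub: split $in_path into files "${prefix}0", "${prefix}1", ...
# of roughly $chunk_max bytes, breaking after a newline where one is available.
# Returns the number of chunk files written.
sub split_file {
    my ( $in_path, $prefix, $chunk_max, $read_size ) = @_;
    $read_size ||= 64 * 1024;

    open( my $in, '<:raw', $in_path ) or die "Cannot open $in_path: $!";

    my ( $out, $chunknum, $chunklen ) = ( undef, 0, 0 );
    my $buffer = '';

    # Write data to the current chunk, opening a new chunk file if needed.
    my $emit = sub {
        my ($data) = @_;
        return unless length $data;
        if ( !$out ) {
            my $name = $prefix . $chunknum++;
            open( $out, '>:raw', $name ) or die "Cannot open $name: $!";
        }
        print {$out} $data or die "Write failed: $!";
        $chunklen += length $data;
    };

    # Close the current chunk and reset the length counter.
    my $finish = sub {
        return unless $out;
        close($out) or die "Close failed: $!";
        undef $out;
        $chunklen = 0;
    };

    # Append each read to the end of the buffer so leftovers are kept.
    while ( read( $in, $buffer, $read_size, length($buffer) ) ) {
        if ( $chunklen + length($buffer) >= $chunk_max ) {
            my $nl = rindex( $buffer, "\n" );
            if ( $nl >= 0 ) {
                # 4-arg substr removes the head from the buffer and returns it
                $emit->( substr( $buffer, 0, $nl + 1, '' ) );
                $finish->();
            }
            else {
                # no newline in the buffer: write it all and keep going
                $emit->($buffer);
                $buffer = '';
            }
        }
        else {
            $emit->($buffer);
            $buffer = '';
        }
    }

    $emit->($buffer);    # leftovers after the last read
    $finish->();
    close($in);
    return $chunknum;
}
```

Something like split_file( 'big.log', 'chunk', 400 * 1024 * 1024 ) would then produce chunk0, chunk1 and so on (the input file name here is just a placeholder).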
