Re^3: How to split big files with Perl ?

by Anonymous Monk
on Dec 28, 2014 at 04:22 UTC


in reply to Re^2: How to split big files with Perl ?
in thread How to split big files with Perl ?

Thanks for taking the time to update. Some points to review:

  • Calling split_file recursively means that your stack will fill up as the number of chunks goes up. You've got one buffer per sub call, so that's probably the source of the memory usage and slowdown you reported.
  • Your algorithm/logic, even though it works, is confusing and can actually go wrong: right after you read from the file, you use $iterator to determine whether to call split_file again - I think you need to look at $len first. Keeping a running count of the bytes written to the current chunk and comparing it to the desired chunk size might be better; a sketch of that approach follows this list. Also, inside the while(1) loop, you don't seem to consider what happens after the call to split_file - the loop keeps going! In fact, if the file being split is exactly divisible by the chunk size, you create one final .splitNNN file that is empty.
  • This is not correct: open my $fh, '<', $_ || die "cannot open $_ $!";, since it gets parsed as open(my $fh, '<', ($_ || die("cannot open $_ $!"))); (you can see this by running perl -MO=Deparse,-p -e 'open my $fh, "<", $_ || die "cannot open $_ $!";'). Either write open my $fh, '<', $_ or die "cannot open $_ $!"; (or has lower precedence) or write open( my $fh, '<', $_ ) || die "cannot open $_ $!";
  • You're still not checking the return value of read, which is undef on error.
  • The code could also use a bit of cleanup. Just a couple of examples: The name $split_fh is a bit confusing, and you could append $num to it right away. In split_file you set $iterator = 0; but then don't use it in the recursive call to split_file.
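
Untested, and a sketch rather than a drop-in fix - names like $chunk_size and $read_size and the exact output naming are my assumptions, not your code - but a single-loop version of the idea might look roughly like this:

    use strict;
    use warnings;

    sub split_file {
        my ($file, $chunk_size) = @_;
        my $read_size = 1024 * 1024;   # read 1 MB at a time (assumed)
        open my $in, '<:raw', $file or die "cannot open $file: $!";
        my ($num, $written, $out) = (0, 0, undef);
        while (1) {
            my $len = read $in, my $buf, $read_size;
            die "read failed on $file: $!" unless defined $len;  # undef = error
            last if $len == 0;                                   # 0 = end-of-file
            if ( !defined $out or $written >= $chunk_size ) {    # time for a new chunk
                close $out if defined $out;
                $num++;
                my $name = sprintf '%s.split%03d', $file, $num;
                open $out, '>:raw', $name or die "cannot open $name: $!";
                $written = 0;
            }
            print $out $buf or die "write to chunk failed: $!";
            $written += $len;
        }
        close $out if defined $out;
        close $in;
        return;
    }

Note that a new chunk is only opened once there is data to write, so the empty trailing .splitNNN file can't happen, and that chunks can overshoot $chunk_size by up to one read unless $read_size divides it evenly.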

I think this might be one of those situations where it would make sense to take a step back and try to work the best approach out without a computer - how would you solve this problem on paper?

But anyway, I am glad you took the time to work on and test your code! Tested code is important for a good post.

Replies are listed 'Best First'.
Re^4: How to split big files with Perl ?
by james28909 (Deacon) on Dec 28, 2014 at 06:49 UTC
    Yeah, memory management is not something I am sure about. Perl is my first language and so far it is the only language I use. The significant slowdown can be fixed by using a small value as the read length, but that does not output fast enough. There is still a lot I am not completely positive about, like when you say "your stack will fill up", do you mean the memory?

    As for the logic, it is pretty straightforward (or so I thought ;) ): the iterator is what actually sets the size at which you want to split the file, so doubling it will actually make it split the file into 4 GB chunks, and once the iterator hits its mark, it calls the sub again, until $buf != read length (which was the only way I knew of to check for eof).

    If you set the iterator to a higher value you of course need to adjust the read length of $buf. With that said, what would be a better way to check $buf for end of file? And thanks for pointing all this out to me :)

      Other people have explained the concepts elsewhere; for example, one place to start is Wikipedia: see stack and recursion. But the (very) simplified idea in this case is this: when a sub foo calls a sub bar, the state of foo has to be saved somewhere (the stack) so that when bar returns, foo can continue where it left off. This is true for every sub call, even when a sub calls itself (that's recursion). So every time split_file is called, a new $buf variable is kept on the stack, taking up memory. The alternative approach is to not use recursion, and instead do everything in a single loop.
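
      A toy example (not your code) of what "one buffer per call" means:

          # Each call gets its own $buf; none of them can be freed until
          # the innermost call returns, so all the buffers are live at once.
          sub countdown {
              my ($n) = @_;
              my $buf = 'x' x 1024;          # per-call memory
              countdown( $n - 1 ) if $n > 0;
          }
          countdown(1000);                   # 1000 nested frames, 1000 buffers

      (With warnings enabled, Perl will even complain about "Deep recursion" here.)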

      See the documentation of read: it returns zero when the end-of-file is reached. There's also the eof function, but that's rarely needed since usually the return value of read is enough. There is also one more thing to know: In some cases, like when reading from a serial port or network connection, it's possible for read to return less than the requested number of bytes without it always meaning end-of-file or an error. But that case is extremely unlikely for reading files from a disk (maybe impossible, I'm not sure on the internals there).

      Anyway, the way I would think about the algorithm is this: the central thing in the program is the number of bytes written to each chunk. read returns the number of bytes read, and therefore the number of bytes to be written to the current file, so that's what we use to keep track of how far into the current chunk we are, and we base the decision of whether to start a new chunk on that count. You would also need to cover the cases of read returning undef (read error) and read returning zero (end-of-file).
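
      In code, the skeleton of that decision might look like this (just a sketch; $READ_SIZE, $in and $written are placeholder names):

          while (1) {
              my $len = read $in, my $buf, $READ_SIZE;
              die "read error: $!" unless defined $len;  # undef -> read error
              last if $len == 0;                         # 0     -> end-of-file
              # ... write the $len bytes in $buf to the current chunk,
              #     starting a new chunk first if $written has reached
              #     the desired chunk size ...
              $written += $len;
          }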
