in reply to How to split big files with Perl ?

Get the length of the file, divide that by how many times you want to split it, then read it into a buffer and write it to a file :)
use strict;
use warnings;

open my $fh, '<', 'filename.dat';
binmode($fh);
my $len          = -s $fh;
my $split_length = $length / 5;      # would split 10gb into 2gb chunks
my $split_fh     = $fh . 'split';    # creates 'filename.split'
my $num          = '1';

for ( 1 .. 5 ) {
    read $fh, $buf, $split_length;
    open my $out_file, '>', $split_fh . $num;    # creates '$filename.split000, 001, 002' etc.
    binmode($out_file);
    print $out_file, $buf;
    close($out_file);
    $num++;
}
close($fh);
I am sure there are other ways to do it. It is completely untested code.

Replies are listed 'Best First'.
Re^2: How to split big files with Perl ?
by Anonymous Monk on Dec 26, 2014 at 18:48 UTC

    I'm sorry but this is really not good.

    Aside from the fact that it doesn't compile, what is my $split_fh = "$fh" . 'split'; supposed to do? print $buf $outfile; or opening $out_file in read mode are pretty obvious errors. No error handling on open or read is also not great.

    Do you think that reading 2GB of the input file into memory at a time is a very efficient way to go about it?

    What happens when the size of the file is not exactly divisible by 5?

      Well, honestly, it was like I said: purely untested code, just for an example. I did not intend for it to be a copy-and-paste example. All it does is read the file, then make another file on the fly, appending 001++ to the name; that's all. I will, however, revise it and make corrections so users can copy and paste it.
      I stand corrected; it will take more than what I posted to split it up. What I was considering was taking a 10GB file and splitting it into exact 4GB chunks. I think that would require reading 1 byte at a time and writing the buffer to the output file until a counter reaches the 4GB limit. That way it is not filling the memory with all this data at one time and would work smoothly. I'll see what I can cook up.

        So perhaps you should adopt a policy of testing before posting?

        /me thinks so.




        This code has too many obvious issues to be left without comment.

        Actually too many to be even listed.

        That's not a bashing - it's a warning for everybody who tries to copy that.

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

        I see you've updated your original node, but haven't marked it as updated. Please don't do that since now some of the replies don't make sense anymore, see here.

        This is still not correct: my $split_fh = $fh . 'split'; #creates 'filename.split' - don't use $fh, the filehandle, use a variable containing the filename here. Also print $out_file, $buf; - remove the comma.

        Reading the file one byte at a time is a slightly better approach than 2GB at a time, but probably still not very efficient - a buffer size of at least a few kilobytes will probably get you better performance.

        Pseudocode is fine to demonstrate a concept; unfortunately, your concept needs some improvements to be practical, as mentioned above.
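        To make the two fixes above concrete, here is a minimal, untested sketch (the filename is a placeholder):

        use strict;
        use warnings;

        my $filename = 'filename.dat';    # placeholder name, for illustration only
        open my $fh, '<', $filename or die "cannot open $filename: $!";
        binmode($fh);

        my $split_name = $filename . '.split';    # build output names from the filename, not from $fh
        open my $out_file, '>', $split_name . '000'
            or die "cannot open ${split_name}000: $!";
        binmode($out_file);

        my $len = read( $fh, my $buf, 8 * 1024 ); # a few KB at a time, not 1 byte
        die "read failed: $!" unless defined $len;
        print $out_file $buf;                     # filehandle, then data: no comma in between
        close($out_file);
        close($fh);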

        I will, however, revise it and make corrections so users can copy and paste it.

        I think that is an excellent exercise!

Re^2: How to split big files with Perl ?
by james28909 (Deacon) on Dec 27, 2014 at 08:31 UTC

    This works much better :)

    This splits the file into 2GB chunks. I have tested it on about 25-30 ISOs I have stored on my PC and it works great, though sometimes writing performance is a little slow. You can also change how many GB you want to split into by changing the iterator's limit.
    use strict;
    use warnings;

    files();

    sub files {
        foreach (@ARGV) {
            print "processing $_\n";
            open my $fh, '<', $_ || die "cannot open $_ $!";
            binmode($fh);
            my $num      = '000';
            my $iterator = 0;
            split_file( $fh, $num, $_, $iterator );
        }
    }

    sub split_file {
        my ( $fh, $num, $name, $iterator ) = @_;
        my $split_fh = "$name" . '.split';
        open( my $out_file, '>', $split_fh . $num )
            || die "cannot open $split_fh$num $!";
        binmode($out_file);
        while (1) {
            $iterator++;
            my $buf;
            read( $fh, $buf, 32 );
            print( $out_file $buf );
            my $len = length $buf;
            if ( $iterator == 67108864 ) {    # split into 2gb chunks
                $iterator = 0;
                $num++;
                split_file( $fh, $num, $name );
            }
            elsif ( $len !~ "32" ) {
                last;
            }
        }
    }
    Works pretty quickly! It split almost 5GB in 4.4333 minutes. I do see a decrease in performance sometimes, though other times it writes very quickly. Go ahead and test it on one of your ISOs. What would be the most efficient read/write buffer size?

      The most efficient block size will depend on lots of things, but the memory page size of your OS will likely be the most significant. 32 bytes is way too small, I'd start with 4k or 8k and go up from there. Why not try several different multiples of 4K and see which one works best for you?
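      If you want to measure it, something like this rough, untested sketch (with a placeholder filename) would compare read speeds across buffer sizes:

      use strict;
      use warnings;
      use Time::HiRes qw(gettimeofday tv_interval);

      my $file = 'test.iso';    # placeholder: point this at one of your big files
      for my $bufsize ( map { $_ * 4 * 1024 } 1, 2, 4, 8, 16 ) {    # 4K up to 64K
          open my $in, '<', $file or die "cannot open $file: $!";
          binmode($in);
          my $t0 = [gettimeofday];
          my $buf;
          1 while read( $in, $buf, $bufsize );    # read the whole file, discarding data
          printf "%6d-byte reads: %.3f s\n", $bufsize, tv_interval($t0);
          close($in);
      }

      Keep in mind the OS's disk cache will make later runs faster than the first, so compare repeated runs.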

      Also, read returns the number of bytes actually read so there's really no need to use length.

      my $len = read($in,$buf,4*1024); ...

      And $len is an integer so it would be better to use the numeric not equal '!=' rather than the pattern match operator.
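      Putting those two points together, the copy loop might look something like this untested sketch (filenames made up):

      use strict;
      use warnings;

      open my $in,  '<', 'in.dat'  or die "cannot open in.dat: $!";
      open my $out, '>', 'out.dat' or die "cannot open out.dat: $!";
      binmode($_) for $in, $out;

      my $buf;
      while (1) {
          my $len = read( $in, $buf, 4 * 1024 );
          die "read error: $!" unless defined $len;    # undef means the read failed
          last if $len == 0;                           # 0 means end of file
          print $out $buf;                             # $len may be short on the last read
      }
      close($_) for $in, $out;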

        Well, I did turn up the speed some, but I would watch my memory fill up as it was running. It would punch out a 2GB file in no time (less than 10 seconds or so), but then I would see a dramatic slowdown, as in it would only be writing KB/s instead of MB/s. I will try your suggestion as well, thanks.

        Also, $len = length $buf; gets the length of $buf, which is later checked to make sure it is the same size as the read length. If it is not the same size, then that is more than likely the end of the file. I do need to figure out a better way to check for end of file, actually.

      Thanks for taking the time to update. Some points to review:

      • Calling split_file recursively means that your stack will fill up as the number of chunks goes up. You've got one buffer per sub call, so that's probably the source of the memory usage and slowdown you reported.
      • Your algorithm/logic, even though it works, is confusing and can actually go wrong: right after you read from the file, you use $iterator to determine whether to call split_file again, but I think you need to look at $len first. Keeping a running count of the bytes written to the current chunk and comparing it to the desired chunk size might be better; see the sketch after this list. Also, inside the while(1) loop, you don't seem to consider what happens after the call to split_file: the loop keeps going! In fact, if the file being split is exactly divisible by the chunk size, you create one final .splitNNN file that is empty.
      • This is not correct: open my $fh, '<', $_ || die "cannot open $_ $!";, since it gets parsed as open(my $fh, '<', ($_ || die("cannot open $_ $!"))); (you can see this by running perl -MO=Deparse,-p -e 'open my $fh, "<", $_ || die "cannot open $_ $!";'). Either write open my $fh, '<', $_ or die "cannot open $_ $!"; (or has lower precedence) or write open( my $fh, '<', $_ ) || die "cannot open $_ $!";
      • You're still not checking the return value of read, which is undef on error.
      • The code could also use a bit of cleanup. Just a couple of examples: The name $split_fh is a bit confusing, and you could append $num to it right away. In split_file you set $iterator = 0; but then don't use it in the recursive call to split_file.
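      Putting those points together, here is one way the loop could be restructured - an untested sketch, not drop-in code (the chunk and buffer sizes are just examples):

      use strict;
      use warnings;

      my $chunk_size = 2 * 1024 * 1024 * 1024;    # 2GB per chunk
      my $buf_size   = 64 * 1024;

      foreach my $name (@ARGV) {
          print "processing $name\n";
          open my $fh, '<', $name or die "cannot open $name: $!";    # 'or', not '||'
          binmode($fh);
          my $num     = '000';
          my $written = 0;       # running count of bytes in the current chunk
          my $out;
          while (1) {
              my $len = read( $fh, my $buf, $buf_size );
              die "cannot read $name: $!" unless defined $len;    # undef means error
              last if $len == 0;                  # look at $len first: end of file
              unless ($out) {                     # open a chunk only once there is data for it
                  open $out, '>', "$name.split$num" or die "cannot open $name.split$num: $!";
                  binmode($out);
              }
              print $out $buf;
              $written += $len;
              if ( $written >= $chunk_size ) {    # chunk full: start a new one next time
                  close($out);
                  undef $out;
                  $written = 0;
                  $num++;
              }
          }
          close($out) if $out;                    # no empty trailing .splitNNN file
          close($fh);
      }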

      I think this might be one of those situations where it would make sense to take a step back and try to work the best approach out without a computer - how would you solve this problem on paper?

      But anyway, I am glad you took the time to work on and test your code! Tested code is important for a good post.

        Yeah, memory management is not something I am sure about. Perl is my first language and so far it is the only language I use. The significant slowdown can be fixed by using a small value as the read length, but that does not output fast enough. There is still a lot I am not completely positive about; for example, when you say "your stack will fill up", do you mean the memory?

        As for the logic, it is pretty straightforward (or so I thought ;) ): the iterator is what actually sets the size at which you want to split the file, so doubling it will actually make it split the file into 4GB chunks. Once the iterator hits its mark, it calls the sub again, until the length of $buf != the read length (which was the only way I knew of to check for eof).

        If you set the iterator to a higher value, you of course need to adjust the read length of $buf as well. With that said, what would be a better way to check $buf for end of file? And thanks for pointing all this out to me :)
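        For the eof question: read itself already signals end of file - it returns 0 at EOF and undef on error - and Perl's built-in eof() can test the handle directly. An untested sketch with a placeholder filename:

        use strict;
        use warnings;

        open my $fh, '<', 'filename.dat' or die "cannot open filename.dat: $!";
        binmode($fh);
        until ( eof($fh) ) {                  # ask the handle directly for end of file
            my $len = read( $fh, my $buf, 8 * 1024 );
            die "read failed: $!" unless defined $len;
            print "read $len bytes\n";        # the last read may be short; that's fine
        }
        close($fh);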