disciple has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script to split a large file into many small files. The large file is 1.8 gigabytes, or approximately 6.4 million lines. The objective is to break it into chunks so we can open the data and see what it contains, because our machines cannot handle opening text files that large.

I have written a script that works but would like some feedback on better ways to accomplish the same thing. Maybe even faster ways.

use strict;
use warnings;

my $source         = shift or &usage();
my $lines_per_file = shift or &usage();

open (my $FH, "<$source") or die "Could not open source file. $!";
open (my $OUT, ">00000000.log") or die "Could not open destination file. $!";

my $i = 0;
while (<$FH>) {
    print $OUT $_;
    $i++;
    if ($i % $lines_per_file == 0) {
        close($OUT);
        my $FHNEW = sprintf("%08d", $i);
        open ($OUT, ">${FHNEW}.log") or die "Could not open destination file. $!";
    }
}

sub usage() {
    print <<EOF;
PROGRAM NAME: Partition File

DESCRIPTION: Takes a file and creates many small files out of the large file.

EXAMPLE USAGE: partition_file.pl log.txt 1000

PARAMETERS:
    1. Source File: File name of the source file to partition.
    2. Maximum number of lines per file: The number of lines per file.
EOF
    exit;
}

Thanks fellow monks!
Best Regards,
disciple
thePfeiffers.net

Replies are listed 'Best First'.
Re: Splitting Large File into many small ones.
by Corion (Patriarch) on Dec 08, 2003 at 21:52 UTC

    If you are under any system with the Unix toolset available, the split command already does this and more. If you're bent on using Perl, the Perl Power Tools also have a pure Perl implementation of split.

    Personally, I would make the script more flexible by parsing the options via Getopt::Long and taking the input file as the last parameter. Having it as the last parameter allows the script to be used within a shell pipeline like gunzip -c my_file | perl -w partition_file.pl 1000, and it makes the rest of the parameters independent of their position.
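
    A minimal sketch of that approach (the --lines option name, the 1000-line default, and the %08d.log naming are illustrative choices, not anything specified above):

    use strict;
    use warnings;
    use Getopt::Long;

    # Parse options first; whatever remains in @ARGV (or STDIN, if nothing
    # remains) is read by the diamond operator, so the script also works at
    # the end of a pipeline such as:
    #   gunzip -c my_file | perl partition_file.pl --lines 1000
    my $lines_per_file = 1000;
    GetOptions( 'lines=i' => \$lines_per_file )
        or die "Usage: partition_file.pl [--lines N] [file ...]\n";

    my $out;
    my $count = 0;
    while (<>) {
        # Rotate to a new output file every $lines_per_file lines.
        if ( $count % $lines_per_file == 0 ) {
            close $out if defined $out;
            my $name = sprintf '%08d.log', $count;
            open $out, '>', $name or die "Could not open $name. $!";
        }
        print {$out} $_;
        $count++;
    }
    close $out if defined $out;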

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Splitting Large File into many small ones.
by pg (Canon) on Dec 08, 2003 at 21:55 UTC

    Reading line by line is not a good idea. You should use read() to pull in a chunk, then do a single <> read so that each small file ends on a line boundary. (I tested the demo with a big log file; the performance is very good.)

    use strict;
    use warnings;

    open (FH, "<message_log01") or die "Could not open source file. $!";

    my $i = 0;
    while (1) {
        my $chunk;
        print "process part $i\n";
        open(OUT, ">part$i.log") or die "Could not open destination file";
        $i++;
        if (!eof(FH)) {
            # grab roughly a megabyte at a time
            read(FH, $chunk, 1000000);
            print OUT $chunk;
        }
        if (!eof(FH)) {
            # read the rest of the current line so the piece ends at a line boundary
            $chunk = <FH>;
            print OUT $chunk;
        }
        close(OUT);
        last if eof(FH);
    }
      Thanks for that. I have not had a chance to run the code, but will do so tomorrow.

      disciple
Re: Splitting Large File into many small ones.
by Paulster2 (Priest) on Dec 08, 2003 at 21:54 UTC

    What kind of system are you running? I know that on a UNIX system you can do this from the command line. I think it's called split, but I'm not sure off the top of my head. It does it really fast, though.

    UPDATE: Went out and looked. It is split that does it on UNIX, with a caveat. It names the output using whatever prefix you give it, with the suffixes (prefix)aa through (prefix)zz. If you don't provide a prefix it uses x, so you end up with xaa through xzz, for a grand total of 676 files max. You can, however, split those 676 files into smaller files using the same method. The default size split gives is 1000 lines per file, so you might have to do some math and adjust accordingly for your file size.
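
    For example, to cut a file like the one above into 1,000,000-line pieces with a chunk_ prefix (the line count and prefix here are arbitrary):

      split -l 1000000 log.txt chunk_

    which would produce chunk_aa, chunk_ab, and so on.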

    Paulster2

Re: Splitting Large File into many small ones.
by disciple (Pilgrim) on Dec 08, 2003 at 22:40 UTC
    I am running a Windows box.
    Thanks.

      Update: I forgot to paste the link. D'oh!

      In that case, you will probably find this useful. Being natively compiled, they run markedly quicker than the Perl ports or Cygwin versions.

      P:\test>split --help
      Usage: split [OPTION] [INPUT [PREFIX]]
      Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
      PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

        -b, --bytes=SIZE        put SIZE bytes per output file
        -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
        -l, --lines=NUMBER      put NUMBER lines per output file
        -NUMBER                 same as -l NUMBER
            --verbose           print a diagnostic to standard error just
                                  before each output file is opened
            --help              display this help and exit
            --version           output version information and exit

      SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

      Report bugs to <bug-textutils@gnu.org>.

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Re: Splitting Large File into many small ones.
by disciple (Pilgrim) on Dec 09, 2003 at 02:16 UTC
    I probably should have clarified one thing. At this point, the work is done and I have no immediate need to do anything else with it. It is purely an exercise now to learn new and better ways of implementing the same thing.

    Thanks all for your comments.

    disciple

      In that case, about the only comment I would make about your original is that you could have used $. instead of $i to do the counting. Whether that would be better in any way is moot, though.
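
      A sketch of that variant, reusing the filehandles and the pre-opened 00000000.log from the original script ($. holds the line number of the most recently read filehandle, so it replaces $i):

      while (<$FH>) {
          print $OUT $_;
          if ( $. % $lines_per_file == 0 ) {
              close $OUT;
              my $name = sprintf '%08d.log', $.;
              open $OUT, '>', $name or die "Could not open $name. $!";
          }
      }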


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail