disciple has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script to split a large file into many small files. The large file is 1.8 gigabytes, or approximately 6.4 million lines. The objective is to break it into chunks so we can open the data and see what it contains, because our machines cannot handle opening text files that large.

I have written a script that works but would like some feedback on better ways to accomplish the same thing. Maybe even faster ways.

use strict;
use warnings;

my $source         = shift or &usage();
my $lines_per_file = shift or &usage();

open (my $FH, "<$source") or die "Could not open source file. $!";
open (my $OUT, ">00000000.log") or die "Could not open destination file. $!";

my $i = 0;
while (<$FH>) {
    print $OUT $_;
    $i++;
    if ($i % $lines_per_file == 0) {
        close($OUT);
        my $FHNEW = sprintf("%08d", $i);
        open ($OUT, ">${FHNEW}.log") or die "Could not open destination file. $!";
    }
}

sub usage() {
    print <<EOF;
PROGRAM NAME: Partition File

DESCRIPTION: Takes a file and creates many small files out of the large file.

EXAMPLE USAGE: partition_file.pl log.txt 1000

PARAMETERS:
    1. Source File: File name of the source file to partition.
    2. Maximum number of lines per file: The number of lines per file.
EOF
    exit;
}

Thanks fellow monks!
Best Regards,
disciple
thePfeiffers.net

Replies are listed 'Best First'.
Re: Splitting Large File into many small ones.
by Corion (Patriarch) on Dec 08, 2003 at 21:52 UTC

    If you are under any system with the Unix toolset available, the split command already does this and more. If you're bent on using Perl, the Perl Power Tools also have a pure Perl implementation of split.

    Personally, I would make the script more flexible by parsing the options via Getopt::Long and taking the input file as the last parameter. Having it as the last parameter allows the script to be used within a shell pipeline like gunzip -c my_file | perl -w partition_file.pl 1000, and it makes the rest of the parameters independent of their position.
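
    A minimal sketch of that approach (the --lines option name, the 1000-line default, and the %08d.log naming are illustrative choices, not anything specified above):

    use strict;
    use warnings;
    use Getopt::Long;

    # Parse options first; whatever remains in @ARGV (or STDIN, if nothing
    # remains) is read by the diamond operator, so the script also works at
    # the end of a pipeline such as:
    #   gunzip -c my_file | perl partition_file.pl --lines 1000
    my $lines_per_file = 1000;
    GetOptions( 'lines=i' => \$lines_per_file )
        or die "Usage: partition_file.pl [--lines N] [file ...]\n";

    my $out;
    my $count = 0;
    while (<>) {
        # Rotate to a new output file every $lines_per_file lines.
        if ( $count % $lines_per_file == 0 ) {
            close $out if defined $out;
            my $name = sprintf '%08d.log', $count;
            open $out, '>', $name or die "Could not open $name. $!";
        }
        print {$out} $_;
        $count++;
    }
    close $out if defined $out;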

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Splitting Large File into many small ones.
by pg (Canon) on Dec 08, 2003 at 21:55 UTC

    Reading line by line is not a good idea. You should use read() to pull in a chunk, then do a single <> read so that each small file ends on a line boundary. (I tested the demo with a big log file; the performance is very good.)

    use strict;
    use warnings;

    open (FH, "<message_log01") or die "Could not open source file. $!";

    my $i = 0;
    while (1) {
        my $chunk;
        print "process part $i\n";
        open(OUT, ">part$i.log") or die "Could not open destination file";
        $i++;
        if (!eof(FH)) {
            # grab roughly a megabyte at a time
            read(FH, $chunk, 1000000);
            print OUT $chunk;
        }
        if (!eof(FH)) {
            # read the rest of the current line so the piece ends at a line boundary
            $chunk = <FH>;
            print OUT $chunk;
        }
        close(OUT);
        last if eof(FH);
    }
      Thanks for that. I have not had a chance to run the code, but will do so tomorrow.

      disciple
Re: Splitting Large File into many small ones.
by Paulster2 (Priest) on Dec 08, 2003 at 21:54 UTC

    What kind of system are you running? I know that on a UNIX system you can do this from the command line. I think it's called split, but I'm not sure off the top of my head. It does it really fast, though.

    UPDATE: Went out and looked. It is split that does it on UNIX, with a caveat. It names the output using whatever prefix you give it, with the suffixes (prefix)aa through (prefix)zz. If you don't provide a prefix it uses x, so you end up with xaa through xzz, for a grand total of 676 files max. You can, however, split those 676 files into smaller files using the same method. The default size split gives is 1000 lines per file, so you might have to do some math and adjust accordingly for your file size.
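
    For example, to cut a file like the one above into 1,000,000-line pieces with a chunk_ prefix (the line count and prefix here are arbitrary):

      split -l 1000000 log.txt chunk_

    which would produce chunk_aa, chunk_ab, and so on.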

    Paulster2

Re: Splitting Large File into many small ones.
by disciple (Pilgrim) on Dec 08, 2003 at 22:40 UTC
    I am running a Windows box.
    Thanks.

      Update: I forgot to paste the link. D'oh!

      In that case, you will probably find this useful. Being natively compiled, they run markedly quicker than the Perl ports or Cygwin versions.

      P:\test>split --help
      Usage: split [OPTION] [INPUT [PREFIX]]
      Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
      PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

        -b, --bytes=SIZE        put SIZE bytes per output file
        -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
        -l, --lines=NUMBER      put NUMBER lines per output file
        -NUMBER                 same as -l NUMBER
            --verbose           print a diagnostic to standard error just
                                  before each output file is opened
            --help              display this help and exit
            --version           output version information and exit

      SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

      Report bugs to <bug-textutils@gnu.org>.

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Re: Splitting Large File into many small ones.
by disciple (Pilgrim) on Dec 09, 2003 at 02:16 UTC
    I probably should have clarified one thing. At this point, the work is done and I have no immediate need to do anything else with it. It is purely an exercise now to learn new and better ways of implementing the same thing.

    Thanks all for your comments.

    disciple

      In that case, about the only comment I would make about your original is that you could have used $. instead of $i to do the counting. Whether that would be better in any way is moot, though.
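
      A sketch of that variant, reusing the filehandles and the pre-opened 00000000.log from the original script ($. holds the line number of the most recently read filehandle, so it replaces $i):

      while (<$FH>) {
          print $OUT $_;
          if ( $. % $lines_per_file == 0 ) {
              close $OUT;
              my $name = sprintf '%08d.log', $.;
              open $OUT, '>', $name or die "Could not open $name. $!";
          }
      }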


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail