Micz has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a file (about 500MB), which contains
NAME
data
...
data
$
NAME
data
...
data

about 22,000 times. The "data ... data" part can be 30-40 rows. I would like to split this one huge file into separate files, splitting at each "$" line, with each output file named after its NAME line. I tried the following:
#!/usr/bin/perl -w
$/="\$\n";
open(A,'filename')||die$!;
open(B,'>',(split/\n/)[1])and+chomp,print(B)while<A>;
__END__
I get all kinds of errors, memory fills up, etc. Does anyone have a code snippet which might solve my problem? Thanks in advance!

Replies are listed 'Best First'.
Re: Splitting long file
by Abigail-II (Bishop) on Apr 08, 2004 at 10:52 UTC
    One way of doing it would be using a simple state machine. Here's an untested piece of code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $STATE = "FILE";
    my $name;
    my $fh;

    while (<>) {
        if ($STATE eq "DATA") {
            if ($_ eq "\$\n") {
                close $fh or die "close '$name': $!\n";
                $STATE = "FILE";
                next;
            }
            print $fh $_;
            next;
        }
        elsif ($STATE eq "FILE") {
            chomp;
            $name = $_;
            open $fh => ">", $name or die "open '$name': $!\n";
            $STATE = "DATA";
            next;
        }
        else {
            die "Unknown state '$STATE'.\n";
        }
    }
    if ($STATE eq "DATA") {
        close $fh or die "close '$name': $!\n";
    }
    __END__
    Abigail
Re: Splitting long file
by thor (Priest) on Apr 08, 2004 at 12:38 UTC
    local $/ = "\$\n";
    open(A, 'file');
    while (<A>) {
        if (m/^(\w+)\n/) {   # use whatever regex matches NAME here
            open(my $fh, ">$1") or die "Couldn't open '$1' for write: $!";
            print $fh $_;
            close $fh;
        }
    }
    Add in some error checking, and you should be golden. This method has the advantage of only holding one chunk of data in memory at a time, so it scales well. Beware, however, that this will happily clobber files if there are two sections with the same header. To avoid that, open the output file for append ('>>' instead of '>').
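    For anyone who wants to try this out, here is a self-contained shell session sketching the record-separator approach described above, using the append ('>>') variant. The file name and sample data are made up for illustration:

    ```shell
    # A tiny sample file in the thread's format (names and data are invented):
    printf 'Tom\ndata1\ndata2\n$\nDick\ndata3\n$\n' > bigfile.txt

    # Split it using the $/ = "\$\n" record-separator idea; each output
    # file is opened for append (">>") so duplicate NAMEs accumulate
    # instead of clobbering each other.
    perl -e '
        $/ = "\$\n";                          # one NAME..$ chunk per read
        open my $in, "<", "bigfile.txt" or die $!;
        while (<$in>) {
            next unless /^(\w+)\n/;           # first line of the chunk is NAME
            open my $out, ">>", $1 or die "open $1: $!";
            print $out $_;                    # write the whole chunk
            close $out;
        }
    '
    ```

    After running this, the files Tom and Dick each contain their own chunk, including the trailing "$" separator line.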

    thor

Re: Splitting long file
by BrowserUk (Patriarch) on Apr 08, 2004 at 10:25 UTC

    This is untested, but should be close

    perl -044ne"($n,$d)=m[\n?([^\n]+)\n(.+)\$?]m;open O,'>$n';print O $d;close O;" file

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      This seems closer. (NB: Win32 quotes switch to *nix-style)

      perl -044ne '($n,$d)=m!\n?([^\n]+)\n([^\$]+)!s;open O,">$n";print O $d' file
      Update: 044 removed from command line
      D'oh.


      davis
      It's not easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
Re: Splitting long file
by blue_cowdawg (Monsignor) on Apr 08, 2004 at 17:03 UTC

        Does anyone have a code snippet which might solve my problems.

    In the spirit of TIMTOWTDI and the fact that I have been having a blast using Tie::File lately here is another solution.

    Preconditions

    First off I set up a very small data file to test with:
    Tom
    data
    data
    data
    data
    data
    $
    Dick
    data
    data
    data
    data
    data
    $
    Harry
    data
    data
    data
    data
    data
    $
    This is a very small test set I realize but wasn't really ambitious enough to create a file by hand as large as you were talking about.

    The code

    Here is another suggestion for how to solve the problem:
    #!/usr/bin/perl -w
    #############################
    use strict;
    use warnings;
    use Tie::File;

    my @ry = ();        # This will be tied
    my $OLDIFS = $/;    # save the IFS
    $/ = "\$\n";        # now change it to suit our needs
    #
    # Rope, tie and brand 'em! YEEE-HAW!
    tie @ry, "Tie::File", "datFile.txt" or die $!;
    foreach my $rec (@ry) {                   # iterate through the records
        chomp $rec;                           # Eliminate IFS
        next unless $rec;                     # Eliminate spew about blank
                                              # records if they happen
        my $fname = (split(/\n/, $rec))[0];   # Get the name
        next unless $fname;                   # ho hum. Sometimes you're the
                                              # windshield, sometimes you're
                                              # the bug.
        open FOUT, "> $fname" or die "$fname:$!";  # Open file
        print FOUT $rec;                      # store record
        close FOUT;                           # done with this
    }
    untie @ry;                                # cut them thar ropes!

    The potential issue with this solution is that I've been told (and I can't verify this one way or another) that Tie::File is memory greedy and will slurp the whole file into memory in one gulp. If the box you are running this script on is memory-starved and that assertion about memory use is true, then you might have an issue. You mention having memory issues in your OP, and I don't know how starved for memory your box is.

    You didn't specify whether you wanted to preserve the dollar-sign record separator or not, but if you eliminate the chomp you will accomplish that as well.

    HTH!

Re: Splitting long file
by TilRMan (Friar) on Apr 08, 2004 at 10:08 UTC

    Hmmm. I'm sure I can work a ... in somehow:

    open BIGFILE, $filename or die;
    while (<BIGFILE>) {
        if (/(.+)$/ ... /^\$$/ && next) {
            if (defined $1) {
                open SMALLFILE, ">$1" or die;
                next;
            }
            print SMALLFILE;
            next;
        }
        warn "Untested code is bad!";
    }
    close BIGFILE;
      You made a classic mistake there: $1 stays set until the next successful match, not until the next regexp invocation.

      My take on the whole problem would be slightly more verbose:

      open BIGFILE, $filename or die "Could not open $filename:$!\n";
      while (<BIGFILE>) {
          unless (defined($smallfile)) {
              $smallfile = $_;
              chomp $smallfile;
              open(SMALLFILE, ">$smallfile")
                  or die "Could not open smallfile $smallfile (referenced in $.): $!\n";
          } elsif (/^\$$/) {
              close(SMALLFILE) || die "Could not close $smallfile:$!\n";
              $smallfile = undef;
          } else {
              print SMALLFILE;
          }
      }
      close BIGFILE;
      Update: $filename should contain the name of the BIG file, of course.

        Too right! That oughta learn me.

        Rereading, I now see the approach of the OP. Perhaps (again untested):

        $/ = "\$\n";
        open BIGFILE, yada yada or die;
        while (<BIGFILE>) {
            my ($filename, $guts) = split /\n/, $_, 2;
            open SMALLFILE, ">$filename" or die;
            print SMALLFILE $guts;
            close SMALLFILE;
        }
        close BIGFILE;
        or even
        use File::Slurp qw( write_file );
        $/ = "\$\n";
        open BIGFILE, yada yada;
        while (<BIGFILE>) {
            write_file(split /\n/, $_, 2);
        }
        I tried this with a smaller file, it seems to work. Great!

        One error appears though: "Name "main::filename" used only once: possible typo at matija.pl line 3."

        Is this something I need to solve? Thanks!
Re: Splitting long file
by EvdB (Deacon) on Apr 08, 2004 at 11:20 UTC
    Untested:
    open BIG, 'big_file';
    my $i = 1;
    while (<BIG>) {
        open OUT, ">outfile_" . $i++;    # BUG: see follow up below.
        if ( m/^\$\n$/ ) {
            close OUT;
            open OUT, ">outfile_" . $i++;
            next;
        }
        print OUT;
    }
    close OUT;
    Doesn't deal with your naming but that should be easy to add.

    PS. if you intend to create 22000 files you should probably put them into subdirectories, not all in one directory.

    --tidiness is the memory loss of environmental mnemonics

      Since you are reading line by line, the
      if ( m/^\$\n$/ )
      will never match.
      Update: I am an idiot. This will hopefully teach me to pay closer attention to which $ is meant as end of line, and which is meant as literal $. (And I will start using \z in my regexps...)
        It does work because I have not chomped the line.

        However the second open is in the wrong place:

        ### WRONG ###
        while (<BIG>) {
            open OUT, ">outfile_" . $i++;
            if ( m/^\$\n$/ ) {

        ### CORRECT ###
        open OUT, ">outfile_" . $i++;
        while (<BIG>) {
            if ( m/^\$\n$/ ) {

        --tidiness is the memory loss of environmental mnemonics

Re: Splitting long file
by runrig (Abbot) on Apr 08, 2004 at 19:11 UTC
    #!/usr/bin/awk -f
    /^\$$/ {
        print > file
        close(file)
        file = ""
        next
    }
    {
        if (file == "") { file = $0 }
        print > file
    }
    Then just run (assuming this is saved as split_big_file):
    split_big_file big_file
    ( Oh wait, ah, this isn't awkmonks? Sorry, wrong monastery... ) Well, errrm, just take the previously mentioned file, and run:
    a2p split_big_file
    to get the perl version :-)

    Updated. Fixed close statement (put parens around 'file' arg). Changed 'continue' to 'next' (although the awk worked and didn't complain either way, it confused a2p).
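    For anyone who wants to try the awk version, here is a self-contained shell session that saves the program above and runs it on a tiny sample file (the file name and data are invented for illustration):

    ```shell
    # Save the awk program from the reply above as split_big_file:
    cat > split_big_file <<'EOF'
    #!/usr/bin/awk -f
    /^\$$/ {
        print > file
        close(file)
        file = ""
        next
    }
    {
        if (file == "") { file = $0 }
        print > file
    }
    EOF

    # A tiny sample input in the thread's NAME/data/$ format:
    printf 'Tom\ndata1\n$\nDick\ndata2\n$\n' > big_file

    # Run it; each NAME..$ section lands in a file named after NAME:
    awk -f split_big_file big_file
    ```

    This produces files Tom and Dick, each holding its own section including the trailing "$" line.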

Re: Splitting long file
by eXile (Priest) on Apr 08, 2004 at 15:31 UTC
    A potential problem with your code is that it leaves a whole bunch of file descriptors open (about 22,001) because you didn't close B, while it only needs two open file descriptors at a time. Most OSs (all?) have a maximum number of file descriptors that can be open at a given time.
    Hope this helps you avoid the problem next time.
      Perl is smart enough to close the old file when you open another file on the same filehandle, therefore this isn't a problem.
        then I've learned something from this thread as well. Thanks tilly!