Micz has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a file (about 500MB), which contains
NAME
data
...
data
$
NAME
data
...
data

about 22,000 times. The "data ... data" part can be 30-40 rows. I would like to split this one huge file into separate files, splitting at each "$" line, with each output file named after its NAME line. I tried the following:
#!/usr/bin/perl -w
$/="\$\n";
open(A,'filename')||die$!;
open(B,'>',(split/\n/)[1])and+chomp,print(B)while<A>;
__END__
I get all kinds of errors, memory fills up, etc. Does anyone have a code snippet which might solve my problem? Thanks in advance!

Replies are listed 'Best First'.
Re: Splitting long file
by Abigail-II (Bishop) on Apr 08, 2004 at 10:52 UTC
    One way of doing it would be using a simple state machine. Here's an untested piece of code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $STATE = "FILE";
    my $name;
    my $fh;

    while (<>) {
        if ($STATE eq "DATA") {
            if ($_ eq "\$\n") {
                close $fh or die "close '$name': $!\n";
                $STATE = "FILE";
                next;
            }
            print $fh $_;
            next;
        }
        elsif ($STATE eq "FILE") {
            chomp;
            $name = $_;
            open $fh => ">", $name or die "open '$name': $!\n";
            $STATE = "DATA";
            next;
        }
        else {
            die "Unknown state '$STATE'.\n";
        }
    }
    if ($STATE eq "DATA") {
        close $fh or die "close '$name': $!\n";
    }
    __END__
    Abigail
Re: Splitting long file
by thor (Priest) on Apr 08, 2004 at 12:38 UTC
    local $/ = "\$\n";
    open(A, 'file');
    while (<A>) {
        if (m/^(\w+)\n/) {   # use whatever regex matches NAME here
            open(my $fh, ">$1") or die "Couldn't open '$1' for write: $!";
            print $fh $_;
            close $fh;
        }
    }
    Add in some error checking, and you should be golden. This method has the advantage of only holding one chunk of data in memory at a time, so it scales well. Beware, however, that this will happily clobber files if there are two sections with the same header. To avoid that, open the output file for append ('>>' instead of '>').
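    For anyone who wants to try this out, here is a self-contained shell session sketching the record-separator approach described above, using the append ('>>') variant. The file name and sample data are made up for illustration:

    ```shell
    # A tiny sample file in the thread's format (names and data are invented):
    printf 'Tom\ndata1\ndata2\n$\nDick\ndata3\n$\n' > bigfile.txt

    # Split it using the $/ = "\$\n" record-separator idea; each output
    # file is opened for append (">>") so duplicate NAMEs accumulate
    # instead of clobbering each other.
    perl -e '
        $/ = "\$\n";                          # one NAME..$ chunk per read
        open my $in, "<", "bigfile.txt" or die $!;
        while (<$in>) {
            next unless /^(\w+)\n/;           # first line of the chunk is NAME
            open my $out, ">>", $1 or die "open $1: $!";
            print $out $_;                    # write the whole chunk
            close $out;
        }
    '
    ```

    After running this, the files Tom and Dick each contain their own chunk, including the trailing "$" separator line.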

    thor

Re: Splitting long file
by BrowserUk (Patriarch) on Apr 08, 2004 at 10:25 UTC

    This is untested, but should be close

    perl -044ne"($n,$d)=m[\n?([^\n]+)\n(.+)\$?]m;open O,'>$n';print O $d;close O;" file

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      This seems closer. (NB: Win32 quotes switch to *nix-style)

      perl -044ne '($n,$d)=m!\n?([^\n]+)\n([^\$]+)!s;open O,">$n";print O $d' file
      Update: 044 removed from command line
      D'oh.


      davis
      It's not easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
Re: Splitting long file
by blue_cowdawg (Monsignor) on Apr 08, 2004 at 17:03 UTC

        Does anyone have a code snippet which might solve my problems.

    In the spirit of TIMTOWTDI and the fact that I have been having a blast using Tie::File lately here is another solution.

    Preconditions

    First off I set up a very small data file to test with:
    Tom
    data
    data
    data
    data
    data
    $
    Dick
    data
    data
    data
    data
    data
    $
    Harry
    data
    data
    data
    data
    data
    $
    This is a very small test set I realize but wasn't really ambitious enough to create a file by hand as large as you were talking about.

    The code

    Here is another suggestion for how to solve the problem:
    #!/usr/bin/perl -w
    #############################
    use strict;
    use warnings;
    use Tie::File;

    my @ry = ();        # This will be tied
    my $OLDIFS = $/;    # save the IFS
    $/ = "\$\n";        # now change it to suit our needs
    #
    # Rope, tie and brand 'em! YEEE-HAW!
    tie @ry, "Tie::File", "datFile.txt" or die $!;
    foreach my $rec (@ry) {                   # iterate through the records
        chomp $rec;                           # Eliminate IFS
        next unless $rec;                     # Eliminate spew about blank
                                              # records if they happen
        my $fname = (split(/\n/, $rec))[0];   # Get the name
        next unless $fname;                   # ho hum. Sometimes you're the
                                              # windshield, sometimes you're
                                              # the bug.
        open FOUT, "> $fname" or die "$fname:$!";  # Open file
        print FOUT $rec;                      # store record
        close FOUT;                           # done with this
    }
    untie @ry;                                # cut them thar ropes!

    The potential issue with this solution is that I've been told (and I can't verify this one way or another) that Tie::File is memory greedy and will slurp the whole file into memory in one gulp. If the box you are running this script on is memory-starved and that assertion about memory use is true, then you might have an issue. You mention having memory issues in your OP, and I don't know how starved for memory your box is.

    You didn't specify whether you wanted to preserve the dollar-sign record separator or not, but if you eliminate the chomp you will accomplish that as well.

    HTH!

Re: Splitting long file
by TilRMan (Friar) on Apr 08, 2004 at 10:08 UTC

    Hmmm. I'm sure I can work a ... in somehow:

    open BIGFILE, $filename or die;
    while (<BIGFILE>) {
        if (/(.+)$/ ... /^\$$/ && next) {
            if (defined $1) {
                open SMALLFILE, ">$1" or die;
                next;
            }
            print SMALLFILE;
            next;
        }
        warn "Untested code is bad!";
    }
    close BIGFILE;
      You made a classic mistake there: $1 stays set until the next successful match, not until the next regexp invocation.

      My take on the whole problem would be slightly more verbose:

      open BIGFILE, $filename or die "Could not open $filename:$!\n";
      while (<BIGFILE>) {
          unless (defined($smallfile)) {
              $smallfile = $_;
              chomp $smallfile;
              open(SMALLFILE, ">$smallfile")
                  or die "Could not open smallfile $smallfile (referenced in $.): $!\n";
          } elsif (/^\$$/) {
              close(SMALLFILE) || die "Could not close $smallfile:$!\n";
              $smallfile = undef;
          } else {
              print SMALLFILE;
          }
      }
      close BIGFILE;
      Update: $filename should contain the name of the BIG file, of course.

        Too right! That oughta learn me.

        Rereading, I now see the approach of the OP. Perhaps (again untested):

        $/ = "\$\n";
        open BIGFILE, yada yada or die;
        while (<BIGFILE>) {
            my ($filename, $guts) = split /\n/, $_, 2;
            open SMALLFILE, ">$filename" or die;
            print SMALLFILE $guts;
            close SMALLFILE;
        }
        close BIGFILE;
        or even
        use File::Slurp qw( write_file );
        $/ = "\$\n";
        open BIGFILE, yada yada;
        while (<BIGFILE>) {
            write_file(split /\n/, $_, 2);
        }
        I tried this with a smaller file, it seems to work. Great!

        One error appears though: "Name "main::filename" used only once: possible typo at matija.pl line 3."

        Is this something I need to solve? Thanks!
Re: Splitting long file
by EvdB (Deacon) on Apr 08, 2004 at 11:20 UTC
    Untested:
    open BIG, 'big_file';
    my $i = 1;
    while (<BIG>) {
        open OUT, ">outfile_" . $i++;    # BUG: see follow up below.
        if ( m/^\$\n$/ ) {
            close OUT;
            open OUT, ">outfile_" . $i++;
            next;
        }
        print OUT;
    }
    close OUT;
    Doesn't deal with your naming but that should be easy to add.

    PS. if you intend to create 22000 files you should probably put them into subdirectories, not all in one directory.

    --tidiness is the memory loss of environmental mnemonics

      Since you are reading line by line, the
      if ( m/^\$\n$/ )
      will never match.
      Update: I am an idiot. This will hopefully teach me to pay closer attention to which $ is meant as end of line, and which is meant as literal $. (And I will start using \z in my regexps...)
        It does work because I have not chomped the line.

        However the second open is in the wrong place:

        ### WRONG ###
        while (<BIG>) {
            open OUT, ">outfile_" . $i++;
            if ( m/^\$\n$/ ) {

        ### CORRECT ###
        open OUT, ">outfile_" . $i++;
        while (<BIG>) {
            if ( m/^\$\n$/ ) {

        --tidiness is the memory loss of environmental mnemonics

Re: Splitting long file
by runrig (Abbot) on Apr 08, 2004 at 19:11 UTC
    #!/usr/bin/awk -f
    /^\$$/ {
        print > file
        close(file)
        file = ""
        next
    }
    {
        if (file == "") { file = $0 }
        print > file
    }
    Then just run (assuming this is saved as split_big_file):
    split_big_file big_file
    ( Oh wait, ah, this isn't awkmonks? Sorry, wrong monastery... ) Well, errrm, just take the previously mentioned file, and run:
    a2p split_big_file
    to get the perl version :-)

    Updated. Fixed close statement (put parens around 'file' arg). Changed 'continue' to 'next' (although the awk worked and didn't complain either way, it confused a2p).
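    For anyone who wants to try the awk version, here is a self-contained shell session that saves the program above and runs it on a tiny sample file (the file name and data are invented for illustration):

    ```shell
    # Save the awk program from the reply above as split_big_file:
    cat > split_big_file <<'EOF'
    #!/usr/bin/awk -f
    /^\$$/ {
        print > file
        close(file)
        file = ""
        next
    }
    {
        if (file == "") { file = $0 }
        print > file
    }
    EOF

    # A tiny sample input in the thread's NAME/data/$ format:
    printf 'Tom\ndata1\n$\nDick\ndata2\n$\n' > big_file

    # Run it; each NAME..$ section lands in a file named after NAME:
    awk -f split_big_file big_file
    ```

    This produces files Tom and Dick, each holding its own section including the trailing "$" line.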

Re: Splitting long file
by eXile (Priest) on Apr 08, 2004 at 15:31 UTC
    A potential problem with your code is that it leaves a whole bunch of file descriptors open (about 22,001) because you didn't close B, while it only needs two open file descriptors at a time. Most OSs (all?) have a maximum number of file descriptors that can be open at a given time.
    Hope this helps you avoid the problem next time.
      Perl is smart enough to close the old file when you open another file on the same filehandle, therefore this isn't a problem.
        then I've learned something from this thread as well. Thanks tilly!