Re: Splitting long file
by Abigail-II (Bishop) on Apr 08, 2004 at 10:52 UTC
|
One way of doing it would be using a simple
state machine. Here's an untested piece of code:
#!/usr/bin/perl
use strict;
use warnings;

my $STATE = "FILE";
my $name;
my $fh;

while (<>) {
    if ($STATE eq "DATA") {
        if ($_ eq "\$\n") {
            close $fh or die "close '$name': $!\n";
            $STATE = "FILE";
            next;
        }
        print $fh $_;
        next;
    }
    elsif ($STATE eq "FILE") {
        chomp;
        $name = $_;
        open $fh => ">", $name or die "open '$name': $!\n";
        $STATE = "DATA";
        next;
    }
    else {
        die "Unknown state '$STATE'.\n";
    }
}
if ($STATE eq "DATA") {
    close $fh or die "close '$name': $!\n";
}
__END__
Abigail | [reply] [d/l] |
Re: Splitting long file
by thor (Priest) on Apr 08, 2004 at 12:38 UTC
|
local $/ = "\$\n";
open(A, '<', 'file') or die "Couldn't open 'file': $!";
while (<A>) {
    if (m/^(\w+)\n/) {  # use whatever regex matches 'DATA' here
        open(my $fh, '>', $1) or die "Couldn't open '$1' for write: $!";
        print $fh $_;
        close $fh;
    }
}
Add in some error checking, and you should be golden. This method has the advantage of only holding one chunk of data in memory at a time, so it scales well. Beware, however, that this will happily clobber files if there are two sections with the same header. To avoid that, open the output file for append ('>>' instead of '>').
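For the record, here's roughly how that hardened variant might look with the append fix and error checking folded in. This is only a sketch: the sample data, the `\w+` header pattern, and the file names are assumptions carried over from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a tiny sample input in the thread's format (assumed to be
# "name\ndata lines...\n$\n" per section).
my $infile = 'big_sample.txt';
open my $mk, '>', $infile or die "open '$infile': $!";
print {$mk} "Tom\ndata\ndata\n\$\nDick\ndata\n\$\n";
close $mk;

{
    local $/ = "\$\n";                    # read one whole section at a time
    open my $in, '<', $infile or die "open '$infile': $!";
    while (my $rec = <$in>) {
        chomp $rec;                       # strip the "$\n" terminator
        next unless $rec =~ /^(\w+)\n/;   # first line names the output file
        my $name = $1;
        # '>>' appends, so duplicate headers don't clobber earlier output
        open my $out, '>>', $name or die "open '$name': $!";
        print {$out} $rec;
        close $out or die "close '$name': $!";
    }
    close $in;
}
```

With append mode, running this over a file containing two `Tom` sections leaves both chunks in the `Tom` output file instead of just the last one.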
| [reply] [d/l] |
Re: Splitting long file
by BrowserUk (Patriarch) on Apr 08, 2004 at 10:25 UTC
|
This is untested, but should be close
perl -044ne"($n,$d)=m[\n?([^\n]+)\n(.+)\$?]m;open O,'>$n';print O $d;close O;" file
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] |
|
|
perl -044ne '($n,$d)=m!\n?([^\n]+)\n([^\$]+)!s;open O,">$n";print O $d' file
Update: 044 removed from command line
D'oh.
davis
It's not easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
| [reply] [d/l] |
Re: Splitting long file
by blue_cowdawg (Monsignor) on Apr 08, 2004 at 17:03 UTC
|
Does anyone have a code snippet which might solve my problems.
In the spirit of TIMTOWTDI, and given that I have been
having a blast using Tie::File
lately, here is another
solution.
Preconditions
First off I set up a very small data file to test with:
Tom
data
data
data
data
data
$
Dick
data
data
data
data
data
$
Harry
data
data
data
data
data
$
This is a very small test set, I realize, but I wasn't really
ambitious enough to create a file by hand as large as the one you
were talking about.
The code
Here is another suggestion for how to solve the problem:
#!/usr/bin/perl -w
#############################
use strict;
use warnings;
use Tie::File;

my @ry = ();       # This will be tied
my $OLDIFS = $/;   # save the input record separator
$/ = "\$\n";       # now change it to suit our needs
#
# Rope tie and brand 'em! YEEE-HAW!
tie @ry, "Tie::File", "datFile.txt" or die $!;
foreach my $rec (@ry) {    # iterate through the records
    chomp $rec;            # strip the record separator
    next unless $rec;      # skip blank records if they happen
    my $fname = (split(/\n/, $rec))[0];   # Get the name
    next unless $fname;    # ho hum. Sometimes you're the windshield,
                           # sometimes you're the bug.
    open FOUT, ">", $fname or die "$fname: $!";   # Open file
    print FOUT $rec;       # store record
    close FOUT;            # done with this
}
untie @ry;                 # cut them thar ropes!
$/ = $OLDIFS;              # restore the record separator
The potential issue with this solution is that I've been told
(and I can't verify this one way or the other) that Tie::File
is memory greedy and will slurp the whole file into memory
in one gulp. If you are running this script on a memory-starved
box and that assertion about memory use is true, then you might
have an issue. You mention having memory issues in your OP, and
I don't know how starved for memory your machine is.
You didn't specify whether you wanted to preserve the
dollar-sign record separator, but if you eliminate
the chomp you will accomplish that as well.
HTH!
| [reply] [d/l] [select] |
Re: Splitting long file
by TilRMan (Friar) on Apr 08, 2004 at 10:08 UTC
|
open BIGFILE, $filename or die;
while (<BIGFILE>)
{
    if (/(.+)$/ ... /^\$$/ && next)
    {
        if (defined $1) { open SMALLFILE, ">$1" or die; next }
        print SMALLFILE;
        next;
    }
    warn "Untested code is bad!";
}
close BIGFILE;
close BIGFILE;
| [reply] [d/l] [select] |
|
|
open BIGFILE, $filename or die "Could not open $filename: $!\n";
while (<BIGFILE>) {
    unless (defined($smallfile)) {
        $smallfile = $_;
        chomp $smallfile;
        open(SMALLFILE, ">$smallfile")
            or die "Could not open smallfile $smallfile (referenced in $.): $!\n";
    } elsif (/^\$$/) {
        close(SMALLFILE) || die "Could not close $smallfile: $!\n";
        $smallfile = undef;
    } else {
        print SMALLFILE;
    }
}
close BIGFILE;
Update: $filename should contain the name of the BIG file, of course. | [reply] [d/l] |
|
|
Too right! That oughta learn me.
Rereading, I now see the approach of the OP. Perhaps (again untested):
$/ = "\$\n";
open BIGFILE, yada yada or die;
while (<BIGFILE>)
{
    my ($filename, $guts) = split /\n/, $_, 2;
    open SMALLFILE, ">$filename" or die;
    print SMALLFILE $guts;
    close SMALLFILE;
}
close BIGFILE;
or even
use File::Slurp qw( write_file );
$/ = "\$\n";
open BIGFILE, yada yada;
while (<BIGFILE>) {
    write_file(split /\n/, $_, 2);
}
| [reply] [d/l] [select] |
|
|
I tried this with a smaller file, and it seems to work. Great!
One warning appears, though: "Name "main::filename" used only once: possible typo at matija.pl line 3."
Is this something I need to solve? Thanks!
| [reply] |
Re: Splitting long file
by EvdB (Deacon) on Apr 08, 2004 at 11:20 UTC
|
open BIG, 'big_file';
my $i = 1;
while (<BIG>) {
    open OUT, ">outfile_" . $i++;   # BUG: see follow up below.
    if ( m/^\$\n$/ ) {
        close OUT;
        open OUT, ">outfile_" . $i++;
        next;
    }
    print OUT;
}
close OUT;
Doesn't deal with your naming but that should be easy to add.
PS. if you intend to create 22000 files you should probably put them into subdirectories, not all in one directory.
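One way to sketch that subdirectory idea is to bucket each output file by a prefix of its name. The two-character bucket scheme and the helper name below are purely illustrative, not anything from the original post.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Put each output file in a subdirectory named after the first two
# characters of its name, so 22000 files don't all land in one
# directory. (Hypothetical helper; adjust the bucketing to taste.)
sub bucketed_path {
    my ($name) = @_;
    my $bucket = lc substr($name, 0, 2);
    make_path($bucket) unless -d $bucket;
    return File::Spec->catfile($bucket, $name);
}

# usage in any of the splitters above:
#   open my $out, '>', bucketed_path($name) or die "open: $!";
```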
--tidiness is the memory loss of environmental mnemonics
| [reply] [d/l] [select] |
|
|
Since you are reading line by line, the
if ( m/^\$\n$/ )
will never match.
Update: I am an idiot. This will hopefully teach me to pay closer attention to which $ is meant as end of line, and which is meant as literal $. (And I will start using \z in my regexps...) | [reply] [d/l] |
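That `$`-versus-`\z` confusion is worth a tiny demo: when reading line by line, `$_` keeps its trailing newline, and `$` in a regex is happy to match just before it, while `\z` anchors at the absolute end of the string. (A sketch, not code from the thread.)

```perl
use strict;
use warnings;

my $line = "\$\n";   # a separator line exactly as read by <BIG>

# $ matches before a final newline, so this DOES match:
print "dollar-anchor: ", ($line =~ /^\$$/  ? "match" : "no match"), "\n";

# \z insists on the true end of string, so the newline blocks it:
print "z-anchor:      ", ($line =~ /^\$\z/ ? "match" : "no match"), "\n";
```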
|
|
### WRONG ###
while (<BIG>) {
    open OUT, ">outfile_" . $i++;
    if ( m/^\$\n$/ ) {

### CORRECT ###
open OUT, ">outfile_" . $i++;
while (<BIG>) {
    if ( m/^\$\n$/ ) {
--tidiness is the memory loss of environmental mnemonics
| [reply] [d/l] [select] |
Re: Splitting long file
by runrig (Abbot) on Apr 08, 2004 at 19:11 UTC
|
#!/usr/bin/awk -f
/^\$$/ {
    print > file
    close(file)
    file = ""
    next
}
{
    if (file == "") {
        file = $0
    }
    print > file
}
Then just run (assuming this is saved as split_big_file):

    split_big_file big_file

( Oh wait, ah, this isn't awkmonks? Sorry, wrong monastery... ) Well, errrm, just take the previously mentioned file and run:

    a2p split_big_file

to get the perl version :-)
Updated. Fixed close statement (put parens around 'file' arg). Changed 'continue' to 'next' (although the awk worked; it didn't complain anyway, it confused a2p).
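For anyone who doesn't want to run a2p, a hand-written Perl equivalent of that awk might look like this. (A sketch only; the function name is made up, and a2p's machine translation is considerably noisier.)

```perl
use strict;
use warnings;

# Mirror the awk above: each section is written to a file named by its
# first line, and the lone '$' terminator line is written out too.
sub split_big_file {
    my ($big) = @_;
    open my $in, '<', $big or die "open '$big': $!";
    my ($file, $out) = ('', undef);
    while (my $line = <$in>) {
        if ($line =~ /^\$$/) {           # lone '$' ends the section
            print {$out} $line if $out;  # the awk printed it as well
            close $out         if $out;
            ($file, $out) = ('', undef);
            next;
        }
        if ($file eq '') {               # first line names the file
            chomp($file = $line);
            open $out, '>', $file or die "open '$file': $!";
        }
        print {$out} $line;              # name line and data lines alike
    }
    close $in;
}
```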
| [reply] [d/l] [select] |
Re: Splitting long file
by eXile (Priest) on Apr 08, 2004 at 15:31 UTC
|
A potential problem with your code is that it leaves a whole bunch of file descriptors open (about 22,001) because you didn't close your output filehandles, while it only needs 2 open file descriptors at a time. Most (all?) OSs have a maximum number of file descriptors that can be open at a given time.
Hope this helps you avoid this next time. | [reply] [d/l] |
|
|
Perl is smart enough to close the old file when you open another file on the same filehandle, therefore this isn't a problem.
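That implicit close is easy to demonstrate (the filenames here are hypothetical):

```perl
use strict;
use warnings;

# Reopening the same handle implicitly flushes and closes the
# previous file, so a loop of open()s never accumulates descriptors.
open OUT, '>', 'first.txt'  or die $!;
print OUT "alpha\n";
open OUT, '>', 'second.txt' or die $!;   # first.txt is closed here
print OUT "beta\n";
close OUT or die $!;
```

Even though first.txt is never close()d explicitly, its buffers are flushed when the handle is reopened, so its contents are complete on disk.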
| [reply] |
|
|
Then I've learned something from this thread as well.
Thanks tilly!
| [reply] |