File Processing...

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

Let me say first that I've visited this site for a long time: you guys are a great font of knowledge for a n00b programmer like myself.

And with that...

I am working on a program at my internship for processing some huge text files (about 1.5 GB). The goal is to split the file into 200 MB chunks for use in other programs.

Searching Google, we found some code that works well for splitting files ( "http://search.cpan.org/src/CWEST/ppt-0.14/bin/split" )

My question comes at the part...

while (read (INFILE, $chunk, $count) == $count) {
    $fh = nextfile ($prefix);
    print $fh $chunk;
    }

I want to edit this so that, when it hits the chunk size, it keeps going UNTIL it hits a certain string, and only then starts creating the next chunk.

My confusion is about how that while loop's condition works. My experience with Perl ended before I ever saw something like this.

I would be grateful to be pointed to some resource or something to look up to set me in the right direction. As, well, I really would like to know what the heck I'm doing...heh...

Thanks,

Mertz


Replies are listed 'Best First'.
Re: File Processing... by GrandFather (Saint) on Mar 20, 2006 at 00:39 UTC
Here's a start point:

use warnings;
use strict;

my $chunk;
my $chunkSize = 40;
my $matchText = 'I';
my $partCount = 1;
my $fileSize = 0;

while (my $newChunkSize = read (DATA, $chunk, $chunkSize)) {
    next if $fileSize + $newChunkSize < $chunkSize; # not a complete chunk

    # Have at least $chunkSize - look for a stop point
    my $breakPoint = index $chunk, $matchText;
    if ($breakPoint == -1) {
        $fileSize += $newChunkSize;
        next; # Can't change file yet
    }

    # Found a stop point, all change
    print substr $chunk, 0, $breakPoint, '';
    # close current output file here and create the new file
    ++$partCount;
    print "\n\n***Start of new file for part $partCount\n";
    $fileSize = $newChunkSize - $breakPoint;
} continue {
    print $chunk;
}

__DATA__

DWIM is Perl's answer to Gödel
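On the loop condition asked about in the question: `read` returns the number of bytes it actually read, 0 at end of file, and undef on error. So `read (INFILE, $chunk, $count) == $count` is true only while a full chunk came back, and the loop body does not run for a final, shorter chunk, which has to be handled separately. A minimal sketch of that behaviour, using an in-memory filehandle and a made-up `$count` of 4 so that it runs without a real file:

```perl
use strict;
use warnings;

# In-memory "file" of 10 bytes, so the example needs no file on disk.
my $data = "ABCDEFGHIJ";
open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";

my $count = 4;    # requested bytes per read (hypothetical value)
my @sizes;        # sizes actually returned by read, kept for inspection
my $chunk;
while (my $got = read($in, $chunk, $count)) {
    # $got equals $count for a full chunk and is smaller for the last,
    # short chunk. At end of file read returns 0 and the loop stops.
    push @sizes, $got;
    print "read $got bytes: $chunk\n";
}
close($in);
```

Testing the return value for truth, as here, processes the short final chunk inside the loop; comparing it with `$count` stops one read earlier.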
Re: File Processing... by ayrnieu (Beadle) on Mar 20, 2006 at 00:08 UTC
Use `Tie::File`; I have seen it quite capable of dealing with very large files given very little wiggle room. Oh, wait, 200 MB? Never mind. It doesn't seem that you need exactly 200 MB, as you want to know how to conditionally go over 200 MB; also, you are dealing with text files... is it possible that you actually want to split the file by your 'certain string'?
Re^2: File Processing... by Anonymous Monk on Mar 20, 2006 at 01:55 UTC
Yeah, it will be +/- 200 MB. I asked them this too, if I could just split the file by strings...that'd be easy.... Since I don't know what the data is for or from (they don't tell me, since I am not a permanent employee), they just said that these strings are too frequent in the file. They're something like START_HTML and END_HTML, from what I've heard. Maybe it's a web crawler? But they said that splitting based on this would make thousands and thousands of files. So I can't.
Re: File Processing... by TedPride (Priest) on Mar 20, 2006 at 06:29 UTC
It's probably best to work with smaller chunks than 200 MB. There's no reason why you can't convert the files using something like:

use strict;
use warnings;

my ($fsize, $csize, $file, $fcount, $read, $got, $in, $out, $chunk, $rest);
$fsize  = 1024 * 1024 * 200;   # Size of file you want
$csize  = 1024 * 1024;         # Size of chunks
$/      = "\n";                # String to stop at
$file   = "test.dat";
$fcount = 1;                   # Number for file names
$read   = 0;

$out = nextfile($file, $fcount);
open($in, '<', $file) or die "Can't read $file: $!";
while ($got = read($in, $chunk, $csize)) {   # Also writes the short final chunk
    print $out $chunk;
    $read += $got;
    if ($read >= $fsize) {
        $rest = <$in>;                       # Read on to the stop string
        print $out $rest if defined $rest;
        $read = 0;
        close($out);
        $out = nextfile($file, $fcount);
    }
}
close($in);
close($out);

# Opens the next output file (test1.dat, test2.dat, ...) and increments
# the caller's counter through the @_ alias.
sub nextfile {
    my ($name, $handle) = $_[0];
    $name =~ s/\./$_[1]./;
    $_[1]++;
    open($handle, '>', $name) or die "Can't write $name: $!";
    return $handle;
}

Note that this version includes the stop string in the previous file. If you want it included in the next file:

if ($read >= $fsize) {
    $rest = <$in>;
    my $found = defined $rest && chomp($rest);   # True if the stop string was read
    print $out $rest if defined $rest;
    $read = 0;
    close($out);
    $out = nextfile($file, $fcount);
    print $out $/ if $found;
}
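Since the stop string is said to occur very often, another way to get "about 200 MB, then on to the next stop string" is to let `$/` do the work: read one record at a time, and only start a new part once the current one has reached the size limit. The following is a rough sketch under assumptions, not a drop-in replacement: `split_at_marker` is a made-up name, the marker `END_HTML` and the 12-byte limit are illustrative, and the parts are collected in memory only so the example runs without files. A real script would print each record to the current output file and count bytes instead.

```perl
use strict;
use warnings;

# Group the records read from $in into parts of roughly $limit bytes,
# never breaking inside a record. $marker is the record terminator.
# Returns the parts as strings (illustration only; for big files write
# each record to an output file rather than keeping it in memory).
sub split_at_marker {
    my ($in, $marker, $limit) = @_;
    local $/ = $marker;    # <$in> now reads up to and including $marker
    my @parts = ('');
    while (defined(my $record = <$in>)) {
        # Start a new part once the current one has reached the limit.
        push @parts, '' if length($parts[-1]) >= $limit;
        $parts[-1] .= $record;
    }
    return @parts;
}

# Small in-memory input standing in for the real 1.5 GB file.
my $text = "aaEND_HTMLbbbEND_HTMLcEND_HTMLdd";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
my @parts = split_at_marker($in, 'END_HTML', 12);
close($in);

print "part $_: $parts[$_ - 1]\n" for 1 .. @parts;
```

Each part ends on a marker (except possibly the last one, which takes whatever trails the final marker), and a part can overshoot the limit by at most one record.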