insta.gator has asked for the wisdom of the Perl Monks concerning the following question:

Hi All. Junior Perl coder here. I am trying to write a script to split a large file of possibly millions of similar records into multiple smaller files. I have it for the most part, but there is one caveat: I cannot split the same record type across files. So for example, if I have 1000 records of type "A" appearing sequentially in the large file, and I reach my defined "smaller" file size while in the middle of said "A" records, I need to continue adding the "A"s to the smaller file until I reach the next record type in the large file, at which time I would want to start the next "small" file. I am having trouble with that part of the script. I cannot figure out how to make the record pointer advance to the next record in the "big" file so that I can "step" through record by record until I find the next record type, which will be my trigger to start the next "small" file. Following is my code. Any help would be greatly appreciated.

foreach $filename (@prfiles) {
    chomp $filename;
    $total_recs=0;
    $counter_recs=0;
    $previous_rec_ssn=0;
    $file_size_met="N";
    open (INFILE1, '<', "$filename") or print "Cannot open $filename.";
    print "Now processing: $filename\n";
    while (<INFILE1>) {
        if ($file_size_met ne "Y") {
            $file_count=1;
            $mod_filename="$filename\.$file_count" ;
            print "Writing output to: $mod_filename\n\n";
            open (OUTFILE1, ">>" . "$mod_filename") or exit(201);
            while ($counter_recs < 50000) {
                print OUTFILE1 $_;
                $total_recs=($total_recs + 1);
                $counter_recs=($counter_recs + 1);
            }
            print "$total_recs records have been processed\n";
            $counter_recs=0;
        }
        $actual_size = (stat($mod_filename))[7];
        if ($actual_size >= $outsize){$file_size_met="Y"};
        print "Current file size is $actual_size bytes.\n";
        # ***** THIS IS WHERE I AM GOING OFF THE RAILS *****
        if (($actual_size == $outsize) or ($actual_size > $outsize) and ($file_size_met eq "Y")) {
            @line_contents = split (/\|/,$_);
            $record_ssn=($line_contents[6]);
            print "current SSN is $record_ssn.\n";
            if ($previous_rec_ssn == $record_ssn) {
                print OUTFILE1 $_;
                $previous_rec_ssn=$record_ssn;
                print "previous SSN is $previous_rec_ssn.\n";
                print "current SSN is $record_ssn.\n";
            }
            next;
        }
        next;
    }
    close INFILE1;
    close OUTFILE1;
    print "\n$total_recs records were processed for $filename\n\n";
}

Replies are listed 'Best First'.
Re: Large FIle Splitter
by 1nickt (Canon) on Mar 01, 2016 at 16:58 UTC

    Hi insta.gator,

    Your code has lots of unnecessary cruft that makes it hard to read, a lot of it having to do with quotation ... I suggest a closer read through some introductory material such as perlsyn (Perl Syntax).

    I cleaned up your code down to your problem line. It's below for your review. Untested and incomplete but should give you some ideas. Some things I did include:

    • removing the chomp line (should do that when you build the array)
    • using strict and declaring your variables in scope
    • letting $file_size_met be a numeric value so 0 is false and anything above that is true
    • prefix increment operator
    • opening your filehandle with a lexically-scoped variable
    • and maybe some other stuff, take a look.
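A few of the points in the list above can be sketched in isolation. This is only an illustration: the in-memory listing and the file names are made up for the example, and the real @prfiles may be built quite differently.

```perl
use strict;
use warnings;

# Stand-in for wherever the file names really come from (a listing file,
# the output of a command, ...). Here it is read from an in-memory string.
my $listing = "payroll_a.txt\npayroll_b.txt\n";
open my $list_fh, '<', \$listing or die "Cannot read listing: $!";

# chomp the whole list once, while the array is being built,
# instead of chomping each name inside the processing loop.
chomp( my @prfiles = <$list_fh> );
close $list_fh;

my $file_count    = 0;   # plain counter
my $file_size_met = 0;   # numeric flag: 0 is false, anything above is true

for my $filename (@prfiles) {
    ++$file_count;                      # prefix increment
    print "$file_count: $filename\n";
    print "size limit not reached yet\n" if not $file_size_met;
}
```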

    In doing so I can see at least one issue, which is that you are opening your OUTFILE1 each time through your while loop. I still can't see exactly what your problem is, but it looks like you want to stop writing to the file once it has reached a certain size, but not split a line. You think you're going to split a line, so you want to back up and start the next file from the start of the broken line. Right? Anyway, it's a lot like an XY Problem, and there are lots of ready-made solutions (check CPAN) that will allow you to skip writing all that code yourself.

    foreach my $filename (@prfiles) {
        # $mod_filename is declared here so it is still in scope for stat() below
        my ( $total_recs, $counter_recs, $previous_rec_ssn, $file_count, $file_size_met, $mod_filename );
        open ( my $INFILE1, '<', $filename ) or die "Cannot open $filename: $!";
        print "Now processing: $filename\n";
        while ( <$INFILE1> ) {
            if ( not $file_size_met ) {
                ++$file_count;
                $mod_filename = $filename . $file_count;
                print "Writing output to: $mod_filename\n\n";
                open ( my $OUTFILE1, '>>', $mod_filename ) or die( 201 );
                while ( $counter_recs < 50000 ) {
                    print $OUTFILE1 $_;
                    ++$total_recs;
                    ++$counter_recs;
                }
                print "$total_recs records have been processed\n";
                $counter_recs=0;
            }
            my $actual_size = (stat($mod_filename))[7];
            ++$file_size_met if $actual_size >= $outsize;
            print "Current file size is $actual_size bytes.\n";
            # (cleanup stops here, at the problem line)

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Large File Splitter
by Anonymous Monk on Mar 01, 2016 at 16:42 UTC

    It may be easier to think about if you turn the logic around: first, write some code that identifies when the record type changes. Then, when that happens, and your output file has reached the maximum size, you close your current output file and open the new one.

    Also, don't use stat to check the current file size: you don't need to go to the file system to find that out. Use tell on the output filehandle instead.
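To make the difference concrete, here is a small sketch that compares the two. It assumes a scratch file from File::Temp and three made-up 11-byte records; neither is part of the original problem. tell reports the handle's position, including data still sitting in Perl's output buffer, while stat asks the file system and may not reflect buffered writes yet.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file just for this demonstration (removed automatically).
my ($fh, $name) = tempfile(UNLINK => 1);

print {$fh} "0123456789\n" for 1 .. 3;   # three records of 11 bytes each

# tell() counts everything printed so far, even data still in Perl's buffer.
my $written = tell($fh);

# stat() asks the file system, so it can lag behind until the buffer is flushed.
my $on_disk = (stat $name)[7];

print "tell: $written bytes, file system: $on_disk bytes\n";

close $fh or die "Cannot close $name: $!";
```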

    use warnings;
    use strict;

    # this just generates an example input file, not part of the solution
    open my $tfh, '>', 'input.txt' or die $!;
    print $tfh "$_\n" x (rand(3)+1) for 'A'..'Z';
    close $tfh;

    # now split the input file
    my $MAXSIZE = 20;
    open my $ifh, '<', 'input.txt' or die $!;
    my $outcount=0;
    open my $ofh, '>', 'out000.txt' or die $!;
    my $prev;
    while (<$ifh>) {
        chomp;
        if (defined $prev && $prev ne $_) {
            print STDERR "Record type switched from $prev to $_, output size is ".tell($ofh)."\n";
            if (tell($ofh)>$MAXSIZE) {
                close $ofh;
                my $ofn = sprintf 'out%03d.txt', ++$outcount;
                print STDERR "Opening new output file $ofn\n";
                open $ofh, '>', $ofn or die $!;
            }
        }
        print $ofh $_, "\n";
        $prev = $_;
    }
    close $ofh;
    close $ifh;
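The same idea could be adapted to the pipe-delimited records in the original question, where the record type appears to be the seventh field ($line_contents[6]). The following is a rough sketch under that assumption. The sub name split_by_key, its parameters, and the numbered output naming are all hypothetical, not something from the thread. It compares keys as strings (ne) rather than numerically, which seems safer for values like SSNs that may have leading zeros.

```perl
use strict;
use warnings;

# Hypothetical helper: split $in_name into numbered pieces, starting a new
# piece only when the key field changes AND the current piece is big enough.
sub split_by_key {
    my ($in_name, $max_bytes, $key_index) = @_;

    open my $in, '<', $in_name or die "Cannot open $in_name: $!";

    my @pieces;              # names of the output files created so far
    my ($out, $prev_key);

    while (my $line = <$in>) {
        # Extract the grouping key; the -1 keeps trailing empty fields.
        (my $record = $line) =~ s/\r?\n\z//;
        my $key = (split /\|/, $record, -1)[$key_index];
        $key = '' unless defined $key;

        my $key_changed = defined $prev_key && $key ne $prev_key;

        # Roll over only on a key boundary, never in the middle of a group.
        if (!$out || ($key_changed && tell($out) >= $max_bytes)) {
            if ($out) { close $out or die "Cannot close output: $!"; }
            my $name = sprintf '%s.%03d', $in_name, scalar(@pieces) + 1;
            open $out, '>', $name or die "Cannot open $name: $!";
            push @pieces, $name;
        }

        print {$out} $line;
        $prev_key = $key;
    }

    close $out if $out;
    close $in;
    return @pieces;
}
```

A call might look like my @pieces = split_by_key('big_file.txt', 50_000_000, 6); where the file name and size limit are placeholders. Because the size check only happens when the key changes, a piece can grow past the limit until the current group ends, which is the behavior asked for in the question.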
Re: Large FIle Splitter
by insta.gator (Novice) on Mar 02, 2016 at 13:39 UTC

    Thanks to both 1nickt and Anonymous Monk for the help. Yes, I am sure that my code is ugly and redundant, and I am sure that there are better ways to do what I am trying to do. I am still learning. Self-taught using a book. And since I don't do this for a living, I don't get a lot of practice. Thanks again. Your answers were spot-on. -Gator