splitting a large text file and output

research_guy has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that should be split into around 1000 files. I need to split it every 12 lines, or at each repeated header, and then output each section to its own .txt file.

The txt file looks like:

VECT
1 1 1
2 2 2
3 3 3

VECT
4 4 4
5 5 5
6 6 6

etc...

Each section of text that I need to split off is 12 lines long; alternatively, I could split based on the name of the repeated header. After the sections are split, I need each one written to its own txt file with a unique number (i.e. file1, file2, file3, etc.). Also, I need the header included in each of the created text files, and I would prefer to keep the original file intact and not alter it. I've tried reading it in and setting up for loops and such, but I cannot get anything that works. Your help is much appreciated!

Here is some of the code that I have tried. This one splits up the file, but it spreads across 3 files what I want in just one file, so I get 3,000 files instead of 1,000 files. Also, the program didn't write a file for the last section of information. Any ideas?

#!/usr/bin/perl
use strict;
use warnings;

my $infile = 'roegen6.vect';
my $count = 1;
my $outfile = "$infile-section_$count.vect";
my @arr;

sub create_file {
open(OUT,">$outfile") or die "Error with outfile: $!\n"; 
print OUT @arr;
close(OUT);
@arr=();
$count++;
$outfile="$infile-section_$count.vect";
}

open(IN,$infile) or die "Error with infile $infile: $!\n";
my @data=<IN>;
close(IN);

 foreach my $line (@data) {
  chomp($line);

  if ($line =~ /VECT/) {
      push (@arr, "$line\n");
      next;
  }
  elsif ($line != /\s/) {
      push (@arr, "$line\n");
      next;
  }
  else {
  push (@arr, "$line\n");
  create_file();
  }
 }

Re: splitting a large text file and output by kennethk (Abbot) on Jun 10, 2011 at 14:57 UTC
You say "I've tried reading it in and setting up for loops and such, but I cannot get anything that works." What have you tried? What didn't work? What errors did you get? How do you know it didn't work? Post some code (wrapped in `<code>` tags), so we can help guide you to a working tool. This is not a code writing service.

Update: Now that you have updated your post with code, I can comment. First, the posted code with the posted input file yields the warnings:

Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "1 1 1" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "2 2 2" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "3 3 3" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "4 4 4" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "5 5 5" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.
Use of uninitialized value $_ in pattern match (m//) at fluff.pl line 26, <IN> line 9.
Argument "6 6 6" isn't numeric in numeric ne (!=) at fluff.pl line 26, <IN> line 9.

This is because you've used numeric unequal (`!=`, Equality Operators) in place of the negative binding operator (`!~`, Binding Operators). This is problematic because without binding, the regular expression is tested against your uninitialized magic variable $_.
What you actually meant is not that the line doesn't contain any whitespace, but rather that the line contains a character that is not whitespace. You can achieve this using the `\S` character class, so the block becomes:

elsif ($line =~ /\S/) {
    push (@arr, "$line\n");
    next;
}

If we run this, we get your intended output, but, as you say, are missing one output file. This can be resolved by adding a final call to your `create_file` sub, so the final, functional version would be:

#!/usr/bin/perl
use strict;
use warnings;

my $infile = 'roegen6.vect';
my $count = 1;
my $outfile = "$infile-section_$count.vect";
my @arr;

sub create_file {
    open(OUT,">$outfile") or die "Error with outfile: $!\n";
    print OUT @arr;
    close(OUT);
    @arr=();
    $count++;
    $outfile="$infile-section_$count.vect";
}

open(IN,$infile) or die "Error with infile $infile: $!\n";
my @data=<IN>;
close(IN);

foreach my $line (@data) {
    chomp($line);

    if ($line =~ /VECT/) {
        push (@arr, "$line\n");
        next;
    }
    elsif ($line =~ /\S/) {
        push (@arr, "$line\n");
        next;
    }
    else {
        push (@arr, "$line\n");
        create_file();
    }
}
create_file if @arr;

Note I've put a conditional on the final output, so it will only write if your buffer has content. Not quite how I would have written it from scratch, but it works.
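The difference between `!~ /\s/` and `=~ /\S/` is easy to check in isolation. The following is a minimal sketch; the helper names `has_content` and `has_no_whitespace` and the sample lines are made up for illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line holds at least one
# non-whitespace character (the test the loop actually needs).
sub has_content {
    my ($line) = @_;
    return $line =~ /\S/ ? 1 : 0;
}

# Hypothetical helper: the literal meaning of "$line !~ /\s/",
# i.e. true if the line contains no whitespace character at all.
sub has_no_whitespace {
    my ($line) = @_;
    return $line !~ /\s/ ? 1 : 0;
}

# Header, data line, empty line, whitespace-only line.
for my $line ('VECT', '1 1 1', '', '   ') {
    printf "%-7s has_content=%d has_no_whitespace=%d\n",
        "'$line'", has_content($line), has_no_whitespace($line);
}
```

For a data line such as `1 1 1`, `has_no_whitespace` is false, and for the empty line it is true, which is the opposite of what the loop needs. That is why simply swapping `!=` for `!~` would not be enough here.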
Re^2: splitting a large text file and output by Gulliver (Monk) on Jun 10, 2011 at 16:35 UTC
The original post has been updated with some code, but not enough of the input file was provided and kennethk's other questions weren't answered. The description is confusing because you don't tell us things like what defines a header or what kind of data comes after the header. Are the headers always all text? Or are there numbers or punctuation? Is the data always just numbers? If you can explain the problem in English, then you are halfway to writing the code.

My guess as to why it won't output the last section is that your existing code only writes output when it sees a blank line. If the input file ends without a blank line, the last section is never written.
Re: splitting a large text file and output by davido (Cardinal) on Jun 10, 2011 at 16:33 UTC
This should be readily adaptable to your schema of variable names and filenames. Please note, it's not valid for interactive input, so don't use it with `<>`, just to be safe. If you needed to take input from `<>` you should remove the `eof()` test, and instead print to one last file after the while loop expires. If you end up doing that, put your output code into a separate subroutine to avoid code duplication.

use feature 'say';    # say is not enabled by default

my @output;
my $outfile_name = "Outfile0000";
while ( <DATA> ) {
    chomp;
    push @output, $_;
    # Write the buffer after a header line, after every 12th input line, or at end of input.
    if( /VECT/ or $. % 12 == 0 or eof(DATA) ) {
        open my $out_fh, '>', $outfile_name++ or die $!;
        say $out_fh $_ for @output;
        @output = ();
        close $out_fh or die $!;
    }
}

Dave
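The variant described above (no `eof()` test, output code in its own subroutine, one last write after the loop) might look like the sketch below. It makes choices of its own that are not in the reply: the names `write_section` and `split_sections` are hypothetical, a new file is started when a header is seen rather than after it, the 12-line count is dropped on the assumption that every section begins with a `VECT` line, and the demo writes into a temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write the buffered lines to $path, then empty the buffer.
sub write_section {
    my ($path, $lines) = @_;
    open my $out_fh, '>', $path or die "Cannot write $path: $!";
    print {$out_fh} "$_\n" for @$lines;
    close $out_fh or die "Cannot close $path: $!";
    @$lines = ();
}

# Read sections from $in_fh and write each one to its own file in $dir.
# Returns the list of files written.
sub split_sections {
    my ($in_fh, $dir) = @_;
    my $name = 'Outfile0000';    # string increment yields Outfile0001, ...
    my (@output, @written);
    while ( my $line = <$in_fh> ) {
        chomp $line;
        # A header starts a new section, so write out the previous one first.
        if ( $line =~ /VECT/ and @output ) {
            push @written, "$dir/" . $name++;
            write_section($written[-1], \@output);
        }
        push @output, $line;
    }
    # No eof() test: the last section is written once the loop has finished.
    if (@output) {
        push @written, "$dir/" . $name++;
        write_section($written[-1], \@output);
    }
    return @written;
}

# Demo on an in-memory file handle; pass \*STDIN for real input.
my $sample = "VECT\n1 1 1\n2 2 2\n\nVECT\n4 4 4\n";
open my $in_fh, '<', \$sample or die "Cannot read sample: $!";
my @files = split_sections($in_fh, tempdir(CLEANUP => 1));
print "$_\n" for @files;
```

Because the buffer is only written when the next header arrives or the input ends, the same code path handles the last section whether or not the file ends with a blank line.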
Re: splitting a large text file and output by johngg (Canon) on Jun 10, 2011 at 17:56 UTC
If you want to process an input file section by section, with sections separated by blank lines, you could consider reading the file in paragraph mode. You can do this by changing the default input record separator to an empty string as shown here.

knoppix@Microknoppix:~$ perl -E '
> open my $inFH, q{<}, \ <<EOD or die $!;
> VECT
> 111
> 222
> 333
>
> VECT
> 444
> 555
> 666
> 777
>
> VECT
> 888
> 999
> EOD
>
> {
>     local $/ = q{};
>     while ( <$inFH> )
>     {
>         print;
>         say q{=} x 20;
>     }
> }'
VECT
111
222
333

====================
VECT
444
555
666
777

====================
VECT
888
999
====================
knoppix@Microknoppix:~$

I hope this is helpful.

Cheers,
JohnGG
Re^2: splitting a large text file and output by 7stud (Deacon) on Jun 10, 2011 at 18:10 UTC
lol. We both even used x 20!
Re: splitting a large text file and output by 7stud (Deacon) on Jun 10, 2011 at 18:00 UTC
use strict;
use warnings;
use 5.010;

my $str = <<"ENDOFDATA";
VECT
1 1 1
2 2 2
3 3 3

VECT
4 4 4
5 5 5
6 6 6
ENDOFDATA

#open() your original file here:
open my $INFILE, '<', \$str
    or die "Couldn't read from string: $!";

my $section_counter = 1;

{
    local $/ = "";  #tell perl that the end of a line will
                    #be reached when perl encounters any
                    #number of consecutive blank lines

    while (my $section = <$INFILE>){
        chomp($section);
        my $fname = "file$section_counter";
        say $fname;
        say $section;   #or write to a file
        say '-' x 20;
        $section_counter++;
    }
}

close $INFILE;

--output:--
file1
VECT
1 1 1
2 2 2
3 3 3
--------------------
file2
VECT
4 4 4
5 5 5
6 6 6
--------------------
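The "or write to a file" step can be filled in along the following lines. This is only a sketch: the sub name `write_paragraphs` is hypothetical, and the demo reads from an in-memory string and writes into a temporary directory instead of the real input file and working directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split the text read from $in_fh on blank lines and write each section
# to $dir/file1, $dir/file2, ... Returns the number of files written.
sub write_paragraphs {
    my ($in_fh, $dir) = @_;
    my $section_counter = 0;
    local $/ = "";                  # paragraph mode
    while ( my $section = <$in_fh> ) {
        chomp $section;             # in paragraph mode, removes all trailing newlines
        $section_counter++;
        my $fname = "$dir/file$section_counter";
        open my $out_fh, '>', $fname or die "Cannot write $fname: $!";
        print {$out_fh} "$section\n";
        close $out_fh or die "Cannot close $fname: $!";
    }
    return $section_counter;
}

# Demo input with the same shape as the sample in the question.
my $str = "VECT\n1 1 1\n2 2 2\n3 3 3\n\nVECT\n4 4 4\n5 5 5\n6 6 6\n";
open my $in_fh, '<', \$str or die "Couldn't read from string: $!";
my $dir   = tempdir(CLEANUP => 1);
my $count = write_paragraphs($in_fh, $dir);
print "wrote $count files to $dir\n";
```

Since the input handle is only ever opened for reading, the original file stays untouched, and the header is kept because it is simply the first line of each paragraph.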
Re^2: splitting a large text file and output by johngg (Canon) on Jun 10, 2011 at 22:22 UTC
Uncanny! :-D

Cheers,
JohnGG
Re: splitting a large text file and output by Khen1950fx (Canon) on Jun 10, 2011 at 17:05 UTC
I used File::Split. In order to use File::Split, you'll need to download and manually install Array::Dissect from the BackPAN. It keeps the source, and it creates a file for each 12 lines.

#!/usr/bin/perl
use strict;
use warnings;
use File::Split;
use Data::Dumper::Simple;

my $infile = 'roegen6.vect';
my $outfile = "$infile-section.vect";
open IN, '<', $infile or die $!;
open OUT, '>', $outfile or die $!;

my $fs = File::Split->new({keepSource => 1});
$fs->split_file({'lines' => 12}, $infile);
print Dumper($fs);
close IN;
close OUT;
Re: splitting a large text file and output by Marshall (Canon) on Jun 11, 2011 at 15:09 UTC
Adding yet another solution to the mix. I assumed that new files start with VECT and that no blank lines go into output files. If a file handle like OUT below is open to one file and an open is issued to a new file, that causes an automatic close of the previous file. So the code is actually very simple. The code of course assumes that the first non-blank line is a VECT line.

#!/usr/bin/perl -w
use strict;

my $num =1;
while (<DATA>)
{
    next if /^\s*$/; #skip blank lines
    if (/^VECT/)
    {
        open OUT, '>', "file$num" or die "unable to open file$num $!\n";
        $num++;
    }
    print OUT $_;
}
close OUT; #to finish the very last file in a clean way

__DATA__
VECT
1 1 1
2 2 2
3 3 3

VECT
4 4 4
5 5 5
6 6 6
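The implicit close on re-open can be observed directly. In this small sketch (file names `first` and `second` are made up, and a temporary directory stands in for the working directory), the first file already has its content on disk right after the second `open`, although `close` was never called on it.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

open OUT, '>', "$dir/first" or die "unable to open first: $!";
print OUT "VECT\n";                  # may still sit in Perl's output buffer

# Re-opening the same handle closes, and therefore flushes, the first file.
open OUT, '>', "$dir/second" or die "unable to open second: $!";

my $size_of_first = -s "$dir/first"; # size on disk after the implicit close
print "first holds $size_of_first bytes\n";

close OUT;                           # explicit close for the last file only
```

The one thing lost with the implicit close is its return value, so a write error on the previous file would go unnoticed; an explicit `close OUT or die` before each new `open` is the more defensive choice.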