az1962 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to split a file into multiple smaller files based on a regex and a file naming convention. I am searching for lines that begin with the word zone, and want to use the second word in that line as a file name. I then want to write all following lines into that file, until I hit another line that starts with zone, whose second word becomes the next filename. Based on the data file below, it should create a file named 1file.txt, and a file named 2fileb, with 6 lines of data each.

#!/usr/bin/env perl
use strict;
use warnings;

open(my $infh, '<', 'testdat.dat') or die $!;
my $outfh;
my $filecount = 0;
while ( my $line = <$infh> ) {
    if ( $line =~ /^zone>\s*(\w+\W+(\w+)\s*$)/ ) {
        close($outfh) if $outfh;
        $outfh = $2;
        open($outfh, '>', sprintf($outfh'.txt', ++$filecount)) or die $!;
    }
    print {$outfh} $line or die "Failed to write to file: $!";
}
close($outfh);
close($infh);

Datafile =

zone 1filea 1ss
record1a
record1b
record1c
record 1d
2 record empty
endoffile
zone 2fileb 1ss
record1a
record1b
record1c
record 1d
2 record empty
endoffile

Replies are listed 'Best First'.
Re: Using the second word in a line to split a file into multiple files
by haukex (Archbishop) on Aug 26, 2019 at 07:27 UTC

    There are unfortunately several issues with your code:

    • You're using the second capture group as the filename, which according to the regex is actually the third word in the line. You probably meant something like /^zone\s+(\w+)\W+\w+\s*$/, and then you can use the capture variable $1.
    • As jcb mentioned, you've got a stray > in the regex.
    • sprintf($outfh'.txt',... is not valid syntax. Please make sure to always provide a runnable SSCCE when asking questions.
    • I don't understand why you have a $filecount variable, since according to your description, you want the filenames to be derived from just the section names.
    • Using the variable $outfh to hold both the filename and the filehandle is not good practice; you should use a separate variable to hold the filename.
    • You print to the output file unconditionally, even when it's not open yet. The same goes for the close.
    • You say you want the output files for this example input to contain six lines each, but you don't say which of the seven input lines.

    If I fix these issues (keeping $filecount), I get:

    use warnings;
    use strict;

    my $infn = 'testdat.dat';
    open(my $infh, '<', $infn) or die "$infn: $!";
    my $outfh;
    my $filecount = 0;
    while ( my $line = <$infh> ) {
        if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
            close $outfh if $outfh;
            my $outfn = sprintf '%s-%d.txt', $1, ++$filecount;
            open($outfh, '>', $outfn) or die "$outfn: $!";
        }
        if ($outfh) {
            print {$outfh} $line or die "print: $!";
        }
    }
    close($outfh) if $outfh;
    close($infh);

    However, this produces output files that are 7 lines or longer, because they include the endoffile marker and any lines following it. If you want the endoffile to be excluded from the output, the fix is fairly simple:

    if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
        ...
    }
    elsif ( $line =~ /^endoffile$/ ) {
        close $outfh;
        $outfh = undef;
    }

    If you want something more complex than this, then it'd be better to switch to a state machine type approach, which I showed some templates for at the top of this node.
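As an illustration of the state machine idea mentioned above, here is a minimal sketch. It is not one of the templates haukex refers to, and it makes some assumptions: the helper name split_zones is hypothetical, the sections are collected into a hash instead of being written to files (to keep the example self-contained), and a "zone" line inside an unterminated section is treated as an error.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines into sections keyed by the second word
# of each "zone" line. Returns a hash ref of name => [ lines ].
sub split_zones {
    my (@lines) = @_;
    my %sections;
    my $state = 'outside';    # either 'outside' or 'in_zone'
    my $name;
    for my $line (@lines) {
        if ( $state eq 'outside' ) {
            if ( $line =~ /^zone\s+(\w+)/ ) {
                $name            = $1;
                $state           = 'in_zone';
                $sections{$name} = [$line];    # keep the zone line itself
            }
            # any other line outside a section is ignored
        }
        elsif ( $state eq 'in_zone' ) {
            if ( $line =~ /^endoffile\s*$/ ) {
                $state = 'outside';            # end marker is not stored
            }
            elsif ( $line =~ /^zone\s/ ) {
                die "zone '$name' not terminated before: $line";
            }
            else {
                push @{ $sections{$name} }, $line;
            }
        }
    }
    return \%sections;
}

my $sections = split_zones(
    "zone 1filea 1ss\n", "record1a\n", "endoffile\n",
    "stray line\n",
    "zone 2fileb 1ss\n", "record1a\n", "record1b\n", "endoffile\n",
);
for my $section ( sort keys %$sections ) {
    printf "%s: %d lines\n", $section, scalar @{ $sections->{$section} };
}
```

Keeping the state in an explicit variable makes it easy to add further states later (for example a header block before the first zone) without nesting more conditions into a single regex test.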

      Hello, thank you so much for your help and guidance; I am very new to this forum. I was able to get this working, and you are correct: I had filecount in there from the code I was using to start out, and focused mainly on the regex problems. I was able to get it to do exactly what was needed on the test data I presented, then realized the reason it wasn't working on the actual data file I need to parse is that I have .s in the second word, and that second word always ends with a dot. This works for the original test data:
      use warnings;
      use strict;

      my $infn = '/Users/azeller/Documents/Rogers_import/20190822_RR_export-nrcmd.txt';
      open(my $infh, '<', $infn) or die "$infn: $!";
      my $outfh;
      my $filecount = 0;
      while ( my $line = <$infh> ) {
          if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
              close $outfh if $outfh;
              my $outfn = sprintf '%sdb', $1;
              open($outfh, '>', $outfn) or die "$outfn: $!";
          }
          if ($outfh) {
              print {$outfh} $line or die "print: $!";
          }
      }
      close($outfh) if $outfh;
      close($infh);
      But it doesn't work for the actual data I am parsing. My data file format is actually more like the following:
      one 1file1.nest. 1ss
      record1a
      record1b
      record1c
      record 1d
      2 record empty
      endoffile
      zone 2file2.egg. 1ss
      record1a
      record1b
      record1c
      record 1d
      2 record empty
      endoffile

        Please use <code> tags to format your code and sample input and output.

        I have .s in the second word, and that second word always ends with a dot.

        Now might be a good time to look at perlretut, as jcb suggested, or perhaps perlrequick. The \w+ will only match Word characters (normally [a-zA-Z0-9_] plus Unicode "Word" characters), but not including the dot. Perhaps you want to say "word characters plus dot", i.e. [\w.]+, or simply "any non-whitespace characters", i.e. \S+.

        Update: Edited first sentence that was accidentally cut off.
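To make the difference between \w+, [\w.]+ and \S+ concrete, here is a small sketch that tries each of them on a zone line from the second data sample. The helper zone_name is hypothetical and exists only for this comparison; the rest of the pattern (one further non-whitespace word, then optional trailing whitespace) is an assumption based on the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the second word of a "zone" line, using the
# given pattern for that word, or undef if the line does not match.
sub zone_name {
    my ( $line, $word_re ) = @_;
    return $line =~ /^zone\s+($word_re)\s+\S+\s*$/ ? $1 : undef;
}

my $sample = "zone 2file2.egg. 1ss";
for my $re ( qr/\w+/, qr/[\w.]+/, qr/\S+/ ) {
    my $name = zone_name( $sample, $re );
    print "$re => ", ( defined $name ? "'$name'" : 'no match' ), "\n";
}
```

With \w+ the line does not match at all, because the dot after 2file2 is neither a word character nor whitespace. Both [\w.]+ and \S+ capture 2file2.egg. including the trailing dot; \S+ is the more permissive choice, since it would also accept characters such as hyphens.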

Re: Using the second word in a line to split a file into multiple files
by jcb (Parson) on Aug 26, 2019 at 03:04 UTC

    You have a stray character in your regex, and your capture groups are not what you seem to think they are. Both of these are very basic and you should be able to find the solutions quickly once you find the problems. Try reading perlretut and perlre in the meantime for further inspiration.
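For readers who want to see the capture group problem directly, the following sketch compares the original pattern (with the stray > removed so that it can match at all) against the corrected pattern from haukex's reply. The sub names captures_original and second_word are hypothetical and used only for this demonstration. Groups are numbered by the position of their opening parenthesis, so the outer group is $1 and the nested one is $2.

```perl
use strict;
use warnings;

# Original pattern minus the stray '>': returns the contents of $1 and $2,
# or an empty list if the line does not match.
sub captures_original {
    my ($line) = @_;
    return $line =~ /^zone\s*(\w+\W+(\w+)\s*$)/ ? ( $1, $2 ) : ();
}

# Corrected pattern: a single group around the second word only.
sub second_word {
    my ($line) = @_;
    return $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ? $1 : undef;
}

my $zone_line = "zone 1filea 1ss";
my ( $g1, $g2 ) = captures_original($zone_line);
print "original \$1: '$g1'\n";    # everything after "zone "
print "original \$2: '$g2'\n";    # the third word of the line
print "corrected \$1: '", second_word($zone_line), "'\n";
```

In the original pattern $2 holds 1ss, the third word of the line, which is why the output file name was not the one expected.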

Re: Using the second word in a line to split a file into multiple files
by jwkrahn (Abbot) on Aug 26, 2019 at 22:25 UTC

    This appears to work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $infh, '<', 'testdat.dat' or die "Cannot open 'testdat.dat' because: $!";

    while ( my $line = <$infh> ) {
        if ( $line =~ /^zone\s*(\S+)\s+\S+\s*$/ ) {
            open my $outfh, '>', $1 or die "Cannot open '$1' because: $!";
            select $outfh;
        }
        print $line or die "Failed to write to file: $!";
    }

    close $infh;
Re: Using the second word in a line to split a file into multiple files
by Marshall (Canon) on Aug 28, 2019 at 20:13 UTC
    I will demo a "classic", non-Perl-specific way to do this.

    You actually have the easiest record parsing case...there is an easily identifiable line that starts a new record and an easily identifiable line that signals the end of the record. This is easier than situations where there is no <EOR> "end of record" and the end of a record is signaled by the start of a new different record.

    I used the Perl DATA file handle below, but made a my $fh variable out of it so that filehandle could be passed to a subroutine.
    In this algorithm pattern, you loop until you see a start-of-record, then call a subroutine to process that record. The sub returns to the caller and the search for another start-of-record commences.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # create a "my $fh data handle" from DATA
    # in your code, you would open my $fh to an actual
    # input file
    my $fh = \*DATA;

    while (my $line = <$fh>)
    {
        process_record ($fh, $line) if ($line =~ /^zone /);
    }

    sub process_record
    {
        my ($fh, $line) = @_;
        my ($filename) = $line =~ /^zone\s+(\w+)/;
        # open a new output file handle to $filename
        print "would print to a file called $filename\n";

        my $record_line;
        while (defined($record_line = <$fh>) and $record_line !~ /endoffile/)
        {
            print " $record_line";   #you print to outfile
                                     #instead of to STDOUT
        }
        # optional close of the output file
        # closes automatically when its $outfh goes out of scope
    }

    =prints
    would print to a file called 2file2
     record1a
     record1b
     record1c
     record 1d
     2 record empty
    =cut

    __DATA__
    #comment line => perhaps a version number?
    one 1file1.nest. 1ss
    record1a
    record1b
    record1c
    record 1d
    2 record empty
    endoffile
    zone 2file2.egg. 1ss
    record1a
    record1b
    record1c
    record 1d
    2 record empty
    endoffile