Re: Using the second word in a line to split a file into multiple files

There are unfortunately several issues with your code:

You're using the second capture group as the filename, which according to the regex is actually the third word in the line. You probably meant something like /^zone\s+(\w+)\W+\w+\s*$/, and then you can use the capture variable $1.
As jcb mentioned, you've got a stray > in the regex.
sprintf($outfh'.txt',... is not valid syntax. Please make sure to always provide a runnable SSCCE when asking questions.
I don't understand why you have a $filecount variable, since according to your description, you want the filenames to be derived from just the section names.
Using the variable $outfh to hold both the filename and the filehandle is not good practice; use a separate variable to hold the filename.
You print to the output file unconditionally, even when it's not open yet. The same goes for the close.
You say you want the output files for this example input to contain six lines each, but you don't say which six of the seven input lines those should be.
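To illustrate the first point about capture group numbering, here is a minimal sketch. The input line and the two-group pattern are assumptions made up for the demonstration, not the code from the question; groups are numbered by the position of their opening parenthesis.

```perl
use warnings;
use strict;

# Hypothetical input line in the shape the question describes.
my $line = "zone file1 1ss\n";

# Hypothetical two-group pattern: $1 is the second word, $2 the third.
my ( $second, $third ) = $line =~ /^zone\s+(\w+)\W+(\w+)\s*$/;
print "\$1 = $second, \$2 = $third\n";

# With a single group around the second word, $1 is the name to use.
my ($name) = $line =~ /^zone\s+(\w+)\W+\w+\s*$/;
print "name = $name\n";
```

In list context the match returns the captured groups directly, which avoids relying on $1 and $2 still being set later.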

If I fix these issues (keeping $filecount), I get:

use warnings;
use strict;

my $infn = 'testdat.dat';
open(my $infh, '<', $infn) or die "$infn: $!";

my $outfh;
my $filecount = 0;
while ( my $line = <$infh> ) {
    if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
        close $outfh if $outfh;
        my $outfn = sprintf '%s-%d.txt', $1, ++$filecount;
        open($outfh, '>', $outfn) or die "$outfn: $!";
    }
    if ($outfh) {
        print {$outfh} $line or die "print: $!";
    }
}

close($outfh) if $outfh;
close($infh);

However, this produces output files that are 7 lines or longer, because they include the endoffile marker and any lines following it. If you want the endoffile to be excluded from the output, the fix is fairly simple:

    if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
        ...
    }
    elsif ( $line =~ /^endoffile$/ ) {
        close $outfh if $outfh;
        $outfh = undef;
    }

If you want something more complex than this, then it'd be better to switch to a state machine type approach, which I showed some templates for at the top of this node.
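For readers without those templates at hand, the following is only a rough sketch of what a two-state approach could look like, not the templates themselves. The state names, the split_zones function, and the sample lines are all assumptions for illustration, and it collects lines in memory instead of writing files so that it stays short.

```perl
use warnings;
use strict;

# Hypothetical helper: group lines by zone name.
# States: 'idle' (between sections) and 'in_zone' (inside one).
sub split_zones {
    my @lines = @_;
    my %sections;
    my $state = 'idle';
    my $current;
    for my $line (@lines) {
        if ( $state eq 'idle' ) {
            if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
                $current = $1;
                $sections{$current} = [$line];
                $state = 'in_zone';
            }
            # any other line outside a zone is ignored
        }
        elsif ( $state eq 'in_zone' ) {
            if ( $line =~ /^endoffile\s*$/ ) {
                $state = 'idle';    # the marker itself is not stored
            }
            elsif ( $line =~ /^zone\s/ ) {
                die "unexpected zone line inside section '$current'";
            }
            else {
                push @{ $sections{$current} }, $line;
            }
        }
    }
    return \%sections;
}

# Made-up sample input, including a stray line between sections.
my $sections = split_zones(
    "zone alpha 1ss\n", "record1a\n", "endoffile\n",
    "stray line\n",
    "zone beta 1ss\n",  "record2a\n", "endoffile\n",
);
for my $name ( sort keys %$sections ) {
    printf "%s: %d lines\n", $name, scalar @{ $sections->{$name} };
}
```

The point of the explicit $state variable is that each kind of line is only interpreted in the context where it makes sense, so malformed input such as a zone line inside an open section can be detected instead of silently ending up in the wrong file.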

Re^2: Using the second word in a line to split a file into multiple files by az1962 (Initiate) on Aug 26, 2019 at 14:35 UTC
Hello, thank you so much for your help and guidance, I am very new to this forum. I was able to get this working, and you are correct, I had filecount in there from the code I was using to start out, and focused mainly on the regex problems.

I was able to get it to do exactly what was needed on the test data I presented, then realized the reason it wasn't working on the actual data file I need to parse, is because I have .s in the second word, and that second word always ends with a dot. (.s)

This works for the original test data:

use warnings;
use strict;

my $infn = '/Users/azeller/Documents/Rogers_import/20190822_RR_export-nrcmd.txt';
open(my $infh, '<', $infn) or die "$infn: $!";

my $outfh;
my $filecount = 0;
while ( my $line = <$infh> ) {
    if ( $line =~ /^zone\s+(\w+)\W+\w+\s*$/ ) {
        close $outfh if $outfh;
        my $outfn = sprintf '%sdb', $1;
        open($outfh, '>', $outfn) or die "$outfn: $!";
    }
    if ($outfh) {
        print {$outfh} $line or die "print: $!";
    }
}

close($outfh) if $outfh;
close($infh);

But it doesn't work for the actual data I am parsing. My data file format is actually more like the following:

`one 1file1.nest. 1ss record1a record1b record1c record 1d 2 record empty endoffile zone 2file2.egg. 1ss record1a record1b record1c record 1d 2 record empty endoffile`
Re^3: Using the second word in a line to split a file into multiple files by haukex (Archbishop) on Aug 26, 2019 at 14:42 UTC
Please use `<code>` tags to format your code and sample input and output.

"I have .s in the second word, and that second word always ends with a dot."

Now might be a good time to look at perlretut, as jcb suggested, or perhaps perlrequick. The `\w+` will only match Word characters (normally `[a-zA-Z0-9_]` plus Unicode "Word" characters), but not including the dot. Perhaps you want to say "word characters plus dot", i.e. `[\w.]+`, or simply "any non-whitespace characters", i.e. `\S+`.

Update: Edited first sentence that was accidentally cut off.
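As a quick check of the three character classes, here is a small sketch run against one header line from the sample data in the question. The surrounding patterns are just one possible way to write them, assuming the name is followed by whitespace and one more word.

```perl
use warnings;
use strict;

# Header line taken from the sample data in the question.
my $line = "zone 2file2.egg. 1ss";

# \w+ stops at the first dot, so the overall pattern fails to match.
my ($word_only) = $line =~ /^zone\s+(\w+)\W+\w+\s*$/;

# "Word characters plus dot" and "any non-whitespace" both match.
my ($with_dot)  = $line =~ /^zone\s+([\w.]+)\s+\w+\s*$/;
my ($non_space) = $line =~ /^zone\s+(\S+)\s+\S+\s*$/;

for my $result ( [ '\w+', $word_only ], [ '[\w.]+', $with_dot ], [ '\S+', $non_space ] ) {
    my ( $class, $capture ) = @$result;
    printf "%-8s %s\n", $class, defined $capture ? $capture : '(no match)';
}
```

Note that both matching variants capture the trailing dot as part of the name, so any filename built from the capture will contain it unless it is stripped first.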