legend has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to read a file and capture particular lines into different strings:
LENGTH: Some Content here
TEXT: Some Content Here
COMMENT: Some Content Here
I want to be able to get (LENGTH: .... ) into one array, and so on. I'm trying to use Perl in slurp mode, but for some reason I'm having trouble. I solved the problem by reading the file line by line, but I want to use slurp mode because the input looks something like this:
LENGTH: ......................................................
..................................................................
..................................................................
...................................................................
..................................................................
SUBJECT: .......................................................
COMMENT: .....................................................
....................................................................
As you can see, the data that I want is not limited to one line but spans multiple lines, so I thought a regex over multiple lines would be better. Do you have any suggestions on how to solve this problem?

Replies are listed 'Best First'.
Re: I need to regex multiple lines
by hipowls (Curate) on Feb 26, 2008 at 07:48 UTC

    There are two regex modifiers that help: /s so that . matches \n, and /m so that ^ and $ match at the start and end of logical lines. In addition, /x allows free formatting and embedded comments.

    Using these modifiers, your regex can be written as

    my @array = $line =~ m{
        ^(LENGTH:     # a line beginning with LENGTH:
        .*?)          # as little as possible until
        ^(SUBJECT:    # a line beginning with SUBJECT:
        .*?)          # as little as possible until
        ^(COMMENT:    # a line beginning with COMMENT:
        .*)           # and the rest
    }msx;
    which will create an array with
    $VAR1 = [
              'LENGTH: ......................................................
    ..................................................................
    ..................................................................
    ...................................................................
    ..................................................................
    ',
              'SUBJECT: .......................................................
    ',
              'COMMENT: .....................................................
    ....................................................................
    '
            ];
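To see what /s and /m each contribute on their own, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical three-line sample standing in for the slurped file.
my $text = "LENGTH: a\nb\nSUBJECT: c\n";

# Without /s, . stops at the first newline, so only the rest of line 1 is captured.
my ($no_s) = $text =~ /LENGTH:(.*)/;

# With /s, . also matches newlines, so the capture runs to the end of the string.
my ($with_s) = $text =~ /LENGTH:(.*)/s;

# Without /m, ^ matches only at the very start of the string.
my $no_m = $text =~ /^SUBJECT:/ ? 1 : 0;

# With /m, ^ also matches right after each embedded newline.
my $with_m = $text =~ /^SUBJECT:/m ? 1 : 0;

print "without /s: [$no_s]\n";
print "with /s:    [$with_s]\n";
print "^SUBJECT without /m: $no_m, with /m: $with_m\n";
```

In the sample, the greedy .* with /s swallows the SUBJECT line as well, which is why the regex above uses the non-greedy .*? together with ^ anchors under /m.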

Re: I need to regex multiple lines
by jwkrahn (Abbot) on Feb 26, 2008 at 07:39 UTC

    If you want lines into strings:

    my ( $key, %data );
    while ( <FH> ) {
        if ( /^([^:]+):/ ) {
            $data{ $key = $1 } = $_;
        }
        else {
            $data{ $key } .= $_;
        }
    }

    If you want lines into arrays:

    my ( $key, %data );
    while ( <FH> ) {
        if ( /^([^:]+):/ ) {
            $data{ $key = $1 } = [ $_ ];
        }
        else {
            push @{ $data{ $key } }, $_;
        }
    }

    If you just want an array instead of a hash:

    my ( $key, @data );
    while ( <FH> ) {
        if ( /^([^:]+):/ ) {
            push @data, [ $_ ];
        }
        else {
            push @{ $data[ -1 ] }, $_;
        }
    }

    Update: and with a single array:

    my ( $key, @data );
    while ( <FH> ) {
        if ( /^([^:]+):/ ) {
            push @data, $_;
        }
        else {
            $data[ -1 ] .= $_;
        }
    }
      Hi,
      jwkrahn++, but only if there are no further colons in the text...
      Regards,
      svenXY
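A small sketch of the caveat about further colons: with /^([^:]+):/, any continuation line that contains a colon is taken as a new key. One possible workaround, assuming the field names are known in advance, is to match only those names. The sample lines and the names %loose, %exact and $field are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: the second line is a continuation line that contains a colon.
my @lines = ( "LENGTH: 10\n", "see note: more\n", "SUBJECT: hi\n" );

# Assumed whitelist of field names, taken from the sample in the question.
my $field = qr/^(LENGTH|SUBJECT|COMMENT):/;

my ( $key, %loose, %exact );

# Original pattern: everything before the first colon becomes a key.
for (@lines) {
    if (/^([^:]+):/) { $loose{ $key = $1 } = $_ }
    else             { $loose{$key} .= $_ }
}

# Whitelisted pattern: only known field names start a new entry.
undef $key;
for (@lines) {
    if ( $_ =~ $field ) { $exact{ $key = $1 } = $_ }
    else                { $exact{$key} .= $_ }
}

print "loose keys: ", join( ", ", sort keys %loose ), "\n";
print "exact keys: ", join( ", ", sort keys %exact ), "\n";
```

With this input the loose version ends up with a spurious "see note" key, while the whitelisted version appends that line to the LENGTH entry.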
Re: I need to regex multiple lines
by grizzley (Chaplain) on Feb 26, 2008 at 07:56 UTC

    Assuming that the input starts with an uppercase key and that all your keys are uppercase strings, I would suggest the following:

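    my ( $key, %hash );
    while (<>) {
        # a line starting with an uppercase key switches the current key;
        # the key and its colon are stripped from the line
        if ( s/^([A-Z]+):// ) { $key = $1 }
        push @{ $hash{$key} }, $_;
    }
    # use the array stored under 'LENGTH'
    print @{ $hash{'LENGTH'} };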
Re: I need to regex multiple lines
by Erez (Priest) on Feb 26, 2008 at 12:54 UTC
The requested "slurp-solution"
by rminner (Chaplain) on Feb 26, 2008 at 09:36 UTC
    Comment: The end-of-string anchor \Z in $capture is needed because of the lookahead (?=...) in the regex. Without it, the last entry would not match.
    use strict;
    use warnings;
    use File::Slurp;
    
    my $wholefile = read_file('data.txt');
    my $capture = qr{(LENGTH|SUBJECT|COMMENT|\Z)};
    
    while ($wholefile =~ m{^$capture:(.*?)(?=$capture)}smgcx) {
        my ($type, $data) = ($1, $2);
        print "Type: $type\n";
        print "Data: $data\n";
    }
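If File::Slurp is not installed, the same idea can be sketched with core Perl only, by localizing $/ to undef for the read. The in-memory sample string below is hypothetical and stands in for data.txt; the regex is the one from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of data.txt.
my $input = "LENGTH: 1\n2\nSUBJECT: s\nCOMMENT: c\nd\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

# Slurp mode: with $/ undefined, a single read returns the whole file.
my $wholefile = do { local $/; <$fh> };
close $fh;

my $capture = qr{(LENGTH|SUBJECT|COMMENT|\Z)};

my @pairs;
while ( $wholefile =~ m{^$capture:(.*?)(?=$capture)}smgx ) {
    push @pairs, [ $1, $2 ];    # [ type, data ]
}

print "Type: $_->[0]\nData: $_->[1]\n" for @pairs;
```

Two things to keep in mind: because \Z can match just before a final newline, the last entry is captured without its trailing newline, and because the lookahead is not anchored with ^, a field name appearing in the middle of a line also ends the current entry.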
Re: I need to regex multiple lines
by locked_user sundialsvc4 (Abbot) on Feb 26, 2008 at 16:16 UTC

    Yet another approach to consider is one that might be used, say, with Perl's little brother, awk. This tool is based on the idea of “here's a bunch of regular expressions and code blocks. For each line, find the matching expression(s) and do what they say.” Importantly, there is also a BEGIN block that's executed before the first line, and an END block that's executed afterwards. (Yes, this is where Larry Wall got that idea...)

    So what you can do is to define a “state machine” of sorts. For instance, when you see a line that starts with 'LENGTH' you go into this mode; when you see 'SUBJECT' you go into that mode, and so-on. The “mode” value then tells you what to do with each line that does not match any of these; say, a line consisting of dots.
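As a rough sketch of that idea in Perl: keep the current mode in a variable and look up what to do with ordinary lines in a table of code blocks. The sample lines, the %action table and what each mode does with its lines are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the real file.
my @lines = (
    "LENGTH: 3\n", "....\n", "....\n",
    "SUBJECT: test\n",
    "COMMENT: first\n", "....\n",
);

my @length_lines;
my $comment = '';

# One action per mode: the mode decides what happens to a line
# that does not itself start a new section.
my %action = (
    LENGTH  => sub { push @length_lines, $_[0] },    # keep every line
    SUBJECT => sub { },                              # ignore extra lines
    COMMENT => sub { $comment .= $_[0] },            # build one string
);

my $mode;
for my $line (@lines) {
    if ( $line =~ /^(LENGTH|SUBJECT|COMMENT):/ ) {
        $mode = $1;    # state transition; the header line itself is skipped here
        next;
    }
    $action{$mode}->($line) if defined $mode;
}

printf "%d LENGTH lines, comment body: %s", scalar @length_lines, $comment;
```

A real parser would probably also pass the header line to the action; it is skipped here only to keep the sketch short.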

    What's “the right way” to do it? Of course there is none. But this approach is useful to put into your thinking-cap when you must deal with a more complicated issue such as parsing a printed-output file.

    Finally, for very complicated inputs, you can actually use a true parser.

Re: I need to regex multiple lines
by legend (Sexton) on Feb 26, 2008 at 19:41 UTC
    Wow... so many solutions. Thank you all so much. The $/ interests me a lot. I have read the article and decided upon a delimiter, but I'm not really sure how to make it work. I'm trying:
    open(IN, "filename.txt");
    $/ = '/delimiter here/';
    while(<>) {
    }
    But I'm confused: how do I read chunks of data and operate on them? I mean, grab the chunk with the delimited text and then perform some regex matching on it.
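A minimal sketch of reading by chunks with $/, under a few assumptions: every record starts with "LENGTH:", that string never appears elsewhere in the data, and the sample input below (an in-memory string) is hypothetical. Note that $/ holds a plain string, not a regex, and that the loop has to read from the handle that was opened rather than from <>.

```perl
use strict;
use warnings;

# Hypothetical sample input; in real use, open the actual file instead.
my $input = <<'END';
LENGTH: 12
34
SUBJECT: first
COMMENT: one
two
LENGTH: 56
SUBJECT: second
COMMENT: three
END

open my $in, '<', \$input or die "Cannot open in-memory file: $!";

my @records;
{
    # Each read now returns everything up to and including the next "LENGTH:".
    local $/ = "LENGTH:";
    while ( my $chunk = <$in> ) {
        chomp $chunk;                  # chomp strips the trailing $/ string
        next unless $chunk =~ /\S/;    # skip the empty chunk before the first record
        $chunk = "LENGTH:$chunk";      # put back the delimiter that belongs to this record

        # Ordinary regex matching on the chunk: split it into its fields.
        my %field;
        while ( $chunk =~ /^(LENGTH|SUBJECT|COMMENT):\s*(.*?)(?=^(?:LENGTH|SUBJECT|COMMENT):|\z)/msg ) {
            $field{$1} = $2;
        }
        push @records, \%field;
    }
}
close $in;

printf "%d records\n", scalar @records;
print "SUBJECT of record 2: $records[1]{SUBJECT}";
```

Because the delimiter sits at the start of each record rather than the end, every chunk ends with the next record's "LENGTH:", which is why it is chomped off and then added back at the front.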