matching keyword in multi-line records

arthur99 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a file that is something like this:

 
2008-10-01 message 1
2008-10-02 message 2
2008-10-03 multi-line message
         message 3 keyword
2008-10-04 message 4
2008-10-05 multi-line
         blah blah
         message 5 keyword junk
2008-10-06 message 6 blah keyword
2008-10-06 this is message 7

I want to grab all message lines that contain "keyword":

2008-10-03 multi-line message
         message 3 keyword
2008-10-05 multi-line
         blah blah
         message 5 keyword junk
2008-10-06 message 6 blah keyword

I tried adding the /m and /s modifiers to my regex but I couldn't get it working. I'm a perl newbie, so I must have botched something in my regex.

undef $/;
while ( /^(\d+-\d+-\d+.*?keyword)(?!\d+-\d+)$/smg ) {
      print "$1\n";
}

Thanks,

Arthur


Re: matching keyword in multi-line records by JavaFan (Canon) on Oct 20, 2008 at 20:58 UTC
Please, please, please, next time you have a question, post not only your code but also the output, and the reason why that output isn't the output you want. Your time isn't more valuable than ours, and now people will have to cut and paste your code and run it to see what it does. You get more and better answers if you post the output as well. And considering that your program does produce output, we just have to guess why it's wrong.

Your regexp is too greedy. It'll match something starting with a date, and then match up to a 'keyword' at the end of a line, including more lines with dates. You might want to try:

    /^(\d+-\d+-\d+.*(?:\n\s+.*)*keyword.*)/mg
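To see the over-matching concretely, here is a small sketch (not part of the original reply) that runs the question's pattern against the sample data from the question. The second pattern, and the variable names `$log`, `@original`, `@records` and `@wanted`, are assumptions made for illustration: it collects whole records first and filters them afterwards, which is only one of several ways to keep a match inside a single record.

```perl
use strict;
use warnings;

# Sample data copied from the question.
my $log = <<'END';
2008-10-01 message 1
2008-10-02 message 2
2008-10-03 multi-line message
         message 3 keyword
2008-10-04 message 4
2008-10-05 multi-line
         blah blah
         message 5 keyword junk
2008-10-06 message 6 blah keyword
2008-10-06 this is message 7
END

# The pattern from the question. Under /s the lazy ".*?" can still cross
# newlines, so a match that begins on one date line runs on through later
# date lines until it finds a "keyword" that ends a line.
my @original = $log =~ /^(\d+-\d+-\d+.*?keyword)(?!\d+-\d+)$/smg;

# Assumed alternative: no /s, so "." stays on one line, and only lines that
# start with a space or tab are accepted as continuations of a record.
my @records = $log =~ /^(\d+-\d+-\d+.*(?:\n[ \t]+.*)*)/mg;
my @wanted  = grep { /keyword/ } @records;

print "original pattern: ", scalar(@original), " match(es)\n";
print "---\n$_\n" for @original;
print "record-at-a-time: ", scalar(@wanted), " match(es)\n";
print "---\n$_\n" for @wanted;
```

On this data the original pattern's first match starts at the 2008-10-01 line and swallows messages 1 and 2 before it reaches the first "keyword", which is the behaviour described above.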
Re: matching keyword in multi-line records by moritz (Cardinal) on Oct 20, 2008 at 20:49 UTC
You do `undef $/`, but where do you actually read the contents into `$_`? Anyway, this works for me:

    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = do { local $/; <DATA> };
    while ( /^(\d+-\d+-\d+(?:(?!\d+-\d+).)*keyword)(?:(?!\d+-\d+).)*?$/smg ) {
        print "$1\n";
    }
    __DATA__
    2008-10-01 message 1
    2008-10-02 message 2
    2008-10-03 multi-line message
             message 3 keyword
    2008-10-04 message 4
    2008-10-05 multi-line
             blah blah
             message 5 keyword junk
    2008-10-06 message 6 blah keyword
    2008-10-06 this is message 7
Re: matching keyword in multi-line records by billward (Initiate) on Oct 20, 2008 at 23:29 UTC
Clearly the code snippet you gave is incomplete. The problem here is probably that you have entries that span multiple lines.

If the data is (and always will be) relatively small, the easiest way is probably to just load everything into an array. When a line starts with whitespace you add it to the end of the previous entry, and otherwise you start a new entry. Then you can go through the array with grep or foreach to find what you want. This is especially useful if you need to do multiple searches, as file operations are slower than in-memory ones. Something like this:

    my @stuff;
    while (<>) {
        if (/^\s/) {
            $stuff[-1] .= $_;    # continuation line: append to the previous entry
        } else {
            push @stuff, $_;     # new entry
        }
    }
    print grep { /keyword/ } @stuff;

If the data is large, though, that would gobble up too much memory. In that case I'd go for something more like this, especially if you just need to do the search once:

    my $last_entry = '';    # start empty so the first test doesn't warn
    while (<>) {
        if (/^\s/) {
            $last_entry .= $_;
        } else {
            print $last_entry if $last_entry =~ /keyword/;
            $last_entry = $_;
        }
        # eof() with empty parentheses is true only at the end of the last input file
        print $last_entry if $last_entry =~ /keyword/ && eof();
    }
Re: matching keyword in multi-line records (slurp--) by tye (Sage) on Oct 21, 2008 at 02:13 UTC
Slurping the whole file into memory at once tends to often suck. And avoiding that also means you can avoid complicated regexes (using a very SMoP instead).

    my $message= '';
    while( <> ) {
        if( ! /^\d{4}-\d\d-\d\d / ) {
            $message .= $_;
        } else {
            print $message if( $message =~ /keyword/ );
            $message= $_;
        }
    }
    print $message if( $message =~ /keyword/ );

Or, if you prefer:

    my $line= <>;
    while( defined $line ) {
        my $message= '';
        do {
            $message .= $line;
        } while( defined( $line= <> ) && $line !~ /^\d{4}-\d\d-\d\d / );
        print $message if( $message =~ /keyword/ );
    }

- tye
Re^2: matching keyword in multi-line records (slurp--) by arthur99 (Initiate) on Oct 21, 2008 at 04:09 UTC
Thanks to all you wise and benevolent monks for your replies. In reviewing your solutions, it's obvious that I'm still quite the neophyte!

Sorry about not posting the output -- it was an oversight and I apologize for that. I'm begging for clemency as this is my first post :-)

Many thanks again to all of you.
Re: matching keyword in multi-line records by graff (Chancellor) on Oct 21, 2008 at 01:33 UTC
I'm with moritz: if the file is not too big, reading it into a single scalar string is the way to go (once you actually do read the file). But unlike moritz, I would split on the date pattern, using parens in the split regex so that the dates get returned along with the stuff around them:

    use strict;

    my $text = do { local $/; <> };
    my $date_rgx = qr/\d{4}-\d{2}-\d{2}/;
    my $date = '';    # declare only $date here, so $text keeps the file contents
    for ( split /\n($date_rgx)/, $text ) {   # capture the date (but not the preceding \n)
        if ( /^$date_rgx$/ ) {
            $date = $_;
        }
        elsif ( /\bkeyword\b/ ) {
            print "$date$_\n";
        }
    }

(not tested; updated to use a variable other than $_ to hold the original file data, and avoid possible confusion in the "for" loop.)
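As an illustration of what a capturing group in the split pattern returns, here is a self-contained sketch using the sample data from the question. It is a variation rather than the code above: it anchors the date with `^` under `/m` instead of a leading `\n`, so the first record is split like every other one, and the names `$log`, `@pieces` and `@hits` are made up for this example.

```perl
use strict;
use warnings;

# Sample data copied from the question.
my $log = <<'END';
2008-10-01 message 1
2008-10-02 message 2
2008-10-03 multi-line message
         message 3 keyword
2008-10-04 message 4
2008-10-05 multi-line
         blah blah
         message 5 keyword junk
2008-10-06 message 6 blah keyword
2008-10-06 this is message 7
END

my $date_rgx = qr/\d{4}-\d{2}-\d{2}/;

# Because the pattern has capturing parentheses, split returns the dates
# as list elements in between the text that follows each of them.
my @pieces = split /^($date_rgx)/m, $log;

# The text starts with a date, so split yields an empty leading field.
shift @pieces;

# The list now alternates date, body, date, body, ...
my @hits;
while ( my ( $date, $body ) = splice @pieces, 0, 2 ) {
    push @hits, "$date$body" if $body =~ /\bkeyword\b/;
}

print @hits;
```

Each body keeps its trailing newline here, so the records can be printed without adding one.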