proper way of matching multiple line patterns

perlperlperl has asked for the wisdom of the Perl Monks concerning the following question:

Re: proper way of matching multiple line patterns by Eliya (Vicar) on Dec 30, 2011 at 03:50 UTC
If each record always starts with "Name:", you could use that as the input record separator, aka `$/`, and then just read the records one by one from the file...
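A minimal sketch of that idea, not Eliya's original code: the sample data, the in-memory filehandle, and the `@records` array are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; the field names are assumptions.
my $input = <<'END';
Name: Alice
Phone: 555-0100
Name: Bob
Phone: 555-0199
END

# Read from a string so the sketch is self-contained; a real file works the same way.
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "Name:";                 # each read now ends at the next "Name:"
    while (my $chunk = <$fh>) {
        chomp $chunk;                   # chomp strips the trailing "Name:" separator
        next unless $chunk =~ /\S/;     # skip the empty chunk before the first record
        push @records, "Name:$chunk";   # restore the prefix that chomp removed
    }
}

print "--- record ---\n$_" for @records;
```

Because `$/` is localized inside the block, the rest of the program still reads ordinary lines.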
Re^2: proper way of matching multiple line patterns by mbethke (Hermit) on Dec 30, 2011 at 05:47 UTC
I usually use Moritz's approach but that's actually a very sexy idea! Me likes.
Re: proper way of matching multiple line patterns by moritz (Cardinal) on Dec 30, 2011 at 05:08 UTC
If the format is line-based, it is often easier to do line-based processing and add a bit of extra logic that is not inside a regex. For example, you could use an approach like this:

```perl
use strict;
use warnings;
use Data::Dumper;

sub use_the_data {
    my $d = shift;
    print Dumper $d;
}

my %d;
while (<DATA>) {
    chomp;
    # limit of 2 keeps values that contain a colon intact
    my ($key, $value) = split /:\s*/, $_, 2;
    if (%d && $key eq 'Name') {
        # a new record starts here: hand over the finished one
        use_the_data \%d;
        # reset %d
        %d = ($key => $value);
    } else {
        $d{$key} = $value;
    }
}
# the last record has no following "Name" line to trigger it
use_the_data \%d;

__DATA__
Name: 123
Phone: foo
Name: blubb
Phone: blah
```

Perl 6 - second systems done right
Re: proper way of matching multiple line patterns by TJPride (Pilgrim) on Dec 30, 2011 at 06:41 UTC
You may be overthinking this. Records are probably not going to be more than 100-150 characters each, so unless you have hundreds of thousands of records, you're not likely to run into memory problems. But yes, as other people have mentioned, you can set the input record separator to the name of the first field, like so:

```perl
use strict;
use warnings;
use Data::Dumper;

my $first = 'Name'; # Name of first field in each record
my %data;

$/ = "$first: ";
<DATA>;                 # discard the empty chunk before the first "Name: "
while (<DATA>) {
    chomp;              # strip the trailing "Name: " separator
    $_ = $/ . $_;       # put it back at the front of the record
    %data = ();
    # one "key: value" pair per line of the record
    $data{$1} = $2 while m/^(.*?): (.*?)$/mg;
    print Dumper(\%data);
}

__DATA__
Name: Theodore Pride
Phone: (911) 911-9111
Address: 1234 Road
Name: Theodore Pride
Phone: (911) 911-9111
Address: 1234 Road
```
Re: proper way of matching multiple line patterns by Marshall (Canon) on Dec 30, 2011 at 13:20 UTC
I hope that the replies have been helpful so far. The common theme that connects them is that since you are dealing with address records, you should parse the input so that you have one record per name. This simplifies the search regex. Instead of doing one match against some huge string, you iterate over the records, applying the search regex(es) to each record. Presumably the search result will be a complete record or a partial record. Do the record separation on input rather than in each regex search.

The records could be stored as an array of strings (a simple @record_as_string) where each element is one string representing the whole record, or as an Array of Hash (AoH). That's what TJPride did (or close): instead of the print, just `push @AoH, {%data};`. Push a copy, because his loop reuses the same %data, so `\%data` would store the same reference every time.

I think it's appropriate to mention that in addition to redefining the input record separator to be "Name:", you can also set it (`$/`) to undef. If you do that, the entire file can be "slurped" into one variable without doing all the concatenation stuff. But I don't think that is what you need.

```perl
my $all_data;
{
    local $/ = undef;   # no separator means whole file
    $all_data = <DATA>;
}
# now $/ is back to what it was before;
# that is what the local within a lexical scope did
```

I also really doubt that you are going to run into a memory problem. A 10MB file in the format that you have would be no problem at all, and would have a LOT of addresses! There are ways to solve any "memory problem", but I don't think that memory is even close to being an issue according to your description. If the data set is a 100MB file, then we probably ought to talk more.

If you are familiar with 'C', the Perl Array of Hash is very similar to the 'C' array of structures.

Lots of streets are named after people. Geez, how many "Martin Luther King" boulevards are there? If you go this way, it will be easier to "fine tune" your search regexes to the data that is relevant. There are two great parsing techniques. Both are fine.

If you need help on searching the array structure (however you choose to do it), ask again.
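As a rough sketch of what searching such a structure could look like: the sample records, the field names, and the `@hits` variable below are assumptions for illustration, not data from the original question.

```perl
use strict;
use warnings;

# Hypothetical records, already split so that there is one string per name.
my @record_strings = (
    "Name: Alice King\nPhone: 555-0100\nAddress: 1 Elm Street\n",
    "Name: Bob Stone\nPhone: 555-0199\nAddress: 22 Martin Luther King Blvd\n",
);

# Build the Array of Hash: one hash per record.
my @AoH;
for my $text (@record_strings) {
    # In list context, /g returns every captured key and value in order.
    my %data = $text =~ /^([^:\n]+):[ \t]*(.*)$/mg;
    push @AoH, \%data;   # %data is a fresh variable on each pass, so each reference is distinct
}

# Search only the Name field, so "King" in a street name does not match.
my @hits = grep { ($_->{Name} // '') =~ /\bKing\b/ } @AoH;

print "$_->{Name}: $_->{Phone}\n" for @hits;   # prints "Alice King: 555-0100"
```

Matching a single field is what keeps the "Martin Luther King" boulevard from showing up as a person; the same search against the whole record string would return both entries.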