perlperlperl has asked for the wisdom of the Perl Monks concerning the following question:

Ex) Name: anon \n Phone: ### \n Address: 222 road \n Name: anon1 \n Phone: ###1 \n Address: 333 road \n

If I want to match the above record format, i.e. Name, Phone, Address (separated by newlines), in a file with a whole bunch of these, what should I do? Right now I read the file into an array, join the array into a single scalar, and then run a multi-line regex (using the multi-line match modifier) that scans for the Name, Phone, Address record pattern. This works. But is it the right way? I am concerned that joining thousands of lines into one scalar and then doing a multi-line match is not portable, because not all implementations may be able to store thousands of lines in a scalar.

Replies are listed 'Best First'.
Re: proper way of matching multiple line patterns
by Eliya (Vicar) on Dec 30, 2011 at 03:50 UTC

    If each record always starts with "Name:", you could use that as the input record separator, aka $/, and then just read the records one by one from the file...
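
    A rough sketch of that idea, assuming every record really does begin with "Name:" and that the string never occurs inside a field value. The sample data and the in-memory filehandle are only there to make the example self-contained; a real script would open the data file instead.

```perl
use strict;
use warnings;

# Hypothetical sample input, held in a string for illustration only.
my $input = <<'END';
Name: anon
Phone: 555-0100
Address: 222 road
Name: anon1
Phone: 555-0101
Address: 333 road
END

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "Name:";              # each read now ends at the next "Name:"
    while (my $chunk = <$fh>) {
        chomp $chunk;                # chomp strips the trailing "Name:" separator
        next unless $chunk =~ /\S/;  # skip the empty chunk before the first record
        push @records, "Name:$chunk";   # put the field name back
    }
}
close $fh;

print scalar(@records), " records read\n";
print $records[0];
```

    Each element of @records is then one complete record, so any regex only ever has to look at three lines at a time.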

      I usually use Moritz's approach but that's actually a very sexy idea! Me likes.
Re: proper way of matching multiple line patterns
by moritz (Cardinal) on Dec 30, 2011 at 05:08 UTC

    If the format is line-based, it is often easier to do line-based processing, and add a bit of extra logic that is not inside a regex.

    For example you could use an approach like this:

    use strict;
    use warnings;
    use Data::Dumper;

    sub use_the_data {
        my $d = shift;
        print Dumper $d;
    }

    my %d;
    while (<DATA>) {
        chomp;
        my ($key, $value) = split /:\s*/;
        if (%d && $key eq 'Name') {
            use_the_data \%d;
            # reset %d
            %d = ($key => $value);
        }
        else {
            $d{$key} = $value
        }
    }
    use_the_data \%d;
    __DATA__
    Name: 123
    Phone: foo
    Name: blubb
    Phone: blah
Re: proper way of matching multiple line patterns
by TJPride (Pilgrim) on Dec 30, 2011 at 06:41 UTC
    You may be overthinking this. Records are probably not going to be more than like 100-150 characters, so unless you have hundreds of thousands of records, you're not likely to run into memory problems. But yes, as other people have mentioned, you can set the input record separator ($/) to Name, like so:
    use strict;
    use warnings;
    use Data::Dumper;

    my $first = 'Name'; # Name of first field in each record
    my %data;

    $/ = "$first: ";
    <DATA>;

    while (<DATA>) {
        chomp;
        $_ = $/ . $_;
        %data = ();
        $data{$1} = $2 while m/^(.*?): (.*?)$/mg;
        print Dumper(\%data);
    }
    __DATA__
    Name: Theodore Pride
    Phone: (911) 911-9111
    Address: 1234 Road
    Name: Theodore Pride
    Phone: (911) 911-9111
    Address: 1234 Road
Re: proper way of matching multiple line patterns
by Marshall (Canon) on Dec 30, 2011 at 13:20 UTC
    I hope that replies have been helpful so far. The common theme that connects the replies is that since you are dealing with address records, you should parse the input so that you have one record per name. This simplifies the search regex. And instead of doing one match for some huge string, you iterate over the records, applying the search regex(s) to each record.

    Presumably the search result will be a complete record, or a partial record. Do the record separation on input rather than in each regex search.

    The records could be stored as an array of strings (simple @record_as_string) where each element is one string representing the whole record. Or as an Array of Hash (AoH), which is what TJPride did (or close): instead of the print, just push @AoH, { %data }; (push a copy, because %data is reused for every record; pushing \%data would store the same reference each time).
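
    Building the AoH from line-based input could look roughly like this. It is only a sketch: the sample lines are made up, and it assumes every record starts with a Name line.

```perl
use strict;
use warnings;

# Hypothetical sample lines; a real script would read them from a file.
my @lines = (
    'Name: anon',  'Phone: 555-0100', 'Address: 222 road',
    'Name: anon1', 'Phone: 555-0101', 'Address: 333 road',
);

my @AoH;
my %data;
for my $line (@lines) {
    # Limit of 2 keeps any later colons inside the value.
    my ($key, $value) = split /:\s*/, $line, 2;
    if (%data && $key eq 'Name') {
        push @AoH, { %data };   # store a copy; %data is reused below
        %data = ();
    }
    $data{$key} = $value;
}
push @AoH, { %data } if %data;  # do not forget the last record

print "$_->{Name} lives at $_->{Address}\n" for @AoH;
```

    Each element of @AoH is an independent hash reference, so later changes to %data do not affect records that were already stored.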

    I think it's appropriate to mention that, in addition to redefining the input record separator to be "Name:", you can also set it ($/) to undef. If you do that, the entire file can be "slurped" into one variable without doing all the concatenation. But I don't think that is what you need.

    my $all_data;
    {
        local $/ = undef;   #no separator means whole file
        $all_data = <DATA>;
    }
    # now $/ is back to what it was before
    # that is what the local within a lexical scope did
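
    For completeness, if the whole file does end up in one scalar, a global multi-line match can still walk through it record by record. This is just a sketch with made-up data; $all_data is filled from a literal string here rather than from a file, and the pattern assumes the three fields always appear in this order.

```perl
use strict;
use warnings;

# Stand-in for the slurped file contents (hypothetical sample data).
my $all_data = "Name: anon\nPhone: 555-0100\nAddress: 222 road\n"
             . "Name: anon1\nPhone: 555-0101\nAddress: 333 road\n";

my @found;
# /m lets ^ and $ match at line boundaries; /g resumes after the previous match.
while ($all_data =~ /^Name:[ \t]*(.*)\nPhone:[ \t]*(.*)\nAddress:[ \t]*(.*)$/mg) {
    push @found, { Name => $1, Phone => $2, Address => $3 };
}

printf "%d records matched\n", scalar @found;
```
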
    I also really doubt that you are going to run into a memory problem. A 10MB file in the format that you have would be no problem at all - and would have a LOT of addresses! There are ways to solve any "memory problem", but I don't think that memory is even close to being an issue according to your description. If the data set is a 100MB file, then we probably ought to talk more.

    If you are familiar with 'C', the Perl Array of Hash is very similar to the 'C' array of structures. Lots of streets are named after people. Geez, how many "Martin Luther King" boulevards are there? If you go this way, it will be easier to "fine tune" your search regexes to the field that is relevant, so that a search for a name does not also hit street names.
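
    To illustrate the difference, here is a small sketch with invented records. Searching only the Name field finds the person, while searching every field also picks up the street.

```perl
use strict;
use warnings;

# Hypothetical records, as they might look after parsing into an AoH.
my @AoH = (
    { Name => 'Martin King', Phone => '555-0100', Address => '222 Oak Road' },
    { Name => 'Ann Example', Phone => '555-0101', Address => '333 Martin Luther King Blvd' },
);

# Search restricted to one field.
my @by_name  = grep { $_->{Name} =~ /\bKing\b/ } @AoH;

# Search across all fields of each record.
my @anywhere = grep { join("\n", values %$_) =~ /\bKing\b/ } @AoH;

printf "Name field only: %d match(es)\n", scalar @by_name;
printf "Any field:       %d match(es)\n", scalar @anywhere;
```
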

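    Two good parsing techniques have been shown in this thread: setting $/ so that each read returns one record, and reading line by line with a bit of extra logic. Both are fine.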

    If you need help on searching the array structure (however you choose to do it), ask again.