in reply to Parsing Regex

Adding some inferred context, but avoiding the implied slurp, the following seems to address the issue:

use strict;
use warnings;
use Data::Dump::Streamer;

my @record;
my @records;

while (defined (my $line = <DATA>) or @record) {
    # A record starts at a "RECORD ..." line; end of input closes the last record
    my $recordStart = (! defined $line) || ($line =~ /RECORD /);

    next if ! @record and ! $recordStart;    # skip anything before the first record
    chomp $line if defined $line;

    if (! $recordStart || ! @record) {
        push @record, $line;
        next;
    }

    die "Corrupted record: \n" . (join " \n", @record) if @record < 3;

    # Take the header from the buffered record, not from the line just read
    my ($rec) = map {trim ($_)} $record[0] =~ /(RECORD .*)/;
    my ($id, $name) = map {trim ($_)} $record[1] =~ /(\d+)(.*)/;

    push @records, {rec => $rec, id => $id, name => $name};
    $records[-1]{date} = trim ($1)
        if $record[2] =~ s!^.*?(\d{2}/\d{2}/\d{4})!!;
    $records[-1]{address} = trim ($record[2]);
    @record = defined $line ? ($line) : ();
}

Dump (\@records);

sub trim {
    my ($str) = @_;

    return if ! defined $str;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

__DATA__
RECORD 1
###### Full Name 1a
Street Address 1a
City 1a ST1a Zip_1a COUNTY 1a
0######## Full Name 1b
abcABCabc 99/99/9999 Street Address 1b
City 1b ST1b Zip_1b COUNTY 1b
RECORD 2
############ Full Name 2a
99/99/9999 Street Address 2a
City 2a ST2a Zip_2a COUNTY 2a
0### Full Name 2b
abcABCabc 99/99/9999 Street Address 2b
City 2b ST2b Zip_2b COUNTY 2b

Prints:

$ARRAY1 = [
            {
              address => 'Street Address 1a',
              id      => 1,
              name    => 'a',
              rec     => 'RECORD 1'
            },
            {
              address => 'Street Address 2a',
              date    => '99/99/9999',
              id      => 2,
              name    => 'a',
              rec     => 'RECORD 2'
            }
          ];

True laziness is hard work

Re^2: Parsing Regex
by deMize (Monk) on Sep 23, 2009 at 14:25 UTC
    The method in the post above worked. I'm guessing this is my best option, since it seems I can't do this all in one regex (to my knowledge thus far).

    For those viewing the thread, the method is to substitute (subtract) the date portion from the string, and then use the remaining:
    $date = _trim($1) if $line =~ s/.*(\d{2}\/\d{2}\/\d{4})//;
    $address = _trim($1) if $line =~ /(.*)/;

       The if $line =~ /(.*)/ test is redundant: the capture grabs whatever is left in $line, and the match always succeeds (even when $line is undef). Just use $address = _trim($line) instead, or $address = _trim($line) if defined $line when $line may be undefined.

      The .* in s/.* is important because it deletes any junk before the date along with the date, leaving just the address for the following code.
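      Putting the two points together, here is a minimal sketch of the substitute-then-use-the-remainder approach. The helper name split_date_address and the sample strings are made up for illustration, and _trim is assumed to behave like the trim sub in the parent post:

```perl
use strict;
use warnings;

# Assumed helper: strips leading and trailing whitespace, like trim() above.
sub _trim {
    my ($str) = @_;
    return if !defined $str;
    $str =~ s/^\s+|\s+$//g;
    return $str;
}

# Hypothetical helper: returns (date, address); date is undef when absent.
sub split_date_address {
    my ($line) = @_;
    return if !defined $line;

    my $date;
    # The leading .* swallows any junk before the date, so the substitution
    # removes junk and date together and leaves only the address in $line.
    $date = $1 if $line =~ s{.*(\d{2}/\d{2}/\d{4})}{};

    return ($date, _trim($line));
}

for my $sample ('abcABCabc 99/99/9999 Street Address 1b', 'Street Address 1a') {
    my ($date, $address) = split_date_address($sample);
    printf "date=%s address=%s\n", $date // '(none)', $address;
}
```

      When the line holds no date the substitution simply fails, $line is left untouched, and the whole (trimmed) line is returned as the address.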


      True laziness is hard work