deMize has asked for the wisdom of the Perl Monks concerning the following question:

I've tried to simplify the file I want to parse. I hope I didn't oversimplify it to the point I negated my dilemma:
RECORD 1
###### Full Name 1a
Street Address 1a
City 1a ST1a Zip_1a COUNTY 1a
0######## Full Name 1b abcABCabc 99/99/9999
Street Address 1b
City 1b ST1b Zip_1b COUNTY 1b
RECORD 2
############ Full Name 2a
99/99/9999 Street Address 2a
City 2a ST2a Zip_2a COUNTY 2a
0### Full Name 2b abcABCabc 99/99/9999
Street Address 2b
City 2b ST2b Zip_2b COUNTY 2b

Notice a few things:
1) The # signs are actually digits
2) Certain lines may be prefixed by an erroneous '0' due to COBOL output
3) The two dates are different inputs (sometimes they appear, sometimes they don't)
4) There are intricacies that make it not possible to do this with a fixed width grab.

The following code pulls the data and stores it in variables. Note: this runs inside a loop, and everything is set up correctly. There are many other lines I didn't include in the file above, and all the variables are being stored correctly; it's just the second regex inside the if-statement that I'm slipping on.
(variable names and code modified for simplicity)
if ($array[$line] =~ /0?.*?(RECORD .*)/){
    $record = trim($1); # works correctly

    $array[$line+1] =~ /(\d+)(.*)/;
    $id = trim($1); # works correctly
    $name = trim($2); # works correctly

    # still looking at the "a" lines, sometimes there's a date, sometimes no date
    $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;
    $date = trim($1); # when no date it's using the previous $1 that goes into $id
    $address = trim($2); # when no date it's using the previous $2 that goes into $name

    ... code continues ...
Please understand that this is my best attempt at simplifying my code; the program is a little more intense than I'm able to show you. While I welcome best practices, keep in mind that they may already be in place. I appreciate your help (as always).

Update: I've deduced that the second '?', the one after the group that looks for the date, is not working how I'd like it to.
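
A minimal sketch of what that pattern does, assuming the date line carries leading whitespace or other junk (the sample input and the helper name try_match are hypothetical, not from the post). The lazy .*? is satisfied by matching nothing, the optional date group then fails at position 0 and is skipped, and (.*) takes the whole line, date included:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the pattern from the question unchanged
# and returns its two captures.
sub try_match {
    my ($line) = @_;
    return $line =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;
}

# Assumed input: a date preceded by a few spaces.
my ($date, $rest) = try_match('   99/99/9999 Street Address 2a');

print defined $date ? "date:<$date>\n" : "date: undef\n";
print "rest:<$rest>\n";
```

This prints "date: undef" followed by "rest:<   99/99/9999 Street Address 2a>". The match itself succeeds; the date group simply takes no part in it.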

Replies are listed 'Best First'.
Re: Parsing Regex
by GrandFather (Saint) on Sep 23, 2009 at 02:02 UTC

    Adding some inferred context, but avoiding the implied slurp, the following seems to address the issue:

    use strict;
    use warnings;
    use Data::Dump::Streamer;

    my @record;
    my @records;

    while (defined (my $line = <DATA>) or @record) {
        my $recordStart = (! defined $line) || ($line =~ /(RECORD .*)/);

        next if ! @record and ! $recordStart;
        chomp $line if defined $line;

        if (! $recordStart || ! @record) {
            push @record, $line;
            next;
        }

        die "Corrupted record: \n" . (join " \n", @record) if @record < 3;

        my $rec = trim ($1);
        my ($id, $name) = map {trim ($_)} $record[1] =~ /(\d+)(.*)/;

        push @records, {rec => $rec, id => $id, name => $name};
        $records[-1]{date} = trim ($1) if $record[2] =~ s!^.*?(\d{2}\/\d{2}\/\d{4})!!;
        $records[-1]{address} = trim ($record[2]);
        @record = defined $line ? ($line) : ();
    }

    Dump (\@records);

    sub trim {
        my ($str) = @_;

        return if ! defined $str;
        $str =~ s/^\s+//;
        $str =~ s/\s+$//;
        return $str;
    }

    __DATA__
    RECORD 1
    ###### Full Name 1a
    Street Address 1a
    City 1a ST1a Zip_1a COUNTY 1a
    0######## Full Name 1b abcABCabc 99/99/9999
    Street Address 1b
    City 1b ST1b Zip_1b COUNTY 1b
    RECORD 2
    ############ Full Name 2a
    99/99/9999 Street Address 2a
    City 2a ST2a Zip_2a COUNTY 2a
    0### Full Name 2b abcABCabc 99/99/9999
    Street Address 2b
    City 2b ST2b Zip_2b COUNTY 2b

    Prints:

    $ARRAY1 = [
        {
            address => 'Street Address 1a',
            id => 1,
            name => 'a',
            rec => 'RECORD 2'
        },
        {
            address => 'Street Address 2a',
            date => '99/99/9999',
            id => 2,
            name => 'a',
            rec => undef
        }
    ];

    True laziness is hard work
      The method in the post above worked. I'm guessing this is my best option, since it seems I can't do this all in one regex (to my knowledge thus far).

      For those viewing the thread, the method is to substitute (subtract) the date portion from the string, and then use the remaining:
      $date = _trim($1) if $line =~ s/.*(\d{2}\/\d{2}\/\d{4})//;
      $address = _trim($1) if $line =~ /(.*)/;

        The trailing if $line =~ /(.*)/ is redundant. The capture matches the whole string and the match always succeeds (even when $line is undef). Instead just use $address = _trim($line), or, if $line can be undefined, $address = _trim($line) if defined $line.

        The .* in s/.* is important because it deletes any junk before the date along with the date, leaving just the address for the following code.
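
        The substitute-then-use-the-remainder approach can be sketched as a small function. This is an illustration only: the name split_date_address and the sample lines are hypothetical, _trim is a stand-in for whatever trimming helper the real program uses, and the anchored lazy prefix (^.*?) is the form from the script above rather than the greedy .* shown in the two-line summary.

```perl
use strict;
use warnings;

# Stand-in for the program's own trimming helper (assumed behaviour).
sub _trim {
    my ($str) = @_;
    return if !defined $str;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

# Hypothetical wrapper: returns (date, address); date is undef when absent.
sub split_date_address {
    my ($line) = @_;
    my $date;

    # Delete any junk before the date along with the date itself.
    # If there is no date, the substitution fails and $line is untouched.
    $date = $1 if $line =~ s/^.*?(\d{2}\/\d{2}\/\d{4})//;

    return ($date, _trim($line));
}

for my $line ('   99/99/9999  Street Address 2a', '   Street Address 1a') {
    my ($date, $address) = split_date_address($line);
    printf "date:<%s> address:<%s>\n", (defined $date ? $date : ''), $address;
}
```

        Because $1 is read only when the substitution succeeds, a line without a date cannot pick up a stale capture from an earlier match.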


        True laziness is hard work
Re: Parsing Regex
by muba (Priest) on Sep 23, 2009 at 00:48 UTC

    Bear with me, it's 2:47 AM localtime, so I may be misinterpreting your question or the case, but how about:

    ($date, $address) = $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;
      ($date, $address) = $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})(.*)/;
      Removing the last '?', this almost works, but when $1 is empty so is $2. However, if I include the '?' the date is included in $2.

        Seems such a trivial thing, don't you agree?

        Let's give it another shot.

        while (<DATA>) {
            # lol, comments in __DATA__ :)
            next if m/^#/;

            #m!! to allow for better-readable slashes inside the regex
            #/x modifier to make the regex even better readable
            ($date, $address) = $_ =~ m! .*? (\d+ / \d+ / \d+)? \s* (.+) !x;

            print "date:<$date>\naddress:<$address>\n\n";
        }

        __DATA__
        # a line without a date
        A very good looking address
        # a line with a date
        15/5/85 That's my actual birth day!

        Output:

        date:<>
        address:<A very good looking address>

        date:<15/5/85>
        address:<That's my actual birth day!>
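
        For lines where the date does not sit at the very start, one possible single-regex variant is to make the leading junk and the date optional together, inside one non-capturing group. This is a hedged sketch, not something posted in the thread: the name parse_date_address and the sample lines are hypothetical, and it assumes dates always look like dd/dd/dddd.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, address); date is undef when absent.
sub parse_date_address {
    my ($line) = @_;
    my ($date, $address) = $line =~ m{
        ^
        (?: .*? (\d{2}/\d{2}/\d{4}) )?   # optional as a unit: leading junk, then a date
        \s* (.*?) \s*                    # the remainder, without surrounding whitespace
        $
    }x;
    return ($date, $address);
}

for my $line ('  0 99/99/9999 Street Address 2a', '   Street Address 1a') {
    my ($date, $address) = parse_date_address($line);
    printf "date:<%s>\naddress:<%s>\n\n", (defined $date ? $date : ''), $address;
}
```

        The optional group is tried first, so the lazy .*? has to stretch until a date is found. Only when no date exists anywhere on the line does the engine drop the whole group, which leaves the date capture undefined and hands the full line to the second capture.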