Parsing Regex

deMize has asked for the wisdom of the Perl Monks concerning the following question:

I've tried to simplify the file I want to parse. I hope I didn't oversimplify it to the point that I negated my dilemma:

 RECORD 1
 ######                  Full Name 1a
                         Street Address 1a
                         City 1a                       ST1a   Zip_1a  
+              COUNTY 1a
0########                Full Name 1b
 abcABCabc    99/99/9999 Street Address 1b
                         City 1b                       ST1b   Zip_1b  
+              COUNTY 1b

 RECORD 2
 ############            Full Name 2a
 99/99/9999              Street Address 2a
                         City 2a                       ST2a   Zip_2a  
+              COUNTY 2a
0###                     Full Name 2b
 abcABCabc    99/99/9999 Street Address 2b
                         City 2b                       ST2b   Zip_2b  
+              COUNTY 2b

Notice a few things:
1) The # signs are actually digits
2) Certain lines may be prefixed by an erroneous '0' due to COBOL output
3) The two dates are different inputs (sometimes they appear, sometimes they don't)
4) There are intricacies that make it impossible to do this with a fixed-width grab.

The following code pulls the data and stores it in variables. Note: this runs inside a loop, and everything there is set up correctly. There are many other lines I didn't include in the file above, and all the variables are storing correctly; it's just the second regex inside the if-statement that I'm slipping on.
(variable names and code modified for simplicity)

if ($array[$line]  =~ /0?.*?(RECORD .*)/){
   $record  = trim($1); # works correctly

   $array[$line+1] =~ /(\d+)(.*)/;
   $id      = trim($1); # works correctly
   $name    = trim($2); # works correctly

   # still looking at the "a" lines, sometimes there's a date, sometimes no date
   $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;
   $date    = trim($1); # when no date it's using the previous $1 that goes into $id
   $address = trim($2); # when no date it's using the previous $2 that goes into $name
   ... code continues ...

Please understand that this is my best attempt at simplifying my code; the program is a little more intense than I'm able to show you. While I welcome best practices, keep in mind that they may already be in place, and know I appreciate your help (as always).

Update: I've deduced that the second '?' (the one after the group that looks for the date) is not working the way I'd like it to.


Replies are listed 'Best First'.
Re: Parsing Regex by GrandFather (Saint) on Sep 23, 2009 at 02:02 UTC
Adding some inferred context, but avoiding the implied slurp, the following seems to address the issue:

use strict;
use warnings;
use Data::Dump::Streamer;

my @record;
my @records;

while (defined (my $line = <DATA>) or @record) {
    my $recordStart = (! defined $line) || ($line =~ /(RECORD .*)/);

    next if ! @record and ! $recordStart;
    chomp $line if defined $line;

    if (! $recordStart || ! @record) {
        push @record, $line;
        next;
    }

    die "Corrupted record: \n" . (join " \n", @record) if @record < 3;

    my $rec = trim ($1);
    my ($id, $name) = map {trim ($_)} $record[1] =~ /(\d+)(.*)/;

    push @records, {rec => $rec, id => $id, name => $name};
    $records[-1]{date} = trim ($1) if $record[2] =~ s!^.*?(\d{2}\/\d{2}\/\d{4})!!;
    $records[-1]{address} = trim ($record[2]);
    @record = defined $line ? ($line) : ();
}

Dump (\@records);

sub trim {
    my ($str) = @_;

    return if ! defined $str;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

__DATA__
 RECORD 1
 ######                  Full Name 1a
                         Street Address 1a
                         City 1a                       ST1a   Zip_1a
+              COUNTY 1a
0########                Full Name 1b
 abcABCabc    99/99/9999 Street Address 1b
                         City 1b                       ST1b   Zip_1b
+              COUNTY 1b

 RECORD 2
 ############            Full Name 2a
 99/99/9999              Street Address 2a
                         City 2a                       ST2a   Zip_2a
+              COUNTY 2a
0###                     Full Name 2b
 abcABCabc    99/99/9999 Street Address 2b
                         City 2b                       ST2b   Zip_2b
+              COUNTY 2b

Prints:

$ARRAY1 = [
            {
              address => 'Street Address 1a',
              id      => 1,
              name    => 'a',
              rec     => 'RECORD 2'
            },
            {
              address => 'Street Address 2a',
              date    => '99/99/9999',
              id      => 2,
              name    => 'a',
              rec     => undef
            }
          ];

True laziness is hard work
Re^2: Parsing Regex by deMize (Monk) on Sep 23, 2009 at 14:25 UTC
The method in the post above worked. I'm guessing this is my best option, since it seems I can't do this all in one regex (to my knowledge thus far). For those viewing the thread, the method is to substitute (subtract) the date portion from the string, and then use the remainder:

`$date = _trim($1) if $line =~ s/.*(\d{2}\/\d{2}\/\d{4})//;`
`$address = _trim($1) if $line =~ /(.*)/;`
Re^3: Parsing Regex by GrandFather (Saint) on Sep 23, 2009 at 20:25 UTC
`if $line =~ /(.*)/` is redundant. The capture matches the whole string and the match always succeeds (even when $line is undef). Instead just use `$address = _trim($line)`, or `$address = _trim($line) if defined $line` if $line can be undefined. The `.*` in `s/.*` is important because it deletes any junk before the date along with the date, leaving just the address for the following code.
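The substitute-then-trim approach discussed in this subthread can be sketched as a small function. This is only an illustration under stated assumptions: `split_date_address` is a hypothetical name, `_trim` is reimplemented here because the thread does not show it, and the lazy `^.*?` prefix (taken from the earlier reply) is used so that the first date on the line is the one removed.

```perl
use strict;
use warnings;

# Assumed implementation of the _trim helper mentioned in the thread:
# strips leading and trailing whitespace, passes undef through.
sub _trim {
    my ($str) = @_;
    return undef unless defined $str;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

# Hypothetical wrapper: returns ($date, $address); $date is undef
# when the line carries no date.
sub split_date_address {
    my ($line) = @_;
    my $date;
    # Delete everything up to and including the first date, and keep
    # the date itself from capture group 1.
    $date = $1 if $line =~ s!^.*?(\d{2}/\d{2}/\d{4})!!;
    # Whatever is left on the line is the address.
    return ($date, _trim($line));
}

for my $line (
    '                         Street Address 1a',
    ' 99/99/9999              Street Address 2a',
    ' abcABCabc    99/99/9999 Street Address 1b',
) {
    my ($date, $address) = split_date_address($line);
    printf "date:<%s> address:<%s>\n",
        (defined $date ? $date : ''), $address;
}
```

Because the substitution only sets `$date` when it succeeds, a line without a date cannot pick up a stale `$1` from an earlier match.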
Re: Parsing Regex by muba (Priest) on Sep 23, 2009 at 00:48 UTC
Bear with me, it's 2:47 AM localtime, so I may be misinterpreting your question or the case, but how about:

`($date, $address) = $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;`
Re^2: Parsing Regex by deMize (Monk) on Sep 23, 2009 at 01:02 UTC
`($date, $address) = $array[$line+2] =~ /.*?(\d{2}\/\d{2}\/\d{4})(.*)/;`

Removing the last '?', this almost works, but when $1 is empty so is $2. However, if I include the '?' the date is included in $2.
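A short sketch of why the optional group behaves this way, plus one possible single-regex workaround. With `.*?(date)?(.*)`, the engine can succeed at offset 0 by letting `.*?` match nothing and skipping the optional group, so `(.*)` takes the whole line. Moving the lazy prefix inside the optional group forces a search for a date before the group may be skipped. The names `parse_line` and `$date_re` are hypothetical, and this is one way to do it under those assumptions, not the method the thread settled on.

```perl
use strict;
use warnings;

# dd/dd/dddd, written with ! delimiters so the slashes need no escaping.
my $date_re = qr!\d{2}/\d{2}/\d{4}!;

# Hypothetical helper, not code from the thread.
# Returns ($date, $address); $date is undef when the line has no date.
sub parse_line {
    my ($line) = @_;
    # The optional group holds both the lazy prefix and the date, so the
    # engine has to look for a date before it is allowed to skip the group.
    my ($date, $address) = $line =~ /^(?:.*?($date_re))?\s*(.*?)\s*$/;
    return ($date, $address);
}

# The pattern from the question: the match succeeds immediately with the
# date group skipped, so the date ends up inside the second capture.
my $sample = ' 99/99/9999              Street Address 2a';
my ($d0, $a0) = $sample =~ /.*?(\d{2}\/\d{2}\/\d{4})?(.*)/;
print defined $d0 ? "date captured\n" : "date not captured\n";

for my $line (
    $sample,
    '                         Street Address 1a',
    ' abcABCabc    99/99/9999 Street Address 1b',
) {
    my ($date, $address) = parse_line($line);
    printf "date:<%s> address:<%s>\n",
        (defined $date ? $date : ''), $address;
}
```

Assigning the match in list context also sidesteps stale `$1`/`$2` values, since the variables are filled directly from this match.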
Re^3: Parsing Regex by muba (Priest) on Sep 23, 2009 at 01:12 UTC
Seems such a trivial thing, don't you agree? Let's give it another shot.

while (<DATA>) {
    # lol, comments in __DATA__ :)
    next if m/^#/;

    #m!! to allow for better-readable slashes inside the regex
    #/x modifier to make the regex even better readable
    ($date, $address) = $_ =~ m! .*? (\d+ / \d+ / \d+)? \s* (.+) !x;

    print "date:<$date>\naddress:<$address>\n\n";
}

__DATA__
# a line without a date
A very good looking address
# a line with a date
15/5/85 That's my actual birth day!

Output:

date:<>
address:<A very good looking address>

date:<15/5/85>
address:<That's my actual birth day!>
Re^4: Parsing Regex by deMize (Monk) on Sep 23, 2009 at 13:48 UTC
Re^5: Parsing Regex by deMize (Monk) on Sep 23, 2009 at 14:56 UTC
Re^4: Parsing Regex by deMize (Monk) on Sep 23, 2009 at 05:34 UTC