using lookaround assertions to grab info

punkish has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: using lookaround assertions to grab info by BrowserUk (Patriarch) on Jun 03, 2004 at 21:39 UTC
This might get you started? #! perl -slw use strict; use Data::Dumper; my $m = <<'EOM'; Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45 Address: 26800 BRADLEY RD Subdivsn: Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04 Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A EOM my %parts; $parts{ $1 } = $2 while $m =~ m[ (?: \A \| \n ) ( [^:]+ ) \s* : (.?) (?= (?: \n \S [^:] : ) \| \Z ) ]gxs; print Dumper \%parts; __END__ P:\test>360501 $VAR1 = { 'Address' => ' 26800 BRADLEY RD Subdivs +n:', 'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG +C0A', 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N +O. : TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: +10:45' }; [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re^2: using lookaround assertions to grab info by punkish (Priest) on Jun 04, 2004 at 02:39 UTC
drats... I composed a reply to this and then clicked somewhere else and lost it. Here is my second try... Your code seems to produce correct values, but not quite. More on that in a bit. But, since I am an acknowledged noob, I will have to spend quite a bit of time staring at... `$parts{ $1 } = $2 while $m =~ m[ (?: \A \| \n ) ( [^:]+ ) \s* : (.?) (?= (?: \n \S [^:] : ) \| \Z ) ]gxs;` [download] ...to figure out what is going on. I will do that and hopefully learn something, but at first glance it seems a bit beyond me for now. That said, the result is not what I want. Here is how -- `# You have 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45' # # I want 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. TICKET +EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081', 'Prior' => 2, 'Digstrt' => '03/30/04', 'Time' => '10:45'` [download] All that said, Roy Johnson's suggestion of splitting the lines on /\n\b/ set me on the right path and did the trick. Thanks.	[reply] [d/l] [select]
Re^3: using lookaround assertions to grab info by BrowserUk (Patriarch) on Jun 04, 2004 at 03:22 UTC
I too thought that Roy Johnstone's `split /\n\b/, ...` was inspired. I wish I had thought of it:) In terms of breaking down my code. The basic statement is pretty simple. It's just an 'add an element to the hash using $1 and $2 while the regex matches'. `$hash{ $1 } = $2 while $data =~ m[(...): (...)]g` [download] The only complicated bit is the regex itself, which uses a lookahead (as you suggested) to determine the end of each multi-line record. The options: /g, match as many times as you can; /x, ignor whitespace and comments; /s, allow '.' to match newlines so that we can pick up your multi-line bits. m[ # First we want the key, the text preceding the : (?: \A \| \n ) ## from the start the string or a newline ( [^:]+? ) ## capture everyline upto the : into $1 \s* ## but throw away any trailing spaces : ## preceding the : # Now grab everything (including newlines) into $2 (.?) # but stop if we find a newline followed # by a non-space preceding a : # or the end of string for the last record. (?= # lookahead (?: # non-capture group containing \n # a newline \S # follow by a non-space [^:] # and anything except a : : # and a : ) \| # OR \Z # the EOS ) ]gxs; [download] As for removing the extraneuos stuff, incorporating Roy Johnstone's simplification, I'd do it like this. #! perl -slw use strict; use Data::Dumper; my $m = <<'EOM'; Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45 Address: 26800 BRADLEY RD Subdivsn: Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04 Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A EOM my %parts; while( $m =~ m[ (?: \A \| \n ) ( [^:]+? ) \s* : (.*?) (?= (?: \n \b ) \| \Z ) ]gxs ) { my( $key, $value ) = ( $1, $2 ); $value =~ s[\n\s+:][]g; $parts{ $key } = $value; } print Dumper \%parts; __END__ P:\test>360501 $VAR1 = { 'Address' => ' 26800 BRADLEY RD Subdivs +n:', 'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG +C0A', 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N +O. TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: +10:45' }; [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: using lookaround assertions to grab info by Roy Johnson (Monsignor) on Jun 03, 2004 at 21:39 UTC
Perhaps you want to split on `/\n\b/`. That will require new records to begin with a word char (because \n is not a word char). The PerlMonk `tr///` Advocate	[reply] [d/l]
Re^2: using lookaround assertions to grab info by punkish (Priest) on Jun 03, 2004 at 21:49 UTC
Perhaps you want to split on /\n\b/. That will require new records to begin with a word char (because \n is not a word char). I am not sure how this helps me. In the example I have provided above, I have added the newlines. In reality, the email message is provided to me as a scalar, with all the newlines and all in it already. I just use Mail::IMAPClient to grab the message. As I said, the code I have above works fine even with splitting on newlines. Where it fails is its inability to "lookbehind" when it encounters a line without a label -- something that looks like so -- `: sometext` [download] I believe that when I encounter a line like above, I have to lookbehind and say, "Ha, this line doesn't begin with a label, hence it is just a continuation of the value of the previous label." This is where I am lost.	[reply] [d/l]
Re^3: using lookaround assertions to grab info by Roy Johnson (Monsignor) on Jun 03, 2004 at 22:00 UTC
In reality, the email message is provided to me as a scalar, with all the newlines and all in it already. And you want to split on those newlines, except where the next line doesn't begin with a letter. You want those lines that don't begin with a letter not to be split from the previous lines. Maybe you should just try it and see what you get. The PerlMonk `tr///` Advocate	[reply]
Re^4: using lookaround assertions to grab info by punkish (Priest) on Jun 03, 2004 at 22:08 UTC
Re: using lookaround assertions to grab info by Ven'Tatsu (Deacon) on Jun 03, 2004 at 21:26 UTC
I would usualy keep a state variable that lets me know what section of the input I'm in. my $section = ''; #remember the last section label we encountered foreach (@m) { if ($_ =~ /^Dig No\s:\s(\w)\sPrior:\s([0-9])\sDigstrt:\s([0-9]{ +2}\/[0-9]{2}\/[0-9]{2})\sTime:\s([0-9]{2}:[0-9]{2})/) { $section = 'Dig No'; $m{'DIG_NO' } = $1; $m{'PRIORITY'} = $2; $m{'DIGDATE' } = $3; $m{'DIGTIME' } = $4; } elsif ($_ =~ /^Address\s:\s(.)Subdivsn/) { $section = 'Address'; $m{'ADDRESS' } = $1; } elsif ($_ =~ /^Remarks\s:\s(.)/ \|\| $section eq 'Remarks') { #do +this if we enter the section or were already in the section $section = 'Remarks'; $m{'REMARKS' } = $1; } } [download] Note that this code is quite simple and will only work if only one section continues accross multiple lines, if you need more than one sections that handles multiple lines the same basic idea can work, but it takes more work.	[reply] [d/l]
Re^2: using lookaround assertions to grab info by punkish (Priest) on Jun 03, 2004 at 21:35 UTC
Note that this code is quite simple and will only work if only one section continues accross multiple lines, if you need more than one sections that handles multiple lines the same basic idea can work, but it takes more work. Thanks for the advice. I did think of such an approach and then discarded it for the very reason you state above. I'll look at it again and see if I can finagle something useful. I guess the best way to state the problem is that the value of any label continues until a new label is encountered even if \n is encountered on the way. The labels are distinguished by \s\w\s:	[reply]
Re^3: using lookaround assertions to grab info by ryantate (Friar) on Jun 03, 2004 at 21:48 UTC
Why not my $section = ''; #remember the last section label we encountered foreach (@m) { if (/^Dig No\s:\s(\w)\sPrior:\s([0-9])\sDigstrt:\s([0-9]{ +2}\/[0-9]{2}\/[0-9]{2})\sTime:\s([0-9]{2}:[0-9]{2})/) { $section = 'DIGTIME'; $m{'DIG_NO' } = $1; $m{'PRIORITY'} = $2; $m{'DIGDATE' } = $3; $m{'DIGTIME' } = $4; } elsif (/^Address\s:\s(.)Subdivsn/) { $section = 'ADDRESS'; $m{'ADDRESS' } = $1; } elsif (/^Remarks\s:\s(.)/ ) { $section = 'REMARKS'; $m{'REMARKS' } = $1; } elsif (/^\s:\s(.+?)\s*/) { $m{$section} .= $1; } } [download]	[reply] [d/l]
Re^3: using lookaround assertions to grab info by Ven'Tatsu (Deacon) on Jun 03, 2004 at 21:45 UTC
Have you though of extracting the match before the `if elsif ...` in the foreach loop? `my $section = ''; foreach (@m) { if (/^\s([\w\s]?)\s*:/ && $1) { #if we matched and we captured a s +ection label $section = $1; } if ($section eq '...') { ... }` [download]	[reply] [d/l] [select]
Re^4: using lookaround assertions to grab info by ryantate (Friar) on Jun 03, 2004 at 21:57 UTC
Re^5: using lookaround assertions to grab info by Ven'Tatsu (Deacon) on Jun 03, 2004 at 22:23 UTC
Re: using lookaround assertions to grab info by dragonchild (Archbishop) on Jun 04, 2004 at 01:56 UTC
BrowserUk's solution is good. Another, if his code scares you, is to do something like `my @lines = split /\n/, $m; my %m; my $curr_index; foreach my $line (@lines) { $line =~ /^\s(\w+)?\s:\s(.)$/; next unless $1 \|\| $2; unless (defined $curr_index) { die "First line doesn't have a key\n" unless $1; $m{$1} = $2; $curr_index = $1; next; } if ($1) { $m{$1} = $2; $curr_index = $1; next; } $m{$curr_index} .= "\n$2"; }` [download] Untested, but it should work. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested	[reply] [d/l]
Re^2: using lookaround assertions to grab info by punkish (Priest) on Jun 04, 2004 at 02:45 UTC
see my note above regarding Roy Johnson's suggestion. The key point was that any "label" whose value doesn't fit on one line, the value just continues on a subsequent line with an indented colon. `label one: somevalue label two: a very long value : that does not fit on one line so it continues : on another line label thr: someother value` [download] splitting the scalar on /\n\b/ does the trick. It "slurps" the subsequent lines that don't start with a word into the previous label's value. From then on it is just a matter of removing the newlines and the redundant colons. That said, I have much to learn from your regexp `$line =~ /^\s(\w+)?\s:\s(.)$/; next unless $1 \|\| $2;` [download] very neat use of 'next unless'. Thanks for the help.	[reply] [d/l] [select]
Re^3: using lookaround assertions to grab info by dragonchild (Archbishop) on Jun 04, 2004 at 12:48 UTC
Some thoughts about the /\n\b/ idea. It is very inspired, and I ++'ed it. But, it will fail in the following circumstances: If you are running on Unix and your email was received on a Mac (or Windows) and copied over using Samba or something similar. (The \n will not match the line ending.) If the email has a space at the beginning of a line with a key. (Mine handles this correctly, as does BrowserUk's.) Be absolutely sure you know what \b matches. It is a zero-width assertion that matches the boundary between \w\W or \W\w. \w is (basically) `[a-zA-Z0-9_]`. So, if one of your labels starts with a quote, it won't match. Now, if your situation avoids the above pitfalls, go right ahead. Comment it, though. If we, with large number of combined years of experience, consider it inspired, your maintenance programmer will consider it demonic and worthy of tracking you down with a bloody axe. ------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon.* - flyingmoose I shouldn't have to say this, but any code, unless otherwise stated, is untested	[reply] [d/l]