Re^2: using lookaround assertions to grab info

Drats... I composed a reply to this, then clicked somewhere else and lost it. Here is my second try...

Your code produces almost the right values, but not quite. More on that in a bit. But, since I am an acknowledged noob, I will have to spend quite a bit of time staring at...

$parts{ $1 } = $2 while $m =~ m[
    (?: \A | \n ) ( [^:]+ ) \s* : 
    (.*?)
    (?= (?: \n \S [^:]* : ) | \Z )
]gxs;

...to figure out what is going on. I will do that and hopefully learn something, but at first glance it seems a bit beyond me for now.

That said, the result is not what I want. Here is how --

# You have
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
       : TICKET EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081   Prior: 2     Digstrt: 03/30/04  Time: 10:45'
#
# I want
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. TICKET EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081',
'Prior' => 2,     
'Digstrt' => '03/30/04',
'Time' => '10:45'

All that said, Roy Johnson's suggestion of splitting the lines on /\n\b/ set me on the right path and did the trick.
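
For anyone following along, here is a minimal sketch of that split, assuming the ticket text sits in a scalar (called $m here, as in the rest of the thread). It is only an illustration of the idea, not necessarily the exact code I ended up with.

```perl
use strict;
use warnings;

# Sample ticket text copied from this thread.
my $m = <<'EOM';
Dig No : A081   Prior: 2     Digstrt: 03/30/04  Time: 10:45
Address: 26800 BRADLEY RD                      Subdivsn:
Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
       : TICKET EXPIRES AFTER 04/22/04
Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A
EOM

# \n\b matches only a newline that is directly followed by a word character,
# so the indented continuation line stays attached to the 'Remarks' record.
my @records = split /\n\b/, $m;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

The continuation line starts with a space, so there is no word boundary right after the preceding newline and the split leaves that record in one piece.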

Thanks.


Replies are listed 'Best First'.

Re^3: using lookaround assertions to grab info
by BrowserUk (Patriarch) on Jun 04, 2004 at 03:22 UTC

I too thought that Roy Johnson's split /\n\b/, ... was inspired. I wish I had thought of it :)

As for breaking down my code: the basic statement is pretty simple. It just adds an element to the hash using $1 and $2 for as long as the regex keeps matching.

$hash{ $1 } = $2 while $data =~ m[(...): (...)]g
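
A stripped-down sketch of that statement form, using made-up sample data and a simpler pattern than the real one, just to show the loop mechanics:

```perl
use strict;
use warnings;

my $data = "a: 1\nb: 2\nc: 3\n";   # made-up sample data
my %hash;

# In scalar context, m//g resumes where the previous match ended, so the
# assignment on the left runs once per match until the pattern fails.
$hash{ $1 } = $2 while $data =~ m[(\w+): (\d+)]g;

print "$_ => $hash{$_}\n" for sort keys %hash;
```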

The only complicated bit is the regex itself, which uses a lookahead (as you suggested) to determine the end of each multi-line record.

The options: /g, match as many times as you can; /x, ignore whitespace and comments inside the pattern; /s, allow '.' to match newlines so that we can pick up your multi-line bits.
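
A small sketch of what those options change, on a made-up two-line string (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $text = "one\ntwo";   # made-up sample

# Without /s, '.' stops at the newline.
my ($no_s) = $text =~ m[ (.+) ]x;

# With /s, '.' also matches the newline, so the capture spans both lines.
my ($with_s) = $text =~ m[ (.+) ]xs;

# /x allows layout and comments inside the pattern; /g in list context
# returns every capture from every match.
my @words = $text =~ m[
    ( \w+ )   # one run of word characters
]gx;

print "without /s: ", length($no_s), " chars\n";
print "with /s:    ", length($with_s), " chars\n";
print "words:      @words\n";
```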

m[
# First we want the key, the text preceding the : 
    (?: \A | \n ) ## from the start of the string or a newline
    ( [^:]+? )    ## capture everything up to the : into $1
    \s*           ## but throw away any trailing spaces
    :             ## preceding the :

# Now grab everything (including newlines) into $2
    (.*?)

# but stop if we find a newline followed 
# by a non-space preceding a :
# or the end of string for the last record.
    (?= # lookahead 
        (?:       # non-capture group containing
           \n     # a newline
           \S     # followed by a non-space
           [^:]*  # and anything except a :
           :      # and a :
        ) 
    |             # OR
       \Z         # the EOS
    )
]gxs;
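
Why a lookahead rather than simply matching the next key? A rough sketch on made-up data: if the terminator is matched normally it gets consumed, and the next record's key is no longer available for the following match. The lookahead only peeks.

```perl
use strict;
use warnings;

my $data = "k1: v1\nk2: v2";   # made-up sample

# Terminator matched normally: "\nk2:" is consumed by the first match,
# so the second record can no longer be found.
my @consumed = $data =~ m[ (\w+) : \s* (.*?) (?: \n \w+ : | \z ) ]gxs;

# Terminator inside a lookahead: nothing is consumed, so the next match
# can start at the following key.
my @peeked = $data =~ m[ (\w+) : \s* (.*?) (?= \n \w+ : | \z ) ]gxs;

print "consumed: @consumed\n";
print "peeked:   @peeked\n";
```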

As for removing the extraneous stuff, incorporating Roy Johnson's simplification, I'd do it like this.

#! perl -slw
use strict;
use Data::Dumper;

my $m = <<'EOM';
Dig No : A081   Prior: 2     Digstrt: 03/30/04  Time: 10:45
Address: 26800 BRADLEY RD                      Subdivsn:
Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
       : TICKET EXPIRES AFTER 04/22/04
Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A
EOM

my %parts;

while(
    $m =~ m[
        (?: \A | \n ) ( [^:]+? ) \s* : 
        (.*?)
        (?= (?: \n \b ) | \Z )
    ]gxs 
) {
    my( $key, $value ) = ( $1, $2 );
    $value =~ s[\n\s+:][]g;
    $parts{ $key } = $value;    
}

print Dumper \%parts;

__END__
P:\test>360501
$VAR1 = {
          'Address' => ' 26800 BRADLEY RD                      Subdivsn:',
          'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A',
          'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. TICKET EXPIRES AFTER 04/22/04',
          'Dig No ' => ' A081   Prior: 2     Digstrt: 03/30/04  Time: 10:45'
        };
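
The code above still leaves 'Prior', 'Digstrt' and 'Time' inside the 'Dig No' value, which the parent post wanted as separate entries. One possible way to pull those apart is sketched below. It rests on an assumption that may not hold for every ticket: keys that appear mid-line are single words followed immediately by a colon, and no value contains whitespace followed by such a word and colon. The names $line and %fields are made up for this example.

```perl
use strict;
use warnings;

# One record line from the sample ticket in this thread.
my $line = 'Dig No : A081   Prior: 2     Digstrt: 03/30/04  Time: 10:45';

my %fields;
while (
    $line =~ m[
        \s* ( [A-Za-z] [\w ]*? ) \s* :      # key, starts with a letter, may hold spaces
        \s* ( .*? )                         # value, as short as possible
        (?= \s+ [A-Za-z] \w* : | \s* \z )   # stop before the next word plus colon, or at the end
    ]gx
) {
    $fields{ $1 } = $2;
}

print "'$_' => '$fields{$_}'\n" for sort keys %fields;
```

The colon inside 10:45 is not mistaken for a key separator because the lookahead insists on whitespace before the next key.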

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
