Re: using lookaround assertions to grab info
by BrowserUk (Patriarch) on Jun 03, 2004 at 21:39 UTC
|
#! perl -slw
use strict;
use Data::Dumper;
my $m = <<'EOM';
Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45
Address: 26800 BRADLEY RD Subdivsn:
Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
: TICKET EXPIRES AFTER 04/22/04
Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A
EOM
my %parts;
$parts{ $1 } = $2 while $m =~ m[
(?: \A | \n ) ( [^:]+ ) \s* :
(.*?)
(?= (?: \n \S [^:]* : ) | \Z )
]gxs;
print Dumper \%parts;
__END__
P:\test>360501
$VAR1 = {
'Address' => ' 26800 BRADLEY RD Subdivs
+n:',
'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG
+C0A',
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N
+O.
: TICKET EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time:
+10:45'
};
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] |
|
|
drats... I composed a reply to this and then clicked somewhere else and lost it. Here is my second try...
Your code seems to produce correct values, but not quite. More on that in a bit. But, since I am an acknowledged noob, I will have to spend quite a bit of time staring at...
$parts{ $1 } = $2 while $m =~ m[
(?: \A | \n ) ( [^:]+ ) \s* :
(.*?)
(?= (?: \n \S [^:]* : ) | \Z )
]gxs;
...to figure out what is going on. I will do that and hopefully learn something, but at first glance it seems a bit beyond me for now.
That said, the result is not what I want. Here is how --
# You have
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
: TICKET EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45'
#
# I want
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. TICKET
+EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081',
'Prior' => 2,
'Digstrt' => '03/30/04',
'Time' => '10:45'
All that said, Roy Johnson's suggestion of splitting the lines on /\n\b/ set me on the right path and did the trick.
Thanks. | [reply] [d/l] [select] |
|
|
I too thought that Roy Johnstone's split /\n\b/, ... was inspired. I wish I had thought of it:)
In terms of breaking down my code. The basic statement is pretty simple. It's just an 'add an element to the hash using $1 and $2 while the regex matches'.
$hash{ $1 } = $2 while $data =~ m[(...): (...)]g
The only complicated bit is the regex itself, which uses a lookahead (as you suggested) to determine the end of each multi-line record.
The options: /g, match as many times as you can; /x, ignor whitespace and comments; /s, allow '.' to match newlines so that we can pick up your multi-line bits.
m[
# First we want the key, the text preceding the :
(?: \A | \n ) ## from the start the string or a newline
( [^:]+? ) ## capture everyline upto the : into $1
\s* ## but throw away any trailing spaces
: ## preceding the :
# Now grab everything (including newlines) into $2
(.*?)
# but stop if we find a newline followed
# by a non-space preceding a :
# or the end of string for the last record.
(?= # lookahead
(?: # non-capture group containing
\n # a newline
\S # follow by a non-space
[^:]* # and anything except a :
: # and a :
)
| # OR
\Z # the EOS
)
]gxs;
As for removing the extraneuos stuff, incorporating Roy Johnstone's simplification, I'd do it like this.
#! perl -slw
use strict;
use Data::Dumper;
my $m = <<'EOM';
Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45
Address: 26800 BRADLEY RD Subdivsn:
Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO.
: TICKET EXPIRES AFTER 04/22/04
Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A
EOM
my %parts;
while(
$m =~ m[
(?: \A | \n ) ( [^:]+? ) \s* :
(.*?)
(?= (?: \n \b ) | \Z )
]gxs
) {
my( $key, $value ) = ( $1, $2 );
$value =~ s[\n\s+:][]g;
$parts{ $key } = $value;
}
print Dumper \%parts;
__END__
P:\test>360501
$VAR1 = {
'Address' => ' 26800 BRADLEY RD Subdivs
+n:',
'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG
+C0A',
'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N
+O. TICKET EXPIRES AFTER 04/22/04',
'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time:
+10:45'
};
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] [select] |
Re: using lookaround assertions to grab info
by Roy Johnson (Monsignor) on Jun 03, 2004 at 21:39 UTC
|
| [reply] [d/l] |
|
|
: sometext
I believe that when I encounter a line like above, I have to lookbehind and say, "Ha, this line doesn't begin with a label, hence it is just a continuation of the value of the previous label."
This is where I am lost. | [reply] [d/l] |
|
|
| [reply] |
|
|
Re: using lookaround assertions to grab info
by Ven'Tatsu (Deacon) on Jun 03, 2004 at 21:26 UTC
|
I would usualy keep a state variable that lets me know what section of the input I'm in.
my $section = ''; #remember the last section label we encountered
foreach (@m) {
if ($_ =~ /^Dig No\s:\s(\w*)\s*Prior:\s*([0-9]*)\s*Digstrt:\s*([0-9]{
+2}\/[0-9]{2}\/[0-9]{2})\s*Time:\s*([0-9]{2}:[0-9]{2})/) {
$section = 'Dig No';
$m{'DIG_NO' } = $1;
$m{'PRIORITY'} = $2;
$m{'DIGDATE' } = $3;
$m{'DIGTIME' } = $4;
} elsif ($_ =~ /^Address\s*:\s*(.*)Subdivsn/) {
$section = 'Address';
$m{'ADDRESS' } = $1;
} elsif ($_ =~ /^Remarks\s*:\s*(.*)/ || $section eq 'Remarks') { #do
+this if we enter the section or were already in the section
$section = 'Remarks';
$m{'REMARKS' } = $1;
}
}
Note that this code is quite simple and will only work if only one section continues accross multiple lines, if you need more than one sections that handles multiple lines the same basic idea can work, but it takes more work. | [reply] [d/l] |
|
|
| [reply] |
|
|
my $section = ''; #remember the last section label we encountered
foreach (@m) {
if (/^Dig No\s:\s(\w*)\s*Prior:\s*([0-9]*)\s*Digstrt:\s*([0-9]{
+2}\/[0-9]{2}\/[0-9]{2})\s*Time:\s*([0-9]{2}:[0-9]{2})/) {
$section = 'DIGTIME';
$m{'DIG_NO' } = $1;
$m{'PRIORITY'} = $2;
$m{'DIGDATE' } = $3;
$m{'DIGTIME' } = $4;
} elsif (/^Address\s*:\s*(.*)Subdivsn/) {
$section = 'ADDRESS';
$m{'ADDRESS' } = $1;
} elsif (/^Remarks\s*:\s*(.*)/ ) {
$section = 'REMARKS';
$m{'REMARKS' } = $1;
} elsif (/^\s*:\s*(.+?)\s*/) {
$m{$section} .= $1;
}
}
| [reply] [d/l] |
|
|
Have you though of extracting the match before the if elsif ... in the foreach loop?
my $section = '';
foreach (@m) {
if (/^\s*([\w\s]*?)\s*:/ && $1) { #if we matched and we captured a s
+ection label
$section = $1;
}
if ($section eq '...') {
...
}
| [reply] [d/l] [select] |
|
|
|
|
Re: using lookaround assertions to grab info
by dragonchild (Archbishop) on Jun 04, 2004 at 01:56 UTC
|
BrowserUk's solution is good. Another, if his code scares you, is to do something like
my @lines = split /\n/, $m;
my %m;
my $curr_index;
foreach my $line (@lines)
{
$line =~ /^\s*(\w+)?\s*:\s*(.*)$/;
next unless $1 || $2;
unless (defined $curr_index)
{
die "First line doesn't have a key\n" unless $1;
$m{$1} = $2;
$curr_index = $1;
next;
}
if ($1)
{
$m{$1} = $2;
$curr_index = $1;
next;
}
$m{$curr_index} .= "\n$2";
}
Untested, but it should work.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
| [reply] [d/l] |
|
|
see my note above regarding Roy Johnson's suggestion. The key point was that any "label" whose value doesn't fit on one line, the value just continues on a subsequent line with an indented colon.
label one: somevalue label two: a very long value
: that does not fit on one line so it continues
: on another line
label thr: someother value
splitting the scalar on /\n\b/ does the trick. It "slurps" the subsequent lines that don't start with a word into the previous label's value. From then on it is just a matter of removing the newlines and the redundant colons.
That said, I have much to learn from your regexp
$line =~ /^\s*(\w+)?\s*:\s*(.*)$/;
next unless $1 || $2;
very neat use of 'next unless'.
Thanks for the help. | [reply] [d/l] [select] |
|
|
Some thoughts about the /\n\b/ idea. It is very inspired, and I ++'ed it. But, it will fail in the following circumstances:
- If you are running on Unix and your email was received on a Mac (or Windows) and copied over using Samba or something similar. (The \n will not match the line ending.)
- If the email has a space at the beginning of a line with a key. (Mine handles this correctly, as does BrowserUk's.)
- Be absolutely sure you know what \b matches. It is a zero-width assertion that matches the boundary between \w\W or \W\w. \w is (basically) [a-zA-Z0-9_]. So, if one of your labels starts with a quote, it won't match.
Now, if your situation avoids the above pitfalls, go right ahead. Comment it, though. If we, with large number of combined years of experience, consider it inspired, your maintenance programmer will consider it demonic and worthy of tracking you down with a bloody axe.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
| [reply] [d/l] |