punkish has asked for the wisdom of the Perl Monks concerning the following question:

I have the following code (contrived example) --
my $m = ""; $m .= "Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45\n"; $m .= "Address: 26800 BRADLEY RD Subdivsn: \n"; $m .= "Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. \n"; $m .= " : TICKET EXPIRES AFTER 04/22/04 \n"; $m .= "Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A \n"; my @m = split("\n", $m); # A hash to hold the extracted fields my %m; foreach (@m) { if ($_ =~ /^Dig No\s:\s(\w*)\s*Prior:\s*([0-9]*)\s*Digstrt:\s*([0-9]{ +2}\/[0-9]{2}\/[0-9]{2})\s*Time:\s*([0-9]{2}:[0-9]{2})/) { $m{'DIG_NO' } = $1; $m{'PRIORITY'} = $2; $m{'DIGDATE' } = $3; $m{'DIGTIME' } = $4; } elsif ($_ =~ /^Address\s*:\s*(.*)Subdivsn/) { $m{'ADDRESS' } = $1; } elsif ($_ =~ /^Remarks\s*:\s*(.*)/) { $m{'REMARKS' } = $1; } }
I get the $m scalar from an email containing the entire body of the email. I split the scalar on newlines, loop over the resulting array, match the labels (text before ':') to get their values (text after the ':'). It works well.

The problem is thusly -- "Remarks" is actually 2 lines long in the above example. In reality it may be any number of lines long. How do I grab all the lines until the next label (in this case "Members", but could be anything...) and make them the value of "Remarks"? Essentially, if a line starts with a

: sometext here
then it really is just a continuation of the value of the last label.

I have being reading the lookahead and lookbehind portions of the Camel's pattern matching chapter, and I feel the answer lies there. But I have not had the right brainwaves thus far.

Guidance will be much appreciated.

Replies are listed 'Best First'.
Re: using lookaround assertions to grab info
by BrowserUk (Patriarch) on Jun 03, 2004 at 21:39 UTC

    This might get you started?

    #! perl -slw use strict; use Data::Dumper; my $m = <<'EOM'; Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45 Address: 26800 BRADLEY RD Subdivsn: Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04 Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A EOM my %parts; $parts{ $1 } = $2 while $m =~ m[ (?: \A | \n ) ( [^:]+ ) \s* : (.*?) (?= (?: \n \S [^:]* : ) | \Z ) ]gxs; print Dumper \%parts; __END__ P:\test>360501 $VAR1 = { 'Address' => ' 26800 BRADLEY RD Subdivs +n:', 'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG +C0A', 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N +O. : TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: +10:45' };

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      drats... I composed a reply to this and then clicked somewhere else and lost it. Here is my second try...

      Your code seems to produce correct values, but not quite. More on that in a bit. But, since I am an acknowledged noob, I will have to spend quite a bit of time staring at...

      $parts{ $1 } = $2 while $m =~ m[ (?: \A | \n ) ( [^:]+ ) \s* : (.*?) (?= (?: \n \S [^:]* : ) | \Z ) ]gxs;
      ...to figure out what is going on. I will do that and hopefully learn something, but at first glance it seems a bit beyond me for now.

      That said, the result is not what I want. Here is how --

      # You have 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45' # # I want 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. TICKET +EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081', 'Prior' => 2, 'Digstrt' => '03/30/04', 'Time' => '10:45'
      All that said, Roy Johnson's suggestion of splitting the lines on /\n\b/ set me on the right path and did the trick.

      Thanks.

        I too thought that Roy Johnstone's split /\n\b/, ... was inspired. I wish I had thought of it:)

        In terms of breaking down my code. The basic statement is pretty simple. It's just an 'add an element to the hash using $1 and $2 while the regex matches'.

        $hash{ $1 } = $2 while $data =~ m[(...): (...)]g

        The only complicated bit is the regex itself, which uses a lookahead (as you suggested) to determine the end of each multi-line record.

        The options: /g, match as many times as you can; /x, ignor whitespace and comments; /s, allow '.' to match newlines so that we can pick up your multi-line bits.

        m[ # First we want the key, the text preceding the : (?: \A | \n ) ## from the start the string or a newline ( [^:]+? ) ## capture everyline upto the : into $1 \s* ## but throw away any trailing spaces : ## preceding the : # Now grab everything (including newlines) into $2 (.*?) # but stop if we find a newline followed # by a non-space preceding a : # or the end of string for the last record. (?= # lookahead (?: # non-capture group containing \n # a newline \S # follow by a non-space [^:]* # and anything except a : : # and a : ) | # OR \Z # the EOS ) ]gxs;

        As for removing the extraneuos stuff, incorporating Roy Johnstone's simplification, I'd do it like this.

        #! perl -slw use strict; use Data::Dumper; my $m = <<'EOM'; Dig No : A081 Prior: 2 Digstrt: 03/30/04 Time: 10:45 Address: 26800 BRADLEY RD Subdivsn: Remarks: DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=NO. : TICKET EXPIRES AFTER 04/22/04 Members: ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSGC0A EOM my %parts; while( $m =~ m[ (?: \A | \n ) ( [^:]+? ) \s* : (.*?) (?= (?: \n \b ) | \Z ) ]gxs ) { my( $key, $value ) = ( $1, $2 ); $value =~ s[\n\s+:][]g; $parts{ $key } = $value; } print Dumper \%parts; __END__ P:\test>360501 $VAR1 = { 'Address' => ' 26800 BRADLEY RD Subdivs +n:', 'Members' => ' ABTL0A AMTCHA CECO5A COMC4A ITHA0A LKFO0A NSG +C0A', 'Remarks' => ' DIRECTIONAL BORING=NO. DEPTH EXCEEDS 7 FEET=N +O. TICKET EXPIRES AFTER 04/22/04', 'Dig No ' => ' A081 Prior: 2 Digstrt: 03/30/04 Time: +10:45' };

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: using lookaround assertions to grab info
by Roy Johnson (Monsignor) on Jun 03, 2004 at 21:39 UTC
    Perhaps you want to split on /\n\b/. That will require new records to begin with a word char (because \n is not a word char).

    The PerlMonk tr/// Advocate
      Perhaps you want to split on /\n\b/. That will require new records to begin with a word char (because \n is not a word char).
      I am not sure how this helps me. In the example I have provided above, I have added the newlines. In reality, the email message is provided to me as a scalar, with all the newlines and all in it already. I just use Mail::IMAPClient to grab the message.

      As I said, the code I have above works fine even with splitting on newlines. Where it fails is its inability to "lookbehind" when it encounters a line without a label -- something that looks like so --

      : sometext
      I believe that when I encounter a line like above, I have to lookbehind and say, "Ha, this line doesn't begin with a label, hence it is just a continuation of the value of the previous label."

      This is where I am lost.

        In reality, the email message is provided to me as a scalar, with all the newlines and all in it already.
        And you want to split on those newlines, except where the next line doesn't begin with a letter. You want those lines that don't begin with a letter not to be split from the previous lines.

        Maybe you should just try it and see what you get.


        The PerlMonk tr/// Advocate
Re: using lookaround assertions to grab info
by Ven'Tatsu (Deacon) on Jun 03, 2004 at 21:26 UTC
    I would usualy keep a state variable that lets me know what section of the input I'm in.
    my $section = ''; #remember the last section label we encountered foreach (@m) { if ($_ =~ /^Dig No\s:\s(\w*)\s*Prior:\s*([0-9]*)\s*Digstrt:\s*([0-9]{ +2}\/[0-9]{2}\/[0-9]{2})\s*Time:\s*([0-9]{2}:[0-9]{2})/) { $section = 'Dig No'; $m{'DIG_NO' } = $1; $m{'PRIORITY'} = $2; $m{'DIGDATE' } = $3; $m{'DIGTIME' } = $4; } elsif ($_ =~ /^Address\s*:\s*(.*)Subdivsn/) { $section = 'Address'; $m{'ADDRESS' } = $1; } elsif ($_ =~ /^Remarks\s*:\s*(.*)/ || $section eq 'Remarks') { #do +this if we enter the section or were already in the section $section = 'Remarks'; $m{'REMARKS' } = $1; } }
    Note that this code is quite simple and will only work if only one section continues accross multiple lines, if you need more than one sections that handles multiple lines the same basic idea can work, but it takes more work.
      Note that this code is quite simple and will only work if only one section continues accross multiple lines, if you need more than one sections that handles multiple lines the same basic idea can work, but it takes more work.
      Thanks for the advice. I did think of such an approach and then discarded it for the very reason you state above. I'll look at it again and see if I can finagle something useful.

      I guess the best way to state the problem is that the value of any label continues until a new label is encountered even if \n is encountered on the way. The labels are distinguished by \s*\w*\s:

        Why not
        my $section = ''; #remember the last section label we encountered foreach (@m) { if (/^Dig No\s:\s(\w*)\s*Prior:\s*([0-9]*)\s*Digstrt:\s*([0-9]{ +2}\/[0-9]{2}\/[0-9]{2})\s*Time:\s*([0-9]{2}:[0-9]{2})/) { $section = 'DIGTIME'; $m{'DIG_NO' } = $1; $m{'PRIORITY'} = $2; $m{'DIGDATE' } = $3; $m{'DIGTIME' } = $4; } elsif (/^Address\s*:\s*(.*)Subdivsn/) { $section = 'ADDRESS'; $m{'ADDRESS' } = $1; } elsif (/^Remarks\s*:\s*(.*)/ ) { $section = 'REMARKS'; $m{'REMARKS' } = $1; } elsif (/^\s*:\s*(.+?)\s*/) { $m{$section} .= $1; } }
        Have you though of extracting the match before the if elsif ... in the foreach loop?
        my $section = ''; foreach (@m) { if (/^\s*([\w\s]*?)\s*:/ && $1) { #if we matched and we captured a s +ection label $section = $1; } if ($section eq '...') { ... }
Re: using lookaround assertions to grab info
by dragonchild (Archbishop) on Jun 04, 2004 at 01:56 UTC
    BrowserUk's solution is good. Another, if his code scares you, is to do something like
    my @lines = split /\n/, $m; my %m; my $curr_index; foreach my $line (@lines) { $line =~ /^\s*(\w+)?\s*:\s*(.*)$/; next unless $1 || $2; unless (defined $curr_index) { die "First line doesn't have a key\n" unless $1; $m{$1} = $2; $curr_index = $1; next; } if ($1) { $m{$1} = $2; $curr_index = $1; next; } $m{$curr_index} .= "\n$2"; }

    Untested, but it should work.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      see my note above regarding Roy Johnson's suggestion. The key point was that any "label" whose value doesn't fit on one line, the value just continues on a subsequent line with an indented colon.
      label one: somevalue label two: a very long value : that does not fit on one line so it continues : on another line label thr: someother value
      splitting the scalar on /\n\b/ does the trick. It "slurps" the subsequent lines that don't start with a word into the previous label's value. From then on it is just a matter of removing the newlines and the redundant colons.

      That said, I have much to learn from your regexp

      $line =~ /^\s*(\w+)?\s*:\s*(.*)$/; next unless $1 || $2;
      very neat use of 'next unless'.

      Thanks for the help.

        Some thoughts about the /\n\b/ idea. It is very inspired, and I ++'ed it. But, it will fail in the following circumstances:
        1. If you are running on Unix and your email was received on a Mac (or Windows) and copied over using Samba or something similar. (The \n will not match the line ending.)
        2. If the email has a space at the beginning of a line with a key. (Mine handles this correctly, as does BrowserUk's.)
        3. Be absolutely sure you know what \b matches. It is a zero-width assertion that matches the boundary between \w\W or \W\w. \w is (basically) [a-zA-Z0-9_]. So, if one of your labels starts with a quote, it won't match.

        Now, if your situation avoids the above pitfalls, go right ahead. Comment it, though. If we, with large number of combined years of experience, consider it inspired, your maintenance programmer will consider it demonic and worthy of tracking you down with a bloody axe.

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested