http://qs1969.pair.com?node_id=569180

SamCG has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse the bodies of some automated emails. One would think it would be easy, but for some reason the generator of the emails does NOT break lines with a \n; it pads them with spaces instead. The number of spaces added seems to vary.

Initially I broke the email into an array by splitting on \s{15,}, but this isn't ideal and drops some of the values. I'm considering splitting on the colons, but I'm not convinced that's a great idea, and it seems to lead to more headaches. Any ideas for a somewhat robust, straightforward way to parse this?

    Security          : BULGY N V-
    Item Overridden   : Earnings Per Share
    Initial Value     : (USD)
    Current Value     : ()
    Overridden Value  : 160 (USD)
    Effective         : 08/20/1999 through 08/20/2000
    Override Type     : Data
    SecurityID        : 1076665
    Sedol             : 2451234
    Cusip             : N66696606
    ISIN              : NL0006122988

Update: Ah, I've found a potential way. while ($bdy=~/(.*?):\s(.*?)\s\s/g) seems to work alright. Comments on this approach?



-----------------
s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;

Re: Parsing semi-erratic text
by ikegami (Patriarch) on Aug 23, 2006 at 19:39 UTC

    The data you submitted *does* have newlines.

        while (<DATA>) {
            my ($key, $val) = /^\s*([^:]*?)\s*:\s*(.*?)\s*$/
                or next;
            print("[$key:$val]\n");
        }

    and

        while (<DATA>) {
            my ($key, $val) = split(/:/, $_, 2);
            next if not defined $val;
            s/^\s+//, s/\s+$// for $key, $val;
            print("[$key:$val]\n");
        }

    both do the trick.
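
Since the email body reportedly lives in a variable rather than a file, here is a minimal sketch of running the second loop against a string through an in-memory filehandle. The helper name `parse_lines` and the sample text are made up for illustration, and the sketch only helps if the body really does contain newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: split each line of $body on its first colon.
sub parse_lines {
    my ($body) = @_;
    my %field;
    # Opening a reference to a scalar gives a filehandle over the string.
    open my $fh, '<', \$body or die "Cannot open in-memory handle: $!";
    while (my $line = <$fh>) {
        my ($key, $val) = split /:/, $line, 2;
        next if not defined $val;               # no colon on this line
        s/^\s+//, s/\s+$// for $key, $val;      # trim both ends
        $field{$key} = $val;
    }
    close $fh;
    return %field;
}

# Made-up sample body with padded labels and trailing spaces.
my $body = "Security          : BULGY N V-      \n"
         . "Effective         : 08/20/1999 through 08/20/2000   \n";
my %field = parse_lines($body);
print "[$_:$field{$_}]\n" for sort keys %field;
```

A hash drops the original field order and keeps only the last value for a repeated label; pushing pairs onto an array instead would preserve both.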

      Hrmm...perhaps an effect of my cutting and pasting? The body of my email gets read into a variable (so it's like slurping a file). I can't seem to split on newlines, and if I use a regex to count I get only one in each email (which I presume is at the end).

      Thank you for the implicit character class ([^:]) suggestion, by the way. I hate putting .* into regexes, even with the non-greedy modifier.
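
When a newline count comes back as one, it can help to look at which separator characters the string actually contains. The following is only a diagnostic sketch: the helper names and the sample string are invented, and nothing in this thread establishes which characters the real emails use.

```perl
use strict;
use warnings;

# Hypothetical helper: count a few candidate line-separator characters.
sub separator_counts {
    my ($body) = @_;
    my %count;
    # "() = ..." forces list context, so each assignment stores a match count.
    $count{LF}  = () = $body =~ /\n/g;
    $count{CR}  = () = $body =~ /\r/g;
    $count{TAB} = () = $body =~ /\t/g;
    return %count;
}

# Hypothetical helper: show non-printable characters as \xNN escapes.
sub show_invisibles {
    my ($body) = @_;
    (my $visible = $body) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $visible;
}

# Made-up sample: one field ends in CR, the other in LF.
my $body  = "Sedol : 2451234  \rCusip : N66696606  \n";
my %count = separator_counts($body);
print "$_=$count{$_}\n" for sort keys %count;
print show_invisibles($body), "\n";
```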

        I agree. .* and .*? usually/often assume the data is formatted correctly.
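
A small sketch of that difference, using two hypothetical helpers and a deliberately malformed made-up line. With a lazy `.*?` the key quietly stretches across a stray colon, while the negated class `[^:]*` refuses to match at all.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns (key, value), or an empty list on failure.
sub parse_lazy {
    my ($line) = @_;
    return $line =~ /^(.*?):\s(.*)$/;
}

sub parse_strict {
    my ($line) = @_;
    return $line =~ /^([^:]*):\s(.*)$/;
}

# The second line is malformed on purpose: it has a colon inside the "key".
for my $line ('Sedol : 2451234', 'a:b: c') {
    my @lazy   = parse_lazy($line);
    my @strict = parse_strict($line);
    printf "%-18s lazy=[%s] strict=[%s]\n", "'$line'",
        join('|', @lazy), join('|', @strict);
}
```

For the well-formed line both patterns return the same pair. For `a:b: c` the lazy pattern yields the key `a:b`, and the strict one yields nothing, which is easier to notice and handle.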
Re: Parsing semi-erratic text
by GrandFather (Saint) on Aug 23, 2006 at 21:07 UTC

    Not quite. Consider the following (note that I've trimmed the number of trailing spaces and retained the line ends, but strip them before matching):

        use strict;
        use warnings;
        use Date::EzDate;

        my $str = <<DATA;
        Security          : BULGY N V-
        Item Overridden   : Earnings Per Share
        Initial Value     : (USD)
        Current Value     : ()
        Overridden Value  : 160 (USD)
        Effective         : 08/20/1999 through 08/20/2000
        Override Type     : Data
        SecurityID        : 1076665
        Sedol             : 2451234
        Cusip             : N66696606
        ISIN              : NL0006122988
        DATA

        $str =~ s/\n//g;
        while ($str =~ /(.*?):\s(.*?)\s\s/g) {
            my ($key, $value) = ($1, $2);
            $key   =~ s/^\s*//;
            $key   =~ s/\s*$//;
            $value =~ s/^\s*//;
            $value =~ s/\s*$//;
            print ">$key: $value<\n";
        }

    Prints:

        >Security: BULGY N V-<
        >Item Overridden: Earnings Per Share<
        >Initial Value: (USD)<
        >Current Value: ()<
        >Overridden Value: 160 (USD)<
        >Effective: 08/20/1999 through 08/20/2000<
        >Override Type: Data SecurityID<
        >: 1076665 Sedol<
        >: 2451234 Cusip<
        >: N66696606 ISIN<
        >: NL0006122988<

    The regex /(\w[\w ]{17}):\s+((?:(?!\w[\w ]{17}:).)*)/g latches on to an 18-character-wide label preceding a : and then grabs characters up to the next label field. The result is:

        >Security: BULGY N V-<
        >Item Overridden: Earnings Per Share<
        >Initial Value: (USD)<
        >Current Value: ()<
        >Overridden Value: 160 (USD)<
        >Effective: 08/20/1999 through 08/20/2000<
        >Override Type: Data<
        >SecurityID: 1076665<
        >Sedol: 2451234<
        >Cusip: N66696606<
        >ISIN: NL0006122988<
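
For anyone who wants to try the fixed-width pattern, here is a self-contained sketch. The helper name `parse_fields` and the three-field sample are invented for illustration, and the sample is built with sprintf so that every label really is padded to 18 characters, which is the assumption the regex depends on.

```perl
use strict;
use warnings;

# Hypothetical helper: pull "label : value" pairs out of one long string.
# Assumes every label is padded to exactly 18 characters before the colon.
sub parse_fields {
    my ($text) = @_;
    $text =~ s/\n//g;    # drop line ends, as in the example above
    my @pairs;
    # The negative lookahead stops the value just before the next label.
    while ($text =~ /(\w[\w ]{17}):\s+((?:(?!\w[\w ]{17}:).)*)/g) {
        my ($key, $value) = ($1, $2);
        s/^\s+//, s/\s+$// for $key, $value;    # trim padding
        push @pairs, [$key, $value];
    }
    return @pairs;
}

# Made-up sample: each value is followed by only ONE trailing space,
# the case that defeats the simpler /(.*?):\s(.*?)\s\s/ approach.
my @sample = (
    ['Override Type', 'Data'],
    ['SecurityID',    '1076665'],
    ['Sedol',         '2451234'],
);
my $text = join '', map { sprintf("%-18s: %s \n", @$_) } @sample;

print ">$_->[0]: $_->[1]<\n" for parse_fields($text);
```

The pattern is tied to the 18-character label width and to labels made only of word characters and spaces, so a generator that changes either would need a different pattern.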

    DWIM is Perl's answer to Gödel