http://qs1969.pair.com?node_id=86919


in reply to Random email parsing

Parsing "free form" documents is always tricky, and you may end up having to encode some special cases. But the trick is to find regularities. As a first step, here's my take:

The news items themselves include the titles, so I would say you can just skip everything up to the line of equal signs. Then, you can read each news item as a paragraph, and consider the first line to be the title.

The following snippet of code stores the news items in %db, using the title as the key and the body as the value (they could just as well be stored in an array, if you want to preserve the order).

use strict;
my $f = 0;
my %db;
$/ = "";                              # paragraph mode: read blank-line-separated chunks
while (<>) {
    $f = 1, next if /^==========/;    # skip everything up to the line of equal signs
    next unless $f;
    my @item = split /\n/, $_, 2;     # first line is the title, the rest is the body
    $db{$item[0]} = $item[1];
}
foreach (keys %db) {
    print "Title: $_\nBody: $db{$_}";
}
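To make the idea concrete, here's the same paragraph-splitting logic applied to a small hypothetical message (the sample text below is made up, not from your data; blank-line splitting stands in for $/ = "" so it runs on a string):

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $mail = <<'END';
Some preamble that should be skipped.

==========================================

First News Title
Body of the first item.

Second News Title
Body of the second item.
END

my $f = 0;
my %db;
# Splitting on runs of blank lines emulates paragraph mode ($/ = "").
for my $para (split /\n{2,}/, $mail) {
    $f = 1, next if $para =~ /^==========/;
    next unless $f;
    my @item = split /\n/, $para, 2;
    $db{$item[0]} = $item[1];
}
print "Title: $_\nBody: $db{$_}\n" for sort keys %db;
```

Both news items end up in %db, keyed by their first line, and the preamble before the equal signs is discarded.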
A further step would be to parse the body. There again, the trick is to find regularities. In the example data you gave, there are three lines of "headers" followed by the text. If this is always the case, something like this could do the trick:
my @body = split /\n/, $body, 4;
And you would end up with the three headers in @body[0,1,2] and the text in $body[3]. If the "three header lines" rule does not apply, you could use some other heuristic. For example, are header lines always less than 40 characters in length? Then you could use something like this (untested):
my @lines = split /\n/, $body;
my @hdr;
my $l;
while (defined($l = shift @lines)) {   # defined() so an empty line doesn't end the loop
    last if length($l) > 40;           # the first "long" line starts the text proper
    push @hdr, $l;
}
$body = join("\n", $l, @lines);
This would leave all the initial short (40 characters or fewer) lines in @hdr, and the rest re-joined with newlines in $body.
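Here's that heuristic run on a hypothetical body (again, made-up sample data) so you can see where the cut lands:

```perl
use strict;
use warnings;

# Hypothetical body: three short "header" lines, then free-form text.
my $body = <<'END';
Date: 2001-06-08
Source: Example Wire
Category: Tech
This is the actual text of the news item, which runs well past the
forty-character cutoff used by the heuristic.
END

my @lines = split /\n/, $body;
my @hdr;
my $l;
while (defined($l = shift @lines)) {
    last if length($l) > 40;   # first line longer than 40 chars starts the text
    push @hdr, $l;
}
$body = join("\n", $l, @lines);

print "Header: $_\n" for @hdr;
print "Text:\n$body\n";
```

The three short lines land in @hdr and the long lines are re-joined into $body. Of course, 40 is just a guess at a threshold; you'd want to check it against your real data.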

--ZZamboni