LoneRanger has asked for the wisdom of the Perl Monks concerning the following question:

I have to parse an email that is sent out daily. The people who send it out just copy and paste from websites around the internet, so the formatting is terrible. It contains various articles about different health-related topics.

The emails are always different, except that each one has: a block of text with all the article titles; a line of 10 = signs separating the titles from the articles; and then the articles themselves. Each article starts with a title that is always in all caps, followed by a header of sorts with source information, etc., then the actual article text, followed by 2 \n (an example is below).

I need to know how to approach this and perhaps some methods to figure this problem out. I'm currently thinking that this can only be solved by implementing a state machine, but I'm not sure.

Thanks, LoneRanger

Cyclosporiasis: Ontario
Cyclosporiasis: Guatemala


November 26, 1999
Infectious Disease News Brief
Health Canada
An outbreak of enteric infection due to Cyclospora cayetanensis diarrhea
occurred in Ontario in the spring of 1999, the fourth consecutive year of
spring-time outbreaks of this parasitic infection in this province. The

November 26, 1999
Infectious Disease News Brief
Health Canada
CDC conducted a study in health-care facilities and among raspberry farm

Replies are listed 'Best First'.
Re: Random email parsing
by ZZamboni (Curate) on Jun 08, 2001 at 18:54 UTC
    Parsing "free form" documents is always tricky, and you may end up having to encode some special cases. But the trick is to find regularities. As a first step, here's my take:

    The news items themselves include the titles, so I would say you can just skip everything up to the line of equal signs. Then, you can read each news item as a paragraph, and consider the first line to be the title.

    The following snippet of code stores the news items in %db, using the title as the key and the "body" as the value (they could just as well be stored in an array, if you want to preserve the order).

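    use strict;
    my $f = 0;      # set once the separator line has been seen
    my %db;
    $/ = "";        # paragraph mode: each read returns one blank-line-delimited block
    while (<>) {
        $f = 1, next if /^==========/;
        next unless $f;
        my @item = split /\n/, $_, 2;   # first line is the title, the rest is the body
        $db{$item[0]} = $item[1];
    }
    foreach (keys %db) {
        print "Title: $_\nBody: $db{$_}";
    }
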
    A further step would be to parse the body. There again, the trick is to find any regularities. In the example data you gave, there are 3 lines of "headers" followed by the text. If this is always the case, something like this could do the trick:
    my @body = split /\n/, $body, 4;
    And you would end up with the three headers in @body[0,1,2] and the text in $body[3]. If the "three header lines" rule does not apply, you could use some other heuristic. For example, are header lines always 40 characters or fewer in length? Then you could use something like this (untested):

    my @lines = split /\n/, $body;
    my @hdr;
    # Move leading lines of 40 characters or fewer into @hdr;
    # stop at the first longer line, which stays in @lines.
    while (@lines) {
        last if length($lines[0]) > 40;
        push @hdr, shift @lines;
    }
    $body = join("\n", @lines);

    Which would leave all the initial lines of 40 characters or fewer in @hdr, and the rest re-joined with newlines in $body.


Re: Random email parsing
by Odud (Pilgrim) on Jun 08, 2001 at 19:32 UTC
    If the titles always appear first in their own block, then it would be useful to extract them first and store them away somewhere. Then, as you process each "message" block, you could mark each title to show that you have found and processed the corresponding message. This would give you a useful check at the end that you haven't missed something.
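
    A minimal sketch of that cross-check, under some assumptions that are not in the original post: the sub names missing_titles and normalize are hypothetical, every non-blank line before the separator counts as a listed title, articles are separated by blank lines, and titles are compared case-insensitively because the article titles are in all caps while the list may not be.

```perl
use strict;
use warnings;

# Hypothetical helper: trim surrounding whitespace and upper-case a title
# so that "Some Title" in the list matches "SOME TITLE" above an article.
sub normalize {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return uc $s;
}

# Hypothetical helper: return the listed titles that never appear as the
# first line of an article block after the separator.
sub missing_titles {
    my ($text) = @_;

    # Split once at the separator line (10 or more '=' signs).
    my ($index, $rest) = split /^={10,}[ \t]*\n/m, $text, 2;
    return () unless defined $rest;

    # Assumption: every non-blank line before the separator is a title.
    my @listed = grep { /\S/ } split /\n/, $index;

    # Assumption: articles are separated by one or more blank lines.
    my %found;
    for my $block (grep { /\S/ } split /\n{2,}/, $rest) {
        (my $trimmed = $block) =~ s/^\s+//;
        my ($title) = split /\n/, $trimmed, 2;
        $found{ normalize($title) } = 1;
    }

    return grep { !$found{ normalize($_) } } @listed;
}
```

    Calling missing_titles($email_text) at the end of a run gives the list of titles that were announced but never matched, which could be logged or mailed back to the senders.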

    A different way of looking at things would be to provide a form-based way for the users to input the article. Then you could have fields for title, source, date, text, etc. And of course you would have more control and could do validation and reformatting at the data entry point.

Re: Random email parsing
by asiufy (Monk) on Jun 08, 2001 at 18:41 UTC
    I definitely suggest you use a state machine. It'll help you with the maintenance of the parser in case the format changes (and chances are that it will).
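
    As a rough illustration of what such a state machine could look like, here is a sketch built on assumptions taken from the thread rather than from any confirmed format: the sub name parse_email and the variable $HEADER_LINES are hypothetical, each article is assumed to have exactly three header lines, and a blank line is assumed to end an article.

```perl
use strict;
use warnings;

# Assumption: every article has exactly this many header lines.
my $HEADER_LINES = 3;

# Hypothetical parser: walks the email line by line and returns a list of
# hash refs with the keys title, headers and body.
sub parse_email {
    my ($text) = @_;
    my $state = 'index';    # index -> title -> header -> body -> title ...
    my (@articles, $cur);

    for my $line (split /\n/, $text) {
        if ($state eq 'index') {
            # Skip the title list until the separator line is seen.
            $state = 'title' if $line =~ /^={10,}\s*$/;
        }
        elsif ($state eq 'title') {
            next unless $line =~ /\S/;    # ignore blank lines between articles
            $cur = { title => $line, headers => [], body => [] };
            push @articles, $cur;
            $state = 'header';
        }
        elsif ($state eq 'header') {
            # A blank line here means the article ended early.
            if ($line !~ /\S/) { $state = 'title'; next }
            push @{ $cur->{headers} }, $line;
            $state = 'body' if @{ $cur->{headers} } == $HEADER_LINES;
        }
        elsif ($state eq 'body') {
            if ($line =~ /\S/) { push @{ $cur->{body} }, $line }
            else               { $state = 'title' }    # blank line ends the article
        }
    }
    return @articles;
}
```

    If the format changes, for example to a different number of header lines or a new separator, only the transition for the affected state needs to be touched. Articles that contain blank lines inside their text would need an extra rule, such as only accepting an all-caps line as the next title.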