I have to parse an email that is sent out daily. The people who send it just copy and paste from websites around the internet, so the formatting is terrible. It contains various articles on different health-related topics.

The emails are always different, except for the following: a block of text listing all the article titles; a line of 10 = signs separating the titles from the articles; and the articles themselves. Each article starts with a title that is always in all caps, followed by a header of sorts with source information, etc., then the actual article text, followed by 2 \n (an example is below).

I need to know how to approach this, and perhaps some methods for working the problem out. I'm currently thinking that it can only be solved by implementing a state machine, but I'm not sure.

Thanks, LoneRanger
FSNET NOVEMBER 29, 1999

Cyclosporiasis: Ontario
Cyclosporiasis: Guatemala

==========

CYCLOSPORIASIS: ONTARIO
November 26, 1999
Infectious Disease News Brief
Health Canada
An outbreak of enteric infection due to Cyclospora cayetanensis diarrhea
occurred in Ontario in the spring of 1999, the fourth consecutive year of
spring-time outbreaks of this parasitic infection in this province. The

CYCLOSPORIASIS: GUATEMALA
November 26, 1999
Infectious Disease News Brief
Health Canada
CDC conducted a study in health-care facilities and among raspberry farm
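
For illustration, here is a minimal sketch of the state-machine idea applied to the layout described above. It is written in Python purely as an example, and it is one possible approach rather than a definitive one. The names `parse_digest`, `SEPARATOR` and `SAMPLE` are made up for this sketch. It assumes that a title is any all-caps line that follows a blank line, and that any other text after a blank line is a further paragraph of the current article. The header lines (date, source) are left at the top of each article's `lines`, since nothing in the description fixes how many there are.

```python
SEPARATOR = "=" * 10  # the line of ten '=' signs between index and articles


def parse_digest(text):
    """Split one digest into its index lines and a list of articles.

    States: INDEX (before the separator), BETWEEN (after a blank line,
    waiting for a title or a continuation paragraph), ARTICLE (collecting
    non-blank lines of the current article).
    """
    state = "INDEX"
    index_lines = []
    articles = []
    current = None

    for raw in text.splitlines():
        line = raw.strip()
        if state == "INDEX":
            if line == SEPARATOR:
                state = "BETWEEN"
            elif line:
                # Note: this also keeps the banner line at the very top.
                index_lines.append(line)
        elif state == "BETWEEN":
            if line.isupper():
                # Assumption: an all-caps line after a blank line is a title.
                current = {"title": line, "lines": []}
                articles.append(current)
                state = "ARTICLE"
            elif line and current is not None:
                # Assumption: any other text is another paragraph of the
                # article already being read.
                current["lines"].extend(["", line])
                state = "ARTICLE"
        else:  # state == "ARTICLE"
            if line:
                current["lines"].append(line)
            else:
                state = "BETWEEN"
    return index_lines, articles


# Shortened copy of the example e-mail above.
SAMPLE = """
FSNET NOVEMBER 29, 1999

Cyclosporiasis: Ontario
Cyclosporiasis: Guatemala

==========

CYCLOSPORIASIS: ONTARIO
November 26, 1999
Infectious Disease News Brief
Health Canada
An outbreak of enteric infection

CYCLOSPORIASIS: GUATEMALA
November 26, 1999
Infectious Disease News Brief
Health Canada
CDC conducted a study
"""

index_lines, articles = parse_digest(SAMPLE)
print([a["title"] for a in articles])
# ['CYCLOSPORIASIS: ONTARIO', 'CYCLOSPORIASIS: GUATEMALA']
```

A known weak spot of this sketch: a body paragraph that happens to begin with a short all-caps line would be mistaken for a title. Checking each candidate title against the upper-cased index lines would be one way to guard against that.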

In reply to Random email parsing by LoneRanger
