I have to parse an email that is sent out daily. The people who send it out just copy and paste from websites around the internet, so the formatting is terrible. It contains various articles on different health-related topics.

The emails are always different, except for the following: a block of text listing all the article titles; a line of 10 = signs separating the titles from the articles; and, for each article, a title that is always in all caps, followed by a header of sorts with source information, etc., then the actual article, followed by 2 \n. (An example is below.)

I need to know how to approach this, and perhaps some methods for working the problem out. My current thinking is that it can only be solved by implementing a state machine, but I'm not sure.
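To make the state-machine idea concrete, here is a rough Python sketch of what I have in mind. It is only a sketch under assumptions: the names parse_digest, is_title and SEPARATOR are made up for illustration, the separator is assumed to be a line of exactly ten = signs, and a title is assumed to be any all-caps line that follows a blank line (or the separator). An all-caps line inside an article that happens to follow a blank line would be misread as a title, so the heuristic would probably need tuning against real emails.

```python
SEPARATOR = "=" * 10  # assumption: the separator line is exactly ten '=' signs


def is_title(line):
    """Assumed heuristic: the line has letters and none of them are lowercase."""
    return any(c.isalpha() for c in line) and line == line.upper()


def parse_digest(text):
    """Split one email into (toc_titles, articles).

    Each article is a dict with a 'title' and the non-blank 'lines' that
    follow it (source header and body together).
    """
    toc, articles = [], []
    state = "TOC"        # two states: "TOC" (title block) and "ARTICLES"
    prev_blank = True    # a title is only accepted right after a blank line

    for raw in text.splitlines():
        line = raw.strip()

        if state == "TOC":
            if line == SEPARATOR:
                state = "ARTICLES"   # everything after this is article text
                prev_blank = True
            elif line:
                toc.append(line)
            continue

        # state == "ARTICLES"
        if prev_blank and is_title(line):
            articles.append({"title": line, "lines": []})
        elif articles and line:
            articles[-1]["lines"].append(line)
        prev_blank = not line

    return toc, articles
```

Splitting the source header (date, publication, agency) from the article body is not attempted here, since I don't see a reliable marker between the two.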

Cyclosporiasis: Ontario
Cyclosporiasis: Guatemala


November 26, 1999
Infectious Disease News Brief
Health Canada
An outbreak of enteric infection due to Cyclospora cayetanensis diarrhea
occurred in Ontario in the spring of 1999, the fourth consecutive year of
spring-time outbreaks of this parasitic infection in this province. The

November 26, 1999
Infectious Disease News Brief
Health Canada
CDC conducted a study in health-care facilities and among raspberry farm