PerlMonks  


I have to parse an email that is sent out daily. The people who send it just copy and paste from websites around the internet, so the formatting is terrible. It contains various articles on different health-related topics.

The emails are always different except for the following: a block of text listing all the article titles, then a line of 10 = signs separating the titles from the articles, then the articles themselves. Each article has a title that is always in all caps, followed by a header of sorts with source information, etc., then the actual article, followed by 2 \n (an example is below).

I need to know how to approach this, and perhaps some methods for working the problem out. I'm currently thinking that this can only be solved by implementing a state machine, but I'm not sure.

Thanks, LoneRanger

FSNET NOVEMBER 29, 1999

Cyclosporiasis: Ontario
Cyclosporiasis: Guatemala

==========

CYCLOSPORIASIS: ONTARIO
November 26, 1999
Infectious Disease News Brief
Health Canada
An outbreak of enteric infection due to Cyclospora cayetanensis diarrhea
occurred in Ontario in the spring of 1999, the fourth consecutive year of
spring-time outbreaks of this parasitic infection in this province. The

CYCLOSPORIASIS: GUATEMALA
November 26, 1999
Infectious Disease News Brief
Health Canada
CDC conducted a study in health-care facilities and among raspberry farm
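
The state-machine idea raised in the question can be sketched roughly as follows. This is only a minimal sketch, written in Python for illustration, and not a definitive solution. The function names `parse_digest` and `is_title`, the state names, and the rule that a non-title line after a blank line continues the previous article are all assumptions; the post does not specify them. The sketch also does not try to split the source header from the article text, since the post gives no rule for where one ends and the other begins.

```python
SEPARATOR = "=" * 10  # the line of 10 '=' signs described in the post


def is_title(line):
    """True if the line contains letters and none of them are lowercase."""
    stripped = line.strip()
    has_letters = any(c.isalpha() for c in stripped)
    return has_letters and stripped == stripped.upper()


def parse_digest(text):
    """Split one email body into (preamble, articles).

    preamble: non-blank lines before the separator (digest header line
              plus the list of titles).
    articles: list of dicts {"title": str, "lines": [str, ...]}, where
              "lines" holds the source header and the article text together.
    """
    state = "PREAMBLE"  # PREAMBLE -> BETWEEN -> IN_ARTICLE -> BETWEEN -> ...
    preamble = []
    articles = []
    current = None

    for raw in text.splitlines():
        line = raw.rstrip()
        if state == "PREAMBLE":
            if line == SEPARATOR:
                state = "BETWEEN"
            elif line:
                preamble.append(line)
        elif state == "BETWEEN":
            if is_title(line):
                # An all-caps line starts a new article.
                current = {"title": line.strip(), "lines": []}
                articles.append(current)
                state = "IN_ARTICLE"
            elif line and current is not None:
                # Assumption: a non-title line after a blank line is a new
                # paragraph of the previous article, not a new article.
                current["lines"].extend(["", line])
                state = "IN_ARTICLE"
        else:  # IN_ARTICLE
            if line:
                current["lines"].append(line)
            else:
                state = "BETWEEN"  # a blank line may end the article

    return preamble, articles
```

Run on the sample above, `parse_digest` would return a preamble holding the FSNET line and the two listed titles, plus two articles titled CYCLOSPORIASIS: ONTARIO and CYCLOSPORIASIS: GUATEMALA. An all-caps line inside an article body that follows a blank line would be mistaken for a title, so checking each candidate title against the list in the preamble may be a useful extra safeguard.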

In reply to Random email parsing by LoneRanger
