PerlMonks  

Parsing "free form" documents is always tricky, and you may end up having to encode some special cases. But the trick is to find regularities. As a first step, here's my take:

The news items themselves include the titles, so I would say you can just skip everything up to the line of equal signs. Then, you can read each news item as a paragraph, and consider the first line to be the title.

The following snippet of code stores the news items in %db, keyed by title, with the "body" as the value (they could just as well be stored in an array, if you want to preserve the order).

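use strict;

my $f = 0;     # set once the line of equal signs has been seen
my %db;        # title => body
$/ = "";       # paragraph mode: each read returns one blank-line-separated block

while (<>) {
    $f = 1, next if /^==========/;   # separator found; items start after it
    next unless $f;                  # still in the preamble
    my @item = split /\n/, $_, 2;    # first line is the title, the rest is the body
    $db{$item[0]} = $item[1];
}

foreach (keys %db) {
    print "Title: $_\nBody: $db{$_}";
}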
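If the order of the items matters, the array variant mentioned above could look roughly like this. It is only a sketch under a few assumptions: the sub name parse_items and the hash keys title and body are made up for illustration, the sample text is invented, and the sub takes the whole text as a string (instead of reading from <>) so it can be tried on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: split a digest into news items, keeping their order.
# Assumes items are separated by blank lines and that everything up to the
# paragraph starting with a line of equal signs is preamble.
sub parse_items {
    my ($text) = @_;
    my @items;
    my $seen_separator = 0;
    for my $para (split /\n{2,}/, $text) {
        if ($para =~ /^==========/) { $seen_separator = 1; next; }
        next unless $seen_separator;
        my ($title, $body) = split /\n/, $para, 2;
        next unless defined $title && length $title;
        $body = '' unless defined $body;
        $body =~ s/\n+\z//;              # drop trailing newlines
        push @items, { title => $title, body => $body };
    }
    return @items;
}

# Invented sample input, just to show the shape of the result.
my $sample = join "\n",
    'Some preamble', '', '==========', '',
    'First title', 'line a', 'line b', '',
    'Second title', 'line c', '';

for my $item (parse_items($sample)) {
    print "Title: $item->{title}\nBody: $item->{body}\n\n";
}
```

An array of small hashes also avoids losing an item when two items happen to share the same title, which the %db version would silently overwrite.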
A further step would be to parse the body. There again, the trick is to find any regularities. In the example data you gave, there are 3 lines of "headers" followed by the text. If this is always the case, something like this could do the trick:
my @body = split /\n/, $body, 4;
And you would end up with the three headers in @body[0,1,2] and the text in $body[3]. If the "three header lines" rule does not always apply, you could use some other heuristic. For example, are header lines always less than 40 characters in length? Then you could use something like this (untested):
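my @lines = split /\n/, $body;
my @hdr;
my $l;
while (defined($l = shift @lines)) {
    last if length($l) >= 40;   # first long line: the text starts here
    push @hdr, $l;
}
# $l is undef if every line was short, so filter it out
$body = join("\n", grep { defined } $l, @lines);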
This would leave all the initial lines shorter than 40 characters in @hdr, and the rest re-joined with newlines in $body.
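To try the length heuristic on its own, it can be wrapped in a small sub. This is a sketch, not a tested solution for the real data: split_headers, its optional $max parameter and the sample body below are all made up for illustration, and the only rule it applies is "leading lines shorter than $max characters are headers".

```perl
use strict;
use warnings;

# Hypothetical helper: peel leading "header" lines off an item body.
# Assumption: a header is any leading line shorter than $max characters.
sub split_headers {
    my ($body, $max) = @_;
    $max = 40 unless defined $max;
    my @lines = split /\n/, $body;
    my @hdr;
    while (@lines && length($lines[0]) < $max) {
        push @hdr, shift @lines;
    }
    # Returns a reference to the header lines and the remaining text.
    return (\@hdr, join("\n", @lines));
}

# Invented sample body: three short lines followed by the text.
my $body = join "\n",
    'Source: Example Wire',
    'Date: some day',
    'Section: Tech',
    'This first line of the text is comfortably longer than forty characters.',
    'Short tail.';

my ($hdr, $text) = split_headers($body);
print "Header: $_\n" for @$hdr;
print "Text:\n$text\n";
```

Note that only the leading short lines are treated as headers; a short line later in the text (like the last one in the sample) stays in the text. If every line is short, everything ends up in the header list and the text is empty, so the threshold needs checking against real data.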

--ZZamboni


In reply to Re: Random email parsing by ZZamboni
in thread Random email parsing by LoneRanger
