Parsing News from a Site Backend

Segfault has asked for the wisdom of the Perl Monks concerning the following question:

I've decided to write a script to parse the news headlines from BeOSCentral.com, using the back-end that provides the info in one big chunk of plain text. It's a long, single line of text, with multiple parts each in this format:

%% Headline Text <date> <time> <article URL> <brief description>

My question is, how can I separate each of these blocks of data from this big block that I get from the back-end, and then work with the headline, date, etc. as I wish? What would be the best approach to splitting this up?

Thanks for any tips you can provide. :-)


Replies are listed 'Best First'.
Re: Parsing News from a Site Backend by chromatic (Archbishop) on Feb 28, 2001 at 10:55 UTC
I'm surprised they don't have an RSS feed. Anyway, it looks like you can set $/ to "%%\n", read in a chunk at a time, then split on \n to get the individual lines. To separate the date and time, split on a space. If you fetch the data with LWP::Simple, then don't set $/; just split on "%%\n". :)
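A minimal sketch of the $/ approach described above, assuming each record starts with a `%%` line and holds one field per line (headline, date and time, URL, description). The sample data, the `parse_records` name, and the example.com URLs are hypothetical stand-ins, not the real feed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; the real back-end layout is assumed, not confirmed.
my $sample = join '',
    "%%\nFirst headline\n2001-02-27 18:30:00\nhttp://example.com/1\nFirst description\n",
    "%%\nSecond headline\n2001-02-28 09:15:00\nhttp://example.com/2\nSecond description\n";

# Read one record per <$fh> by making "%%\n" the input record separator.
sub parse_records {
    my ($fh) = @_;
    local $/ = "%%\n";              # restored automatically when the sub returns
    my @records;
    while (my $chunk = <$fh>) {
        chomp $chunk;               # chomp removes a trailing $/, i.e. "%%\n"
        next unless $chunk =~ /\S/; # skip the empty chunk before the first "%%"
        my ($headline, $stamp, $url, $desc) = split /\n/, $chunk;
        next unless defined $stamp; # skip malformed records
        my ($date, $time) = split ' ', $stamp;
        push @records, {
            headline    => $headline,
            date        => $date,
            time        => $time,
            url         => $url,
            description => $desc,
        };
    }
    return @records;
}

# An in-memory filehandle stands in for the real input source.
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my @records = parse_records($fh);
close $fh;

printf "%s (%s at %s)\n", @{$_}{qw(headline date time)} for @records;
```

Using `local $/` keeps the changed separator from leaking into the rest of the program, which matters if other code reads line by line afterwards.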
(dkubb) Re: (2) Parsing News from a Site Backend by dkubb (Deacon) on Feb 28, 2001 at 12:53 UTC
Here's some example code that parses the BeOS Central headlines page and returns an array of hash refs:

```perl
#!/usr/bin/perl -w
use strict;
use LWP::Simple qw(get);
use Data::Dumper qw(Dumper);

#Define the columns to parse out
my @columns = qw(
    headline
    year month day
    hour minute second
    url
    description
);

#Generate a regex to fetch the column data
my $regex = join '\n', (
    '([^\n]+)',
    '(\d{4})-(\d{2})-(\d{2})\s(\d{2}):(\d{2}):(\d{2})',
    '([^\n]+)',
    '(.*?)\s+',  #match everything, except the last bit of whitespace
);

#get the web page
my $text = get('http://www.beoscentral.com/headlines.php');

my @rows;
foreach my $record (split "%%\n", $text) {
    my %row;
    @row{@columns} = ($record =~ /^$regex$/so) or next;
    push @rows, \%row;
}

print Dumper(\@rows);

__END__
```

I am sure the regex could be done in a faster, better, or more elegant way, but the answer eludes me at this time.

Update: Removed the /g modifier from the regex. It was a useless addition to the regex in this case.
Re: Parsing News from a Site Backend by Yohimbe (Pilgrim) on Feb 28, 2001 at 10:57 UTC
Something like this works fine. It's tested.

```perl
#!/usr/bin/perl
undef $/;    # slurp mode: read all of STDIN in one go
my $text=<STDIN>;
my @data=split(/%%\n/,$text);
my ($headline,$datetime,$url,$desc);
foreach (@data) {
    ($headline,$datetime,$url,$desc)=split(/\n/,$_);
    print <<EOF;
Headline: $headline <BR>
Date and Time: $datetime<BR>
URL: $url<BR>
Desc: $desc<BR>
EOF
    next;
}
```

-- Jay "Yohimbe" Thorne, alpha geek for UserFriendly
Re: Parsing News from a Site Backend by archon (Monk) on Feb 28, 2001 at 11:00 UTC
In order to provide a regular expression, we would need to know more about the data line. Are there characters (e.g. tabs) that separate each field? Are the fields fixed width? Are those <> characters actually part of the line? You can't just split up a line of arbitrary text without something to go on. With regard to regular expressions, you might find it helpful to write down exactly how the line needs to be split up. You can then turn your individual sentences into fragments of your regular expression. For more information, check the perlre manpage and the perlfunc entry for `unpack`. You might also want to look into getting Mastering Regular Expressions.
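To illustrate the "sentences into fragments" idea, here is a rough sketch using the /x modifier so that each written-down sentence can sit next to the fragment it became. It assumes one field per line, which the original question does not confirm; the sample record and the `$record_re` name are made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record; the one-field-per-line layout is an assumption.
my $record = "Some headline\n2001-02-28 10:55:00\nhttp://example.com/story\nA short description\n";

# Each comment is the plain-English sentence behind the fragment beside it.
my $record_re = qr{
    \A
    ([^\n]+) \n               # The headline is everything up to the first newline.
    (\d{4}-\d{2}-\d{2})       # Then comes a date such as 2001-02-28,
    [ ]                       # followed by a single space,
    (\d{2}:\d{2}:\d{2}) \n    # and a time such as 10:55:00.
    (\S+) \n                  # The URL is a run of non-whitespace on its own line.
    (.*?)                     # The rest is the description,
    \s* \z                    # minus any trailing whitespace.
}xs;

my ($headline, $date, $time, $url, $description) = $record =~ $record_re
    or die "Record did not match the expected layout\n";

print "$headline | $date | $time | $url | $description\n";
```

If the fields turned out to be fixed width instead, `unpack` with an `A` template would likely be the simpler tool, but that depends on details the question leaves open.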
Re: Parsing News from a Site Backend by princepawn (Parson) on Feb 28, 2001 at 22:21 UTC
Also see NewsClipper