I don't think you'll need HTML::Parser: your HTML
isn't really clean enough for that, and you
don't want to deal with all the nasty font tags.
A few patterns should get the job done much faster.
I only tested with two reviews; here's what I've got so far:
# find the place where the features start
# (a while loop, so we stop at end of file instead of looping forever):
while (<IN>) {
    last if m/face=arial color=white/;
}
# now read the dvd features
FEATURE:
while (<IN>) {
    last FEATURE if m/blcorn\.jpg/ or m/td/;
    # $1 is the one-letter image name, $3 the feature text after it
    m/src="(.)\.jpg">( )*([^<]+)/
        and print "$3: $1\n";
}
# skip ahead to the paragraph that holds the title
while (<IN>) {
    last if /<p>/;
}
$_ =~ s/.*<b>//;
$_ =~ s/<.*//;
print "Title: $_";
$_ = <IN>;
s/^\s*//;
s/<.*//;
print $_; # that's the studio
$_ = <IN>;
s/^\s*//;
$_ =~ s/Reviewed by: //;
$_ =~ s|</font>||;
$_ =~ s|</p>||;
print $_; # that's the reviewer
# now for the text of the review:
REVIEW:
while (<IN>) {
    last REVIEW if m/<table/ or m/center/;
    $r .= $_;
}
# now some magic to clean up the font-soup:
$r =~ s|<font[^>]*>||g;
$r =~ s|</font>||g;
# now some magic to turn <br> <br> into <p>:
$r =~ s|<b>\s*<br>|<br><b>|gs; # extra magic
$r =~ s|<br>(\s*<br>)+|<p>|gs;
# now we can recognize the headlines, and turn them into <h2>
$r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|gs;
print $r;
close IN;
Aargh, data munging at its dirtiest. You should probably
read davorg's book, Data Munging with Perl, and learn
how to do this in a more organized, controllable way.
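For what it's worth, here is a minimal sketch of what "more organized" could look like: the same font-soup substitutions as above, wrapped in a sub so you can test them on a string without opening a file. The name clean_review and the sample string are made up for illustration, they are not taken from your pages.

```perl
use strict;
use warnings;

# Hypothetical helper: same substitutions as in the script above,
# applied to a string that is passed in and returned.
sub clean_review {
    my ($r) = @_;
    $r =~ s|<font[^>]*>||g;        # drop opening font tags
    $r =~ s|</font>||g;            # drop closing font tags
    $r =~ s|<b>\s*<br>|<br><b>|g;  # move a <br> out of a bold run
    $r =~ s|<br>(\s*<br>)+|<p>|g;  # collapse runs of <br> into <p>
    # a bold line right after <p> is treated as a headline
    $r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|g;
    return $r;
}

# Invented sample input, only to exercise the patterns
my $sample = "<font face=arial>Intro text.</font><br>\n"
           . "<br><b>Video</b><br>\nLooks sharp.\n";
print clean_review($sample);
```

With the cleanup in one sub, you can feed it snippets from a few more reviews and see quickly which patterns break, instead of rerunning the whole script.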
P.S. If you were serious about giving away DVDs:
Brigitte Jellinek
Horus IT GmbH
Jakob Haringer Str. 8
5020 Salzburg
Austria
EUROPE