Re: Find and replace

I don't think you'll need HTML::Parser. Your HTML isn't really clean enough for that, and you don't want to deal with all the nasty font tags.

A few patterns should get the job done much faster.

I only tested with two reviews; here's what I've got so far:

        # find the place where the features start:

        # (a plain while loop also stops at end of file if the marker never appears)
        while (<IN>) {
                last if m/face=arial color=white/;
        }

        # now read the dvd features

        FEATURE:
        while(<IN>) {
                last FEATURE if m/blcorn\.jpg/ or m/td/;

                # $1 is the single-character image name, $2 the feature label
                m/src="(.)\.jpg">(?:&nbsp;)*([^<]+)/
                     and print "$2: $1\n";
        }

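        # skip ahead to the paragraph that holds the title
        while (<IN>) {
                last if /<p>/;
        }
        s/.*<b>//;      # drop everything up to the bold title
        s/<.*//;        # drop the markup after it
        print "Title: $_";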

        $_ = <IN>;
        s/^\s*//;
        s/<.*//;
        print $_;  # that's the studio

        $_ = <IN>;
        s/^\s*//;
        $_ =~ s/Reviewed by: //;
        $_ =~ s|</font>||;
        $_ =~ s|</p>||;
        print $_;  # that's the reviewer

        # now for the text of the review:
        my $r = '';     # start with an empty buffer for each review
        REVIEW:
        while ( $_ = <IN> ) {
                last REVIEW if m/<table/ or m/center/;
                $r .= $_;
        }

        # now some magic to clean up the font-soup:

        $r =~ s|<font[^>]*>||g;
        $r =~ s|</font>||g;

        # now some magic to turn <br> <br> into <p>:

        $r =~ s|<b>\s*<br>|<br><b>|gs;   # extra magic
        $r =~ s|<br>(\s*<br>)+|<p>|gs;

        # now we can recognize the headlines, and turn them into <h2>

        $r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|gs;

        print $r;
        close IN;
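The tag cleanup at the end of the script can be tried on its own, without a review file. This is only a sketch: the sub name `clean_review` and the sample markup are made up for illustration, and real reviews may well need more patterns than these.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the cleanup patterns from the script above.
sub clean_review {
    my ($r) = @_;

    # strip the font soup
    $r =~ s|<font[^>]*>||g;
    $r =~ s|</font>||g;

    # move a <br> that sits just inside <b> in front of it,
    # then collapse runs of <br> into a single <p>
    $r =~ s|<b>\s*<br>|<br><b>|g;
    $r =~ s|<br>(\s*<br>)+|<p>|g;

    # a bold line right after a <p> is treated as a headline
    $r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|g;

    return $r;
}

# Made-up sample, not taken from a real review page.
my $sample = "<font face=arial size=2>Intro text.</font><br>\n"
           . "<b><br>\nVideo</b><br>\n"
           . "<font color=white>A clean transfer.</font><br>\n<br>\n"
           . "<b>Audio</b><br>\nSolid 5.1 mix.\n";

print clean_review($sample);
```

For this sample the two bold lines come out as `<h2>Video</h2>` and `<h2>Audio</h2>`, and all font tags are gone. Note that the patterns are case-sensitive, so a page that writes `<FONT>` or `<BR>` would need the `i` flag.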

Aargh, data munging at its dirtiest. You should probably read davorg's book, Data Munging with Perl, and learn how to do this in a more organized, controllable way.

P.S. If you were serious about giving away DVDs:

Brigitte Jellinek
Horus IT GmbH
Jakob Haringer Str. 8
5020 Salzburg
Austria
EUROPE

Re: Re: Find and replace by davorg (Chancellor) on Oct 19, 2001 at 12:56 UTC
There's a section in my book (section 8.2) that explains in detail exactly why you shouldn't parse HTML with regular expressions. HTML::TreeBuilder (which, I think, would be the most useful module for this task) goes to great lengths to try to make sense of invalid HTML. It'll handle just about anything that you throw at it.

--
<http://www.dave.org.uk>
"The first rule of Perl club is you don't talk about Perl club."
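For comparison, here is a rough sketch of what the HTML::TreeBuilder route might look like for pulling the bold headlines out of a review. It assumes the module is installed from CPAN; the sub name `bold_texts` and the sample markup (with a deliberately unclosed font tag) are invented for illustration, not taken from the real pages.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every <b> element in the page.
sub bold_texts {
    my ($html) = @_;

    # The parser builds a tree even when tags are left unclosed.
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @texts = map { $_->as_text } $tree->look_down( _tag => 'b' );

    $tree->delete;    # release the tree when done
    return @texts;
}

# Made-up sample with sloppy markup.
my $html = '<font face=arial><b>Video</b><br>Clean transfer.<br>'
         . '<b>Audio</b><br>Solid mix.';

print "$_\n" for bold_texts($html);
```

The point of the sketch is that the code asks for elements by tag name and never has to care how the font tags around them are nested or whether they are closed.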
Re: Re: Find and replace by dvdauthority (Initiate) on Oct 18, 2001 at 22:10 UTC
First off, thanks for all the help/input, guys... guess my .asp comment wasn't that bad after all! Ha. I do have most of my stuff stored in a SQL database, so I have that base covered. It has links, author, studio and so on. I'll start off by saying that I have no clue how Perl works and I have no experience in it. I am serious about the DVD thing. If this will work, it will save me three or four weeks of hard programming. Drop me a line and hopefully we can see if we can go somewhere from there. Thanks again...

Matt
matt@dvdauthority.com
Re: Re: Re: Find and replace by alien_life_form (Pilgrim) on Oct 19, 2001 at 13:19 UTC
Greetings. If you really have no Perl clue/experience... you're in for three or four weeks of hard Perl learning, followed by three or four hours of not extremely hard programming. Of course, at the end of the day you'll end up as somebody who knows Perl, as opposed to somebody who knows how to cut and paste.... If you don't have that kind of time, a hired Perl gun may be your next best choice.

As for the issue at hand, I still stand by HTML::TreeBuilder, and not only for aesthetic reasons. But regardless, you should know that, when your review cleaning code is ready, you could top everything off by using DBI to stuff the reviews into your SQL database, directly from Perl.

Cheers,
alf