dvdauthority has asked for the wisdom of the Perl Monks concerning the following question:

Hello all... I have got a BIG problem. I have a website that has around 2200 .htm pages that I'm converting to .asp (sorry, that may be a bad word here). But I need the mighty power of Perl to parse out everything from my .htm files so that I can have my reviews (it's a DVD review site) separate from my code. Is there any way to do this? I reeeaaaaaalllyyy don't want to have to go through 2200 pages of HTML code and manually parse it out. I'm more than willing to compensate whoever can help me with plenty of DVDs!!! Thanks in advance, and email me at matt@dvdauthority.com if you have any questions!

Replies are listed 'Best First'.
(jeffa) Re: Find and replace
by jeffa (Bishop) on Oct 18, 2001 at 19:27 UTC
    Yes, Perl is an excellent choice for this problem. You will need to look into one of the many HTML parsers available on CPAN, such as HTML::Parser or my personal fav, HTML::TokeParser.

    Once you find the right 'keys' to extract the target info, i assume you will want to store them somewhere, such as a database. You will need to look into the DBI modules on CPAN.

    If you are planning on using Perl to _convert_ each entire page into .asp - well, that's going to be a bit tougher. You are going to need some man hours any way you slice it. If the pages have a lot of commonality then the task will be easier, but you still need to plan this one out. My experience in the past with porting scripts and such is usually to issue a ton of clever Perl one-liner substitutions only to find that i still have to make some changes by hand.

    Good luck, and keep it legal ;)

    jeffa
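
jeffa's remark about substitution one-liners can be illustrated with a small sketch. It assumes the only change wanted is rewriting relative .htm/.html links to .asp; the helper name htm_links_to_asp and the sample markup are invented for illustration and do not come from the original site.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite relative .htm/.html link targets to .asp.
# Links that start with a scheme such as "http:" are left untouched.
sub htm_links_to_asp {
    my ($html) = @_;
    $html =~ s{(href\s*=\s*["']?)(?![a-z]+:)([^"'\s>#?]+)\.html?\b}{$1$2.asp}gi;
    return $html;
}

# Invented sample: one relative link and one absolute link.
my $sample = '<a href="reviews/matrix.htm">The Matrix</a> '
           . '<a href="http://example.com/page.htm">elsewhere</a>';

print htm_links_to_asp($sample), "\n";
```

The same substitution could be applied to files in place with perl's -p and -i switches, ideally on a backup copy first. As jeffa warns, a pattern like this will miss some cases, so expect a round of hand-fixing afterwards.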

Re: Find and replace
by alien_life_form (Pilgrim) on Oct 18, 2001 at 19:25 UTC
    Greetings.

    You are probably looking for either HTML::Parser or its sibling HTML::TreeBuilder.

    Cheers
    alf

Re: Find and replace
by Rich36 (Chaplain) on Oct 18, 2001 at 20:02 UTC
    What kind of solution are you looking at for interfacing with your data? As alf and jeffa suggested, HTML::Parser or HTML::TokeParser would probably be the way to go to actually get the data.

    If you're using some kind of database solution, consider outputting the parsed data into a file as columns (space/tab/character delimited fields) so that you can easily import the data into a table format.

    Rich36
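
Rich36's suggestion of writing the parsed data out as delimited columns might look like the sketch below. The column names (title, studio, reviewer), the helper review_to_tsv, and the sample record are assumptions made for illustration; the real fields would depend on what the parser extracts and what the target table expects.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one parsed review (a hash reference) into a
# single tab-delimited line. Tabs and line breaks inside a field are
# collapsed to one space so that each record stays on one line.
sub review_to_tsv {
    my ($review, @columns) = @_;
    my @fields = map {
        my $value = defined $review->{$_} ? $review->{$_} : '';
        $value =~ s/[\t\r\n]+/ /g;
        $value;
    } @columns;
    return join("\t", @fields) . "\n";
}

# Assumed column layout and an invented sample record.
my @columns = qw(title studio reviewer);
my %review  = (
    title    => 'Some Movie',
    studio   => 'Some Studio',
    reviewer => "A.\tReviewer",
);

print join("\t", @columns), "\n";          # header row
print review_to_tsv(\%review, @columns);   # one data row
```

Most database import tools accept a tab-delimited file of this shape, though long review text with embedded markup may be easier to load through DBI directly.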


Re: Find and replace
by Anonymous Monk on Oct 18, 2001 at 20:31 UTC

    I don't think you'll be needing HTML::Parser; your HTML code isn't really clean enough for that, and you don't want to deal with all the nasty font tags.

    A few patterns should get the job done much faster.

    I only tested with two reviews, here's what I got so far:

        # find the place where the features start:
        $_ = <IN> until m/face=arial color=white/;

        # now read the dvd features
        FEATURE: while(<IN>) {
            last FEATURE if m/blcorn.jpg/ or m/td/;
            m/src="(.).jpg">(&nbsp;)*([^<]+)/ and print "$3: $1\n";
        }

        while($_ = <IN>) {
            last if /<p>/;
        }
        $_ =~ s/.*<b>//;
        $_ =~ s/<.*//;
        print "Title: $_";

        $_ = <IN>;
        s/^\s*//;
        s/<.*//;
        print $_;    # that's the studio

        $_ = <IN>;
        s/^\s*//;
        $_ =~ s/Reviewed by: //;
        $_ =~ s|</font>||;
        $_ =~ s|</p>||;
        print $_;    # that's the reviewer

        # now for the text of the review:
        REVIEW: while ( $_ = <IN> ) {
            last REVIEW if m/<table/ or m/center/;
            $r .= $_;
        }

        # now some magic to clean up the font-soup:
        $r =~ s|<font[^>]*>||g;
        $r =~ s|</font>||g;

        # now some magic to turn <br> <br> into <p>,
        $r =~ s|<b>\s*<br>|<br><b>|gs;    # extra magic
        $r =~ s|<br>(\s*<br>)+|<p>|gs;

        # now we can recognize the headlines, and turn them into <h2>
        $r =~ s|<p>\s*<b>\s*([^<]+)\s*</b>\s*<br>|<h2>$1</h2>|gs;

        print $r;
        close IN;

    Aargh, data munging at its dirtiest. You should probably read davorg's book, Data Munging with Perl, and learn how to do this in a more organized, controllable way.

    P.S. if you were serious about giving away dvds:

    Brigitte Jellinek
    Horus IT GmbH
    Jakob Haringer Str. 8
    5020 Salzburg
    Austria
    EUROPE
    

      There's a section in my book (section 8.2) that explains in detail exactly why you shouldn't parse HTML with regular expressions.

      HTML::TreeBuilder (which, I think, would be the most useful module for this task) goes to great lengths to try to make sense of invalid HTML. It'll handle just about anything that you throw at it.

      --
      <http://www.dave.org.uk>

      "The first rule of Perl club is you don't talk about Perl club."
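
The point about HTML::TreeBuilder coping with untidy markup can be sketched as follows. This assumes HTML::TreeBuilder is installed from CPAN, and it assumes, following the pattern-based code earlier in the thread, that the title is the first bold text inside the first paragraph. The helper name first_bold_in_paragraph and the sample markup are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <b> element inside the
# first <p> element, or undef if there is none.
sub first_bold_in_paragraph {
    my ($html) = @_;
    require HTML::TreeBuilder;    # CPAN module, loaded on first use
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $p    = $tree->look_down(_tag => 'p');
    my $b    = $p ? $p->look_down(_tag => 'b') : undef;
    my $text = $b ? $b->as_text : undef;
    $tree->delete;                # free the parse tree
    return $text;
}

# Invented sample with unquoted attributes and unclosed tags, loosely
# modelled on the font-heavy pages described in the thread.
my $html = '<table><tr><td><p><font face=arial color=white>'
         . '<b>Some Movie</b><br>Some Studio';

my $title = eval { first_bold_in_paragraph($html) };
if (defined $title) {
    print "Title: $title\n";
}
else {
    print "No title found, or HTML::TreeBuilder is unavailable: $@\n";
}
```

Because the parser builds a tree, the lookup does not depend on how the tags are spread across lines, which is where line-by-line patterns tend to break.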

      First off, thanks for all the help and input, guys... guess my .asp comment wasn't that bad after all! Ha. I do have most of my stuff stored in a SQL database, so I have that base covered. It has links, author, studio and so on. I'll start off by saying that I have no clue how Perl works and I have no experience in it. I am serious about the DVD thing. If this works, it will save me three or four weeks of hard programming. Drop me a line and hopefully we can go somewhere from there. Thanks again... Matt (matt@dvdauthority.com)
        Greetings.

        If you really have no Perl clue/experience... you're in for three or four weeks of hard Perl learning, followed by three or four hours of not extremely hard programming. Of course, at the end of the day you'll end up as somebody who knows Perl, as opposed to somebody who knows how to cut and paste.... If you don't have that kind of time, a hired Perl gun may be your next best choice.

        As for the issue at hand, I still stand by HTML::TreeBuilder, and not only for aesthetic reasons. But regardless, you should know that, when your review-cleaning code is ready, you could top everything off by using DBI to stuff the reviews into your SQL database, directly from Perl.

        Cheers,
        alf