heezy has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

I want to parse a tree structure of webpages (all with a common-ish formatting) and add key pieces of data within the pages to a database.

I am really overwhelmed by the number of HTML modules available and wondered if anyone has any comments on what to avoid, best practices, major pitfalls, time-saving Perl-ish plans, etc.

I want to glean three pieces of info from each webpage.

It appears that all the pages follow the same kind of formatting:

<TABLE WIDTH="100%" BORDER="0" CELLSPACING="1" CELLPADDING="3">
<!-- CATEGORY -->
<TR><TD CLASS="dkblue" COLSPAN="3"><A NAME="Sun Ultra 60"></A><BIG> <B>Sun Ultra 60 Documentation</B></BIG></TD></TR>
<TR VALIGN="TOP" CLASS="white">
<TD>804-5884-10</TD>
<TD WIDTH="90%"><B>Sun Ultra 60 Hardware AnswerBook Installation</B></TD>
<TD><A HREF="/products-n-solutions/hardware/docs/pdf/804-5884-10.pdf" TARGET="results">pdf</A> (42KB)</TD></TR>
<TR VALIGN="TOP" CLASS="lttan">
<TD>804-5886-10</TD>
<TD><B>Installing the Sun Ultra 60 ShowMe How Multimedia Documentation</B></TD>
<TD><A HREF="/products-n-solutions/hardware/docs/pdf/804-5886-10.pdf" TARGET="results">pdf</A> (62KB)</TD></TR>
<TR VALIGN="TOP" CLASS="white">
<TD>805-1709-12</TD>
<TD><B>Sun Ultra 60 Service Manual</B></TD>
<TD><A HREF="/products-n-solutions/hardware/docs/pdf/805-1709-12.pdf" TARGET="results">pdf</A> (6.5MB)</TD></TR>
<TR VALIGN="TOP" CLASS="lttan">
<TD>805-1762-11</TD>
<TD><B>Sun Ultra 60 Reference Manual</B></TD>
<TD><A HREF="/products-n-solutions/hardware/docs/pdf/805-1762-11.pdf" TARGET="results">pdf</A> (344KB)</TD></TR>
</TABLE>

...but obviously there is a lot of other formatting on the page getting in my way.

Ideas on any really useful modules?

Any suggestions or tips would be helpful

Thanks monks,

m

Re: Hints & Tips on passing HTML?
by Ryszard (Priest) on Feb 28, 2003 at 08:00 UTC
Re: Hints & Tips on passing HTML?
by grantm (Parson) on Feb 28, 2003 at 08:38 UTC

    Here's a suggestion from left field: use XML::LibXML to read the HTML files (libxml supports reading HTML which is not well-formed XML) and use XPath expressions to locate the data items you're after.

      To add a pointer, this (using XML::LibXML) is exactly what IlyaM did very recently here.

      -- Hofmator
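
A minimal sketch of the XML::LibXML approach suggested above, run against a fragment of the table from the original post. The function name extract_docs and the hash keys are made up for illustration, and the XPath assumes that every data row has a link in its third cell, which may not hold for every page on the site.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns a list of hashrefs { number, title, href }, one per document row.
# (extract_docs is a hypothetical helper name, not part of XML::LibXML.)
sub extract_docs {
    my ($html) = @_;

    # load_html uses libxml2's HTML parser; recover => 2 asks it to
    # repair broken markup without reporting every problem it finds.
    my $doc = XML::LibXML->load_html(string => $html, recover => 2);

    my @docs;
    # Assumption: data rows are the <tr>s whose third cell holds a link.
    # The heading row has a single cell, so it is skipped.
    for my $row ($doc->findnodes('//tr[td[3]/a[@href]]')) {
        push @docs, {
            number => $row->findvalue('normalize-space(td[1])'),
            title  => $row->findvalue('normalize-space(td[2])'),
            href   => $row->findvalue('td[3]/a/@href'),
        };
    }
    return @docs;
}

# Sample input: the heading row and one data row from the post above.
my $sample = <<'HTML';
<TABLE WIDTH="100%" BORDER="0" CELLSPACING="1" CELLPADDING="3">
<TR><TD CLASS="dkblue" COLSPAN="3"><A NAME="Sun Ultra 60"></A><BIG> <B>Sun Ultra 60 Documentation</B></BIG></TD></TR>
<TR VALIGN="TOP" CLASS="white">
<TD>805-1709-12</TD>
<TD><B>Sun Ultra 60 Service Manual</B></TD>
<TD><A HREF="/products-n-solutions/hardware/docs/pdf/805-1709-12.pdf" TARGET="results">pdf</A> (6.5MB)</TD></TR>
</TABLE>
HTML

for my $d (extract_docs($sample)) {
    print join(' | ', @{$d}{qw(number title href)}), "\n";
}
```

The HTML parser lowercases element and attribute names, which is why the XPath uses tr, td and @href even though the page is written in upper case.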

Re: (nrd) Hints & Tips on passing HTML?
by newrisedesigns (Curate) on Feb 28, 2003 at 17:47 UTC

    If you are extracting information from pages with a similar format: HTML::TokeParser. Fetching the pages from a different server? Use LWP.

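    TokeParser easily extracts the text from the table above:

    # Just to give you an idea about extracting text; a sketch, not a complete program
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $file   = shift @ARGV or die "usage: $0 page.html\n";
    my $stream = HTML::TokeParser->new($file) or die "Cannot open $file: $!";
    my (@headers, @links);    # declared outside the loop so the results survive it

    while (my $token = $stream->get_token()) {
        # Only start tags ('S') named td are interesting
        next unless $token->[0] eq 'S' && $token->[1] eq 'td';

        # Peek at the two tokens that follow the <td>
        my @tokens = grep { defined } map { $stream->get_token() } 1 .. 2;
        next unless @tokens;

        if ($tokens[0][0] eq 'S') {
            # <td><b>Title</b>: the token after <b> is the text ('T') token
            push @headers, $tokens[1][1]
                if $tokens[0][1] eq 'b' && @tokens == 2 && $tokens[1][0] eq 'T';
            # <td><a href="...">: the attribute hash is the third element
            push @links, $tokens[0][2]{href}
                if $tokens[0][1] eq 'a' && defined $tokens[0][2]{href};
        }

        # Put the peeked tokens back so the main loop sees them again
        $stream->unget_token(@tokens);
    }

    print "$_\n" for @headers, @links;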

    When redisplaying that information, push it into a nice template using HTML::Template. Make it dynamic using CGI and the CGI module.

    John J Reiser
    newrisedesigns.com
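
For the redisplay step mentioned above, here is a minimal HTML::Template sketch. The function name render_docs, the template text, and the NUMBER/TITLE/HREF parameter names are all invented for illustration; the sample row is taken from the table in the original post.

```perl
use strict;
use warnings;
use HTML::Template;

# Renders a list of { NUMBER, TITLE, HREF } hashrefs as an HTML list.
# (render_docs is a hypothetical helper name.)
sub render_docs {
    my (@docs) = @_;

    # The template is kept inline here; it could equally live in a file.
    my $layout = <<'TMPL';
<ul>
<TMPL_LOOP NAME="DOCS"><li><a href="<TMPL_VAR NAME="HREF" ESCAPE="HTML">"><TMPL_VAR NAME="NUMBER" ESCAPE="HTML"></a>: <TMPL_VAR NAME="TITLE" ESCAPE="HTML"></li>
</TMPL_LOOP></ul>
TMPL

    my $template = HTML::Template->new(scalarref => \$layout);

    # A TMPL_LOOP is filled from a reference to an array of hashrefs.
    $template->param(DOCS => \@docs);
    return $template->output;
}

print render_docs({
    NUMBER => '805-1709-12',
    TITLE  => 'Sun Ultra 60 Service Manual',
    HREF   => '/products-n-solutions/hardware/docs/pdf/805-1709-12.pdf',
});
```

ESCAPE="HTML" keeps any stray markup in the scraped titles from leaking into the output page.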

Re: Hints & Tips on passing HTML?
by Fletch (Bishop) on Feb 28, 2003 at 17:30 UTC

    Get Perl and LWP (ISBN 0596001789). It covers HTML scraping in detail and is well worth the $25-ish price.

Re: Hints & Tips on passing HTML?
by revdiablo (Prior) on Feb 28, 2003 at 23:30 UTC

    If you're having problems getting to the page in question, you might also want to check out WWW::Mechanize instead of plain old LWP. WWW::Mechanize makes it easy to navigate through relatively complex web pages with forms and other stuff that's a bit of a pain to do by hand.
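
A rough sketch of how WWW::Mechanize could collect the PDF links from one page. The function name pdf_links is made up, and the start URL is whatever you pass on the command line; nothing here is specific to the site in the original post.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Returns the absolute URLs of every link on the page at $url that ends in .pdf.
# (pdf_links is a hypothetical helper name.)
sub pdf_links {
    my ($url) = @_;

    # autocheck => 1 makes a failed request die instead of failing silently
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);

    # find_all_links returns WWW::Mechanize::Link objects;
    # url_abs resolves each href against the page's base URL.
    return map { $_->url_abs->as_string }
           $mech->find_all_links(tag => 'a', url_regex => qr/\.pdf$/i);
}

# Usage: perl pdf_links.pl <url-of-index-page>
if (@ARGV) {
    print "$_\n" for pdf_links($ARGV[0]);
}
```

The same object also has follow_link, which is handy when the pages you want sit a few clicks below the index page.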

Re: Hints & Tips on passing HTML?
by heezy (Monk) on Feb 28, 2003 at 23:54 UTC

    Thanks to everyone who has replied; there are some really good things to pursue here!