in reply to Grabbing data from webpages

When I wrote a script to grab data on books off Amazon.com, I found that modules like HTML::Parser weren't really much use at all. The pages Amazon.com generates for each book contain very complicated HTML that uses tables in a very advanced way, and I found that HTML::Parser and plain old regexps were much too generalised for extracting any useful data from the monstrous HTML code.

I eventually came up with the idea of using the command-line web browser Lynx to parse the HTML from Amazon.com for me. If you call Lynx with the '-dump' option, it parses the HTML and gives you a nicely formatted text stream devoid of HTML tags. If you pipe this stream into your Perl script, regexps and conditional statements are all you need to extract the data you require.

Although my Amazon.com script was quite long and contained a lot of if...elsif statements, I feel that it was a hell of a lot better than struggling with pure regexps and/or HTML::Parser.

My code is along the lines of:

$Amazon_URL = "http://www.amazon.com/exec/obidos/asin/";
$ISBN = "";                    ### Insert some code to get the ISBN
$Amazon_URL .= $ISBN;

# pipe the rendered page from lynx into the script
open(FILEHANDLE, "lynx -dump $Amazon_URL|")
    or die("Can't get book data!");
@book_data = <FILEHANDLE>;
close(FILEHANDLE);

Then, you can just feed the @book_data array into your own parser procedure.

For the parser procedure, I looped through the @book_data array looking for what I called "markers" in the parsed HTML. These are just bits of text that occur near the data you are looking for. Using a marker, I extracted the data fields I wanted by taking offsets from the line on which the marker occurred.
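
To make this concrete, here's a minimal sketch of such a parser procedure. The markers ("List Price:" and the "by ..." author line) and the one-line offset are made up for illustration; you'd pick the real ones by inspecting the lynx -dump output for the pages you're scraping.

sub parse_book_data {
    my @lines = @_;
    my %book;
    for my $i (0 .. $#lines) {
        # Marker: the price line contains the text "List Price:"
        if ($lines[$i] =~ /List Price:\s*\$([\d.]+)/) {
            $book{price} = $1;
        }
        # Marker: the author line starts with "by"; the title is
        # fetched by offset -- it sits one line above the marker
        elsif ($i > 0 && $lines[$i] =~ /^\s*by\s+(.+?)\s*$/) {
            $book{author} = $1;
            $book{title}  = $lines[$i - 1];
            chomp $book{title};
        }
    }
    return %book;
}

my %book = parse_book_data(@book_data);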

In my opinion, for complicated HTML web pages, this method is a lot easier than using plain regexps or HTML::Parser. However, for simpler pages, HTML::Parser and/or regexps are all that's really needed.

I hope this helps.

Re: Re: Grabbing data from webpages
by andye (Curate) on Jan 29, 2001 at 15:45 UTC
    I agree, Lynx -dump makes life a lot easier (laziness being my personal favourite virtue!). On the other hand, I'm not doing anything complex with the page... anyway, here's a simple example...

    #!/usr/bin/perl -w
    use strict;
    use Mail::Mailer;

    my $recipient = "my email address";
    my $sender    = "trains\@myhost.co.uk";
    my $subject   = "trains";

    # open a sendmail pipe with the message headers
    my $mailer = Mail::Mailer->new("sendmail");
    $mailer->open({ From           => $sender,
                    To             => $recipient,
                    'Content-Type' => "text/plain",
                    Subject        => $subject })
        or die "can't open sendmail";

    # pipe the rendered page from lynx into the script
    open(LYNX, 'lynx -dump http://www.londontransport.co.uk/rt_home.shtml |')
        or die "can't run lynx";
    while (<LYNX>) {
        s/(\[.*\])//g;    # strip bracketed image text like '[spacer.gif]'
        # print only the section between the two markers
        print $mailer $_ if (/Real time news/ .. /References/);
    }
    close(LYNX);
    $mailer->close();
    (the regexp with the square brackets just gets rid of the image text, otherwise you get a certain amount of '[spacer.gif]')

    andy.