in reply to Grabbing data from webpages
I eventually came up with the idea of using the command-line web browser Lynx to parse the HTML from Amazon.com for me. If you call Lynx with the '-dump' option, it parses the HTML and gives you a nicely formatted text stream devoid of HTML tags. If you pipe this stream into your Perl script, regexps and conditional statements are all you need to extract the data you require.
Although my Amazon.com script was quite long and contained a lot of if...elsif statements, I feel that it was a hell of a lot better than struggling with pure regexps and/or HTML::Parser.
My code is along the lines of:
    $Amazon_URL = "http://www.amazon.com/exec/obidos/asin/";
    $ISBN = "";    ### Insert some code to get the ISBN
    $Amazon_URL .= $ISBN;
    open(FILEHANDLE, "lynx -dump $Amazon_URL|") or die("Can't get book data!");
    @book_data = <FILEHANDLE>;
    close(FILEHANDLE);
Then, you can just feed the @book_data array into your own parser procedure.
For the parser procedure, I looped through the @book_data array looking for what I called "markers" in the parsed HTML. These are just bits of text that occur near the data you're looking for. Having found a marker, I extracted the data fields I wanted by using offsets from the line on which the marker occurred.
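As a rough sketch (the "List Price:" and "Amazon.com:" markers and their offsets below are made up for illustration, not taken from my actual script), the parser procedure looks something like this:

    sub parse_book_data {
        my @book_data = @_;
        my %book;

        for my $i (0 .. $#book_data) {
            # Hypothetical marker: the price sits on the same line
            # as the "List Price:" text, so the offset is zero.
            if ($book_data[$i] =~ /List Price:/) {
                ($book{price}) = $book_data[$i] =~ /\$([\d.]+)/;
            }
            # Hypothetical marker: assume the title sits two lines
            # below the "Amazon.com:" line, i.e. an offset of +2.
            elsif ($book_data[$i] =~ /^\s*Amazon\.com:/) {
                $book{title} = $book_data[$i + 2] || '';
                chomp $book{title};
            }
        }
        return %book;
    }

Then you call it with something like my %book = parse_book_data(@book_data); and pull out whichever fields you matched.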
In my opinion, for complicated HTML web pages, this method is a lot easier than using plain regexps or HTML::Parser. However, for simpler pages, HTML::Parser and/or regexps are all that's really needed.
I hope this helps.