When I wrote a script to grab book data from Amazon.com, I found that modules like HTML::Parser were not really much use at all. The pages that Amazon.com generates for each book contain very complicated HTML that uses tables in a very advanced way. I found that HTML::Parser and plain old regexps were much too generalised for extracting any useful data from the monstrous HTML code.

I eventually came up with the idea of using the command line web browser Lynx to parse the HTML from Amazon.com for me. If you call Lynx with the '-dump' option, Lynx will parse the HTML and provide you with a nicely formatted stream devoid of HTML tags. If you pipe this stream into your Perl script, regexps and conditional statements are all that's needed for you to extract the data you require.

Although my Amazon.com script was quite long and contained a lot of if...elsif statements, I feel that it was a hell of a lot better than struggling with pure regexps and/or HTML::Parser.

My code is along the lines of:

$Amazon_URL = "http://www.amazon.com/exec/obidos/asin/";
$ISBN = "";
### Insert some code to get the ISBN
$Amazon_URL .= $ISBN;
# The trailing pipe tells open() to run lynx and read its output
open(FILEHANDLE, "lynx -dump $Amazon_URL|") or die ("Can't get book data!");
@book_data = <FILEHANDLE>;
close(FILEHANDLE);

Then, you can just feed the @book_data array into your own parser procedure.
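
A variation on the snippet above, offered only as a minimal sketch: the list form of a piped open hands each argument straight to lynx, so the shell never gets a chance to interpret odd characters in the URL. The helper name fetch_dump and the lexical filehandle are illustrative assumptions, not part of the original script. It assumes Perl 5.8 or later and that lynx is on your PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its output lines.
# The list form of open() bypasses the shell entirely.
sub fetch_dump {
    my @command = @_;
    open(my $fh, '-|', @command)
        or die "Can't run $command[0]: $!";
    my @lines = <$fh>;
    close($fh) or warn "$command[0] exited abnormally\n";
    return @lines;
}

# Usage (needs lynx and network access, so it is left commented out):
# my $ISBN = "";    # insert some code to get the ISBN
# my @book_data = fetch_dump('lynx', '-dump',
#     "http://www.amazon.com/exec/obidos/asin/$ISBN");
```

The returned list plays the same role as @book_data in the snippet above.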

For the parser procedure, I looped through the @book_data array looking for what I called "markers" in the parsed HTML. These are just bits of text that occur near the data you are looking for. Using a marker, I extracted the data fields I wanted by using offsets from the line on which the marker occurred.
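
The marker-and-offset idea described above can be sketched roughly as follows. This is not the original script: the sub name extract_fields, the marker strings and the offsets are all made up for illustration, and real ones would have to come from studying what lynx -dump actually prints for the page in question.

```perl
use strict;
use warnings;

# Hypothetical marker table: marker text => how many lines below the
# marker the wanted value sits (0 means "on the marker's own line").
my %markers = (
    'Product Details' => 1,
    'List Price:'     => 0,
);

# Scan the dumped lines and collect one field per marker
# (the first occurrence of each marker wins).
sub extract_fields {
    my ($lines, $marker_offsets) = @_;
    my %fields;
    for my $i (0 .. $#{$lines}) {
        for my $marker (keys %$marker_offsets) {
            next if exists $fields{$marker};
            my $pos = index($lines->[$i], $marker);
            next if $pos < 0;                      # marker not on this line
            my $offset = $marker_offsets->{$marker};
            my $target = $i + $offset;
            next if $target < 0 || $target > $#{$lines};
            my $value = $lines->[$target];
            # Same-line field: keep only the text after the marker
            $value = substr($value, $pos + length $marker) if $offset == 0;
            $value =~ s/^\s+|\s+$//g;              # trim whitespace
            $fields{$marker} = $value;
        }
    }
    return %fields;
}

# Usage: my %book = extract_fields(\@book_data, \%markers);
```

Keeping the markers in a hash means a change in the page layout only needs a new offset, not another branch in a long if...elsif chain.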

In my opinion, for complicated HTML web pages, this method is a lot easier than using plain regexps or HTML::Parser. However, for simpler pages, HTML::Parser and/or regexps are all that's really needed.

I hope this helps.


In reply to Re: Grabbing data from webpages by Anonymous Monk
in thread Grabbing data from webpages by damian1301
