damian1301 has asked for the wisdom of the Perl Monks concerning the following question:

I have a short career in Perl compared to everyone else around here. In my time I have noticed that one of the most useful and convenient things is to grab stuff off web pages and have a Perl script print only the information you want, neatly and in an understandable form: weather, stocks, lottery, news, etc.

I know that the easiest way to get the whole page is with the get function from the LWP::Simple module.
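For instance, I understand the basic pattern looks something like this (the URL here is just a placeholder):

    use LWP::Simple;

    # get() returns the whole page as one string, or undef on failure
    my $html = get("http://www.example.com/")
        or die "Couldn't fetch the page\n";
    print $html;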

I have a problem, though. Whenever I try to write a script that fetches something off the web, it either returns undef or does nothing like I want it to. So now I ask you:
What suggestions do you have for me? I think this information could really widen the usage of Perl, because I haven't seen much written about it and it's so useful and varied.
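A sketch of one way to at least see why a fetch returns undef: the fuller LWP::UserAgent interface hands back a response object whose status line tells you what went wrong (again, the URL is only a placeholder):

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->request(HTTP::Request->new(GET => "http://www.example.com/"));

    if ($response->is_success) {
        print $response->content;    # the raw HTML
    } else {
        # status_line explains the failure, e.g. "404 Not Found"
        die "Fetch failed: ", $response->status_line, "\n";
    }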

Oh yes, please include little snippets to help me understand; I'm a little bullheaded at times. :-) Thank you all!

Wanna be perl hacker.
Dave AKA damian

I encourage you to email me

Replies are listed 'Best First'.
Re: Grabbing data from webpages
by Anonymous Monk on Jan 29, 2001 at 06:47 UTC
    When I wrote a script to grab data on books off Amazon.com, I found that modules like HTML::Parser were not really much use at all. The pages that Amazon.com generates for each book contain very complicated HTML that uses tables in a very advanced way. I found that HTML::Parser and plain old regexps were much too generalised for extracting any useful data from the monstrous HTML code.

    I eventually came up with the idea of using the command line web browser Lynx to parse the HTML from Amazon.com for me. If you call Lynx with the '-dump' option, Lynx will parse the HTML and provide you with a nicely formatted stream devoid of HTML tags. If you pipe this stream into your Perl script, regexps and conditional statements are all that's needed for you to extract the data you require.

    Although my Amazon.com script was quite long and contained a lot of if...elsif statements, I feel that it was a hell of a lot better than struggling with pure regexps and/or HTML::Parser.

    My code is along the lines of:

    $Amazon_URL = "http://www.amazon.com/exec/obidos/asin/";
    $ISBN = "";    ### Insert some code to get the ISBN
    $Amazon_URL .= $ISBN;
    open(FILEHANDLE, "lynx -dump $Amazon_URL|") or die("Can't get book data!");
    @book_data = <FILEHANDLE>;
    close(FILEHANDLE);

    Then, you can just feed the @book_data array into your own parser procedure.

    For the parser procedure, I looped through the @book_data array looking for what I called "markers" in the parsed HTML. These are just bits of text that occur near the data you are looking for. Using a marker, I extracted the data fields I wanted by using offsets from the line on which the marker occurred.
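
    A minimal sketch of that marker-and-offset idea (the marker text, the price pattern, and the two-line offset are invented for illustration; the real values depend entirely on the page layout):

        # @book_data holds the lines of `lynx -dump` output
        for my $i (0 .. $#book_data) {
            next unless $book_data[$i] =~ /Our Price:/;   # the "marker"
            my ($price) = $book_data[$i] =~ /\$([\d.]+)/;
            my $title = $book_data[$i - 2];               # assume the title sits 2 lines above
            chomp $title;
            print "Title: $title\nPrice: \$$price\n";
            last;
        }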

    In my opinion, for complicated HTML web pages, this method is a lot easier than using plain regexps or HTML::Parser. However, for simpler pages, HTML::Parser and/or regexps are all that's really needed.

    I hope this helps.

      I agree, lynx -dump makes life a lot easier (laziness being my personal favourite virtue!). On the other hand, I'm not doing anything complex with the page... anyway, here's a simple example...

      #!/usr/bin/perl -w
      use strict;
      use Mail::Mailer;

      my $recipient = "my email address";
      my $sender    = "trains\@myhost.co.uk";
      my $subject   = "trains";

      my $mailer = Mail::Mailer->new("sendmail");
      $mailer->open({ From           => $sender,
                      To             => $recipient,
                      'Content-Type' => "text/plain",
                      Subject        => $subject }) or die "can't open sendmail";

      open(LYNX, 'lynx -dump http://www.londontransport.co.uk/rt_home.shtml |')
          or die "can't run lynx: $!";
      while (<LYNX>) {
          s/\[.*?\]//g;
          print $mailer $_ if (/Real time news/ .. /References/);
      }
      close(LYNX);
      $mailer->close();
      (the regexp with the square brackets just gets rid of the image text, otherwise you get a certain amount of '[spacer.gif]')

      andy.

Re: Grabbing data from webpages
by Beatnik (Parson) on Jan 28, 2001 at 23:17 UTC
    WebFetch is a module on CPAN that comes with an extensive set of submodules for grabbing data from some well-known sites.

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
      Well, modules are a great thing in many cases, but I don't think a module knows what I want to grab off weatherchannel.com. I would rather do some more hardcoding than use a module, because I don't want to save anything to my system, which is what its documentation says it does. Instead I would rather have something like statswhore.pl: it grabs the information and prints a couple of stats in a nicely formatted fashion. Thanx for your help, but it was a little off what I was looking for :-)

      Wanna be perl hacker.
      Dave AKA damian

      I encourage you to email me

        I can't speak for Beatnik, but my impression is that the suggestion is to use WebFetch as a basis to fill your own needs. Download it, play with it, see what comes close to doing what you want, then copy it and modify it. It's a great way to learn, and it gives you the opportunity to give back to the project: chances are that if you find a certain application useful (e.g. Weather Channel, TV listings, whatever), someone else will too.

        To answer your original question, some combination of LWP and HTML::Parser (or one of its subclasses) will do what you want.
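
        For instance, a small sketch that fetches a page with LWP::Simple and pulls the title out with HTML::TokeParser, one of the HTML::Parser subclasses (the URL is a placeholder):

            use LWP::Simple;
            use HTML::TokeParser;

            my $html = get("http://www.example.com/")
                or die "Couldn't fetch the page\n";

            # walk the document token by token
            my $p = HTML::TokeParser->new(\$html);
            if ($p->get_tag("title")) {
                print "Page title: ", $p->get_trimmed_text("/title"), "\n";
            }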