sandal has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I would like to write a simple Perl-based text processor (URL extractor) that will extract some data from a web page and show it on a separate page.
I mean user input as follows:
I copy and paste a page URL into a special field on a CGI page and click a "Process" button. The script connects to the target page and extracts the required data. I want to get a list with a URL and a name against each link:

rtsp://294.173.72.132/storage/01BE012430_1000.wmv JanetJackson_Again
The only variable part of this link is 01BE012430; the rest is constant.
The variable data is inside the HTML code, contained within the following string:
ID (10 characters):
1) <input type="radio" id="rad01BE012430" name="rad01BE012430" checked> The full ID needs to be taken from name="rad01BE012430", with the rad prefix cut off.

or from

2) ..checked VALUE="01BE012430">

The artist and song title are located before the ID, inside code like this:
1. <font color="#333333">Melanie G</font></strong><font color="#333333"><br>Word Up</a></font></span> <BR> <td width="20%"><b>Artist</b></td> <td width="80%">Sandra</td> <br> <td width="20%"><b>Title</b></td> <td width="80%">Secret Land</td>,
or
2. <font color="#333333">Melanie G</font></strong><font color="#333333"><br>Word Up</a></font></span><br>
I need help making this script. I don't know, possibly JavaScript could handle this better, but that would require copying and pasting all the HTML code into a JavaScript page locally. Not so bad.
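
For illustration, here is a minimal untested sketch of the extraction step, assuming the page really contains fragments like those quoted above (the sample HTML, the ID pattern, and the regexes are guesses, not verified against the real page):

use strict;
use warnings;

# Hypothetical sample of the HTML described above
my $html = <<'HTML';
<td width="20%"><b>Artist</b></td> <td width="80%">Sandra</td>
<td width="20%"><b>Title</b></td> <td width="80%">Secret Land</td>
<input type="radio" id="rad01BE012430" name="rad01BE012430" checked>
HTML

# Take the ten-character ID from name="rad...", cutting the "rad" prefix
my ($id) = $html =~ /name="rad([0-9A-Z]{10})"/;

# Artist and title sit in the table cells after the labels
my ($artist) = $html =~ m{<b>Artist</b></td>\s*<td[^>]*>([^<]+)</td>};
my ($title)  = $html =~ m{<b>Title</b></td>\s*<td[^>]*>([^<]+)</td>};

# Splice the ID into the otherwise constant URL
if (defined $id and defined $artist and defined $title) {
    print "rtsp://294.173.72.132/storage/${id}_1000.wmv ${artist}_$title\n";
}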

Re: Links and data extractor script
by Popcorn Dave (Abbot) on Oct 25, 2005 at 19:39 UTC
    The first thing you want to use is LWP::Simple. If you're just trying to download a page and extract text then something like this should get you started:

    use strict;
    use LWP::Simple;
    use HTML::TokeParser;

    my $filename = 'page.html'; # or whatever you want to call your file
    my $results  = get('http://[whatever website you want to grab]');

    open FH, ">$filename" or die "Couldn't write $filename: $!";
    print FH $results;
    close FH;

    my $stream = HTML::TokeParser->new($filename)
        || die "Couldn't read HTML file $filename: $!";

    while (my $token = $stream->get_token) {
        # check for your information here
        # This is from a token parser I wrote to parse web pages.
        # Hopefully it will get you going
        #
        # The S token is a start tag, and the if block prints
        # the data associated with the tag.
        #
        # if ($token->[0] eq "S") {
        #     print "Token:S 1:$token->[1]\n";
        #     foreach my $key (keys %{$token->[2]}) {
        #         print "Key: $key Value: ${$token->[2]}{$key}\n";
        #     }
        #     print "3: @{$token->[3]}\n4: $token->[4]\n\n";
        # }
    }
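
    As an untested illustration of the "check for your information here" step, the loop body could watch for the <input> tags described in the question (the attribute layout is assumed from the posted fragments, not verified against the real page):

        if ($token->[0] eq 'S' and $token->[1] eq 'input') {
            my $name = $token->[2]{name} || '';   # attributes live in the hashref at [2]
            if ($name =~ /^rad(\w{10})$/) {
                print "Found ID: $1\n";           # e.g. 01BE012430
            }
        }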

    That should at least get you on the road to what you're after.

    Good luck!

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
      Some amplification:
      the required URL
      rtsp://294.173.72.132/storage/01BE012430_1000.wmv
      is not contained in the HTML source; it is a predefined URL which I already know. There is one variable part in it that simply needs to be replaced with the new value found in the HTML source.
      thanks a lot.
Re: Links and data extractor script
by planetscape (Chancellor) on Oct 25, 2005 at 19:33 UTC

    You might also wish to take a look at WWW::Mechanize. Some tools that can assist you in creating scripts for WWW::Mechanize (which has its own cookbook) include HTTP::Recorder and Ethereal (the latter being a network protocol analyzer that lets you see exactly what your browser is communicating to a site and vice versa).
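
    As an untested sketch of the basic WWW::Mechanize pattern (the URL is a placeholder):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get('http://example.com/playlist.html');  # hypothetical URL

        # Each link object carries its URL and link text
        for my $link ($mech->links) {
            print $link->url, "  ", ($link->text || ''), "\n";
        }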

    An article that will help you get started with HTTP::Recorder may be found here.

    HTH,

    planetscape
Re: Links and data extractor script
by Tanktalus (Canon) on Oct 25, 2005 at 19:30 UTC

    I recommend looking at HTML::LinkExtractor and using Super Search on that module to see if that does what you want it to. When you get the URL, you can use LWP::Simple to fetch it, then feed that into HTML::LinkExtractor and you should be able to get all the attributes and text that you need.
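
    As an untested sketch of that pipeline (the URL is a placeholder, and the hash keys follow HTML::LinkExtractor's documented output as I recall):

        use strict;
        use warnings;
        use LWP::Simple;
        use HTML::LinkExtractor;

        my $html = get('http://example.com/page.html')   # hypothetical URL
            or die "Couldn't fetch page\n";

        my $lx = HTML::LinkExtractor->new();
        $lx->parse(\$html);

        # links() returns an array ref of hashes, one per link-bearing tag
        for my $link (@{ $lx->links }) {
            next unless $link->{href};                   # skip tags without an href
            print "$link->{href}  ", ($link->{_TEXT} || ''), "\n";
        }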

    If HTML::LinkExtractor doesn't quite do what you need, I'd move on to HTML::Parser if the HTML is not well-formed XML, or XML::Twig if it is. These should have the flexibility to do what you want, at the expense of a bit more code to write.
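
    If you do end up at HTML::Parser, here is an untested sketch of a start-tag handler watching for the radio buttons from the question (the attribute layout is assumed from the posted fragments):

        use strict;
        use warnings;
        use HTML::Parser;

        # Call start_tag() for every opening tag, passing its name and attributes
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ \&start_tag, 'tagname,attr' ],
        );

        sub start_tag {
            my ($tag, $attr) = @_;
            return unless $tag eq 'input';
            if (($attr->{name} || '') =~ /^rad(\w{10})$/) {
                print "Found ID: $1\n";                  # e.g. 01BE012430
            }
        }

        $p->parse_file('page.html');   # hypothetical saved copy of the page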