http://qs1969.pair.com?node_id=349974

Agyeya has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I am new to Perl. I want to write a Perl program which can scrape information from a webpage (say, a table showing stock or share prices) and add it into proper tables (using MySQL). I want to do this every two minutes. Also, if there is any change in the tables, I want to record the changes in another table. I am using Red Hat 9.2 as my operating system. Thanks in advance

Replies are listed 'Best First'.
Re: scraping from HTTP page to MySql table
by b10m (Vicar) on May 03, 2004 at 10:43 UTC

    Hello and welcome to Perlmonks,

    This can certainly be done, but unfortunately, Perlmonks is not a place where people will build scripts for you. We can help you out with problems, but you'll need to actually code most of it yourself.

    To start, you probably want to learn more about Perl, and this is a good place for that. Besides that, books are always good to have around, and many monks would advise Learning Perl and Programming Perl (both O'Reilly books).

    After that, you might want to look at these modules:

    --
    b10m

    All code is usually tested, but rarely trusted.
      The page I want to scrape information from comes up in a JavaScript popup window. Now how do I link to this window?
Re: scraping from HTTP page to MySql table
by matija (Priest) on May 03, 2004 at 10:41 UTC
    Learn about CPAN.

    To fetch the web page, you could use LWP::Simple or LWP::UserAgent. To parse the page and extract the data, you might be able to use HTML::TableExtract or HTML::Parser.
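    For example, a minimal sketch along those lines (the URL and the column headers here are made up; adjust them to the real page) could look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::TableExtract;

        my $url  = 'http://example.com/prices.html';   # placeholder URL
        my $html = get($url) or die "Could not fetch $url\n";

        # Grab the table whose header row contains these columns
        my $te = HTML::TableExtract->new( headers => [ 'Symbol', 'Price' ] );
        $te->parse($html);

        for my $table ( $te->tables ) {
            for my $row ( $table->rows ) {
                my ( $symbol, $price ) = @$row;
                print "$symbol => $price\n";
            }
        }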

    Once you have the data you need, you can save it to a MySQL database using either Class::DBI (if you are Object Oriented) or DBD::mysql, if you like to live closer to the bare metal (both use DBI). You have enough material now, I think. Start writing the script, and if you have problems, ask well-thought-out questions and we'll help you solve them.
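    Once the rows are extracted, inserting them with DBI/DBD::mysql is short as well. This is only a sketch; the DSN, credentials, and table layout are assumptions:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect(
            'DBI:mysql:database=stocks;host=localhost',   # assumed DSN
            'user', 'password',
            { RaiseError => 1, AutoCommit => 1 },
        );

        my $insert = $dbh->prepare(
            'INSERT INTO prices (symbol, price, fetched_at) VALUES (?, ?, NOW())'
        );

        # @rows would come from the HTML parsing step above
        my @rows = ( [ 'FOO', 12.34 ], [ 'BAR', 56.78 ] );
        $insert->execute(@$_) for @rows;

        $dbh->disconnect;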

Re: scraping from HTTP page to MySql table
by z3d (Scribe) on May 03, 2004 at 12:55 UTC
    Like the posts before me, I won't offer code, only recommendations and insight. I would start by warning you - unless you run the website you are scraping, or have an existing relationship with the owners, you may want to think twice about a direct scraping every two minutes. Not everyone appreciates having their website hit repeatedly and consistently to scrape data.

    In addition to the modules already mentioned, I'd also recommend reading through past articles. I know that both perl.com and TPJ have run articles about exactly this; perl.com's was in the last few months (so it might still be found on their front page, not sure).



    "I have never written bad code. There are merely unanticipated features."
Re: scraping from HTTP page to MySql table
by Ryszard (Priest) on May 03, 2004 at 14:13 UTC
    You know, it's more complicated (and more powerful, IMO), but I've just learned HTML::TokeParser, which is now my preferred HTML parser.

    If you want some example code, check out jeffa's excellent IMDB::Movie.
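    For the flavour of it, here is a bare-bones sketch (the URL is a placeholder) that walks a page and prints the text of every table cell:

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::TokeParser;

        my $html = get('http://example.com/prices.html')
            or die "fetch failed\n";
        my $p = HTML::TokeParser->new( \$html );

        # Step through the document one <td> at a time
        while ( my $tag = $p->get_tag('td') ) {
            my $text = $p->get_trimmed_text('/td');
            print "$text\n";
        }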

Re: scraping from HTTP page to MySql table
by chanio (Priest) on May 03, 2004 at 18:21 UTC
    In order to know when to re-check the site for changes, you'd do better to ask its webmaster at what hours she updates the site. You could even suggest that she publish the changes to a newsfeed site (SourceForge has this) like syndic8*.

    Then, to be notified of those changes, you would fetch an XML file (RSS or RDF) that specifies which articles have changed, or simply tells you that you should re-check the site.

    There are also Perl modules to extract the RSS info from those files and even download them at a specified frequency:

    see RSS at CPAN**.

    (*) http://www.syndic8.com/

    (**) http://search.cpan.org/search?mode=dist&query=RSS
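    As a rough sketch, reading such a feed with XML::RSS (the feed URL is a placeholder) takes only a few lines:

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use XML::RSS;

        my $feed = get('http://example.com/changes.rss')
            or die "fetch failed\n";

        my $rss = XML::RSS->new;
        $rss->parse($feed);

        # Each item tells you what changed and where
        for my $item ( @{ $rss->{items} } ) {
            print "$item->{title}\t$item->{link}\n";
        }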

    {\('v')/}
    _`(___)' __________________________
      Hi

      The site that I wish to monitor is a dynamic site. It may have details that are subject to random change. E.g. consider the seat status on a train or bus, or even the appointment list of a doctor. Now, on the site the list will be in the form of an Excel-style table, having the fields Patient ID, Appointment Type, Appointment Date, and Appointment Time.

      Now suppose that a patient wants an appointment. Instead of putting him at the end of the queue, we can check the appointment list for any random cancellations and put the patient in that slot (this is just an example, as obviously the next patient in the queue should be advanced). But considering how people have divided their own time into slots, the free time of the patient should match that of the vacancy in the appointment list.
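      Just as a rough sketch of that matching step (the database, table, and column names here are all hypothetical), the check for a cancellation that fits the patient's free time could look something like:

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'DBI:mysql:database=clinic',   # assumed DSN
                                  'user', 'password', { RaiseError => 1 } );

          # @free lists the times the patient is available, e.g. '2004-05-04 10:00'
          sub find_free_slot {
              my @free = @_;
              my %free = map { $_ => 1 } @free;

              my $sth = $dbh->prepare(
                  'SELECT appointment_date, appointment_time
                     FROM appointments
                    WHERE status = ?
                    ORDER BY appointment_date, appointment_time'
              );
              $sth->execute('cancelled');

              while ( my ( $date, $time ) = $sth->fetchrow_array ) {
                  return "$date $time" if $free{"$date $time"};
              }
              return;    # no matching vacancy
          }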