PerlMonks  

Screen scraping

by stumped (Novice)
on Dec 28, 2015 at 13:49 UTC

stumped has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to scrape some info from a web site. I can navigate to the page using mechanize, but I am struggling to get the text from the page. It is contained in div tags, e.g.:

 <div id="foo">bar: 19.00</div>

What I want to do is extract the text "bar: 19.00". What I am looking for is a way to point to the div with id 'foo' and then extract its contents up to the next closing div tag.

I would be grateful for any pointers.

Thanks

Replies are listed 'Best First'.
Re: Screen scraping
by Corion (Patriarch) on Dec 28, 2015 at 14:48 UTC
Re: Screen scraping
by ambrus (Abbot) on Dec 28, 2015 at 17:03 UTC

    See "Re: How to extract text present in 3 lines within the HTML tags" for an example of how to use the XML::Twig module to parse HTML input. This isn't the only Perl module you could use, but let's stick to it for now. That answer also shows how to extract the text inside an HTML element once you've found the element.

    Now you just need one more piece to solve your problem. You have to find the right element, which isn't just any div element, but the div element with the particular id attribute. For that, look at the documentation of the XML::Twig module for a method of XML::Twig that returns an element of a particular id. If you can't find it, look at the hint under the fold.
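The lookup ambrus describes can be sketched as follows. This is a minimal illustration, not necessarily the hidden hint: it assumes the XML::Twig method `elt_id`, which returns the element carrying a given id, and the helper name `text_by_id` is made up for this example. The sample markup is well-formed, so plain `parse` is enough here; for real-world HTML, XML::Twig's `parse_html` (which needs HTML::TreeBuilder installed) would likely be the substitute.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the text of the element whose id
# attribute equals $id, or undef if no such element exists.
sub text_by_id {
    my ($markup, $id) = @_;
    my $twig = XML::Twig->new;

    # parse() expects well-formed XML. For messy real-world HTML,
    # parse_html() is the usual alternative (requires HTML::TreeBuilder).
    $twig->parse($markup);

    my $elt = $twig->elt_id($id);
    return defined $elt ? $elt->text : undef;
}

my $page = '<html><body><div id="foo">bar: 19.00</div></body></html>';
print text_by_id($page, 'foo'), "\n";
```

With WWW::Mechanize, the markup would presumably come from `$mech->content` rather than a literal string.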

Re: Screen scraping (web scraping!)
by Discipulus (Canon) on Dec 28, 2015 at 19:54 UTC
    On this subject, I've bookmarked the thread The State of Web spidering in Perl.

    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Screen scraping
by Anonymous Monk on Dec 28, 2015 at 15:40 UTC

    You can try with a simple RegExp like

    my ($foo) = $html =~ /id="foo">([^<]+)</;
    print "$foo\n";
      And then you have something a little bit more complicated than the simple example and your solution breaks.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
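CountZero's warning about the regex approach can be made concrete with a small sketch. It wraps the pattern suggested by Anonymous Monk in a sub (the name `regex_extract` and the sample inputs are invented for illustration) and runs it against a few plausible variations of the markup.

```perl
use strict;
use warnings;

# The pattern from the Anonymous Monk reply, wrapped in a sub.
# Returns the captured text, or undef when the pattern does not match.
sub regex_extract {
    my ($html) = @_;
    my ($text) = $html =~ /id="foo">([^<]+)</;
    return $text;
}

# Hypothetical variations a real page might contain.
my @samples = (
    [ 'original example' => '<div id="foo">bar: 19.00</div>' ],
    [ 'single quotes'    => q{<div id='foo'>bar: 19.00</div>} ],
    [ 'extra attribute'  => '<div id="foo" class="price">bar: 19.00</div>' ],
    [ 'nested markup'    => '<div id="foo"><b>bar:</b> 19.00</div>' ],
);

for my $sample (@samples) {
    my ($label, $html) = @$sample;
    my $got = regex_extract($html);
    printf "%-18s %s\n", $label, defined $got ? "'$got'" : 'no match';
}
```

Only the first sample matches. The other three are all valid ways to write the same div, which is why the other replies point toward a real parser.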
