madM has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I'm just learning how to scrape web pages using WWW::Mechanize and I have a question. For example, the page https://openbook.etoro.com/dellos/stats/ has text inside "div" tags, but when I fetch the page and try to print all the text, I only get one small sentence and nothing else. Does anybody know how I could print all the text that is, in this case, on this web page?
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( "https://openbook.etoro.com/dellos/stats/" );
print $mech->content( format => 'text' );

Replies are listed 'Best First'.
Re: Scraping Webpage (javascript)
by Anonymous Monk on Nov 19, 2013 at 02:52 UTC
Re: Scraping Webpage
by taint (Chaplain) on Nov 19, 2013 at 03:24 UTC
    Greetings, madM.

    I think LWP is probably better suited to the type of work you're looking for. :)

    Best wishes.

    --Chris

    #!/usr/bin/perl -Tw
    use Perl::Always or die;
    my $perl_version = (5.12.5);
    print $perl_version;

      I think LWP is probably better suited to the type of work you're looking for. :)

      WWW::Mechanize is built on top of LWP, to save you the work of learning how to build WWW::Mechanize out of LWP yourself, because 99 out of 100 newbies who think "I need to scrape this" don't know anything about HTTP and can't wrap their minds around LWP ("I need to save a request/response object? Whaaat?").

        Greetings.

        In my humble defense:
        I once wrote an entire web page that would issue HEAD, and every other request available in the HTTP 1.0 / 1.1 spec, including downloading the entire page. That included sanitizing input, creating the form fields, and adding graphics and CSS. I completed the entire page in under 5 minutes, and I chose LWP, and only LWP. Why? Because, in spite of your assertion, WWW::Mechanize adds complexity and overhead in this scenario. The OP's request is a bone-headed, dead-simple one; it is exactly what LWP was made for.

        In fact, completing the OP's request would have required only one additional module: HTML::Restrict (and there are others). The module I listed will strip the HTML tags of your choice, leaving the OP with an easily controlled and formatted document to display however he wishes.
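        A minimal sketch of that approach (assuming the text you want is present in the static HTML; HTML::Restrict with no rules strips all tags):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::Restrict;

        # Fetch the raw HTML with plain LWP.
        my $html = get('https://openbook.etoro.com/dellos/stats/')
            or die "Could not fetch the page\n";

        # Strip every HTML tag, leaving only the text content.
        my $hr   = HTML::Restrict->new;    # no allowed-tag rules => remove all tags
        my $text = $hr->process($html);

        print $text;

        Note that for this particular URL most of the interesting text is filled in by JavaScript, so this will still only print what is in the static HTML, which is the small sentence the OP saw.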

        I hope this provides some insight for the OP.

        --Chris

        #!/usr/bin/perl -Tw
        use Perl::Always or die;
        my $perl_version = (5.12.5);
        print $perl_version;
Re: Scraping Webpage
by Discipulus (Canon) on Nov 19, 2013 at 08:21 UTC
    As taint said, it is an LWP task, and often even an LWP::Simple one.

    Many times these tasks can be hacked into a quasi-one-liner, as in:
    perl -e "use LWP::Simple; my $word = $ARGV[0]; map { print qq($_ ) if /<h2.*$word<\/h2>/../Link/ } split /\n/, get('http://dictionary.reference.com/browse/'.$word);"
    The final get (imported by LWP::Simple) fetches the document, which is split into lines; while the flip-flop is true, map does the printing.
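    The same logic, expanded into a small script for readability (the word is taken as the first argument; the URL and the /Link/ end marker are from the one-liner above):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $word = shift @ARGV or die "usage: $0 word\n";
    my $html = get( 'http://dictionary.reference.com/browse/' . $word )
        or die "Could not fetch the page\n";

    for my $line ( split /\n/, $html ) {
        # Flip-flop: start printing at the <h2>...$word</h2> heading,
        # stop after the first line matching /Link/.
        print "$line\n" if $line =~ /<h2.*$word<\/h2>/ .. $line =~ /Link/;
    }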

    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Have you visited the OP's URL? It's full of JavaScript; WWW::Mechanize is built on top of LWP, and neither can help with JavaScript.
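      One way around that is to drive a real browser so the JavaScript actually runs. A sketch using WWW::Mechanize::Firefox (not suggested in the thread; it assumes Firefox is running with the MozRepl extension, which the module talks to):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use WWW::Mechanize::Firefox;   # drives a real Firefox, so JavaScript executes

      # Assumes Firefox is running with the MozRepl extension enabled.
      my $mech = WWW::Mechanize::Firefox->new();
      $mech->get('https://openbook.etoro.com/dellos/stats/');

      # content() returns the DOM after the page's JavaScript has filled it in.
      print $mech->content;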