Hello ckj, the subject is very interesting;

I just do not understand what you mean when you say:

> if it's good then I can post it on a larger platform

Is there something larger than PerlMonks? (Kidding, but perhaps you should ask the authors for permission before reposting their code on another platform; for example, I'd prefer that you just link to my present post from wherever you want.)

Now, I'm not an expert at scraping the web, but there is something much simpler to start scraping with; consider what I use to extract titles from nodes I want to bookmark, where I also add some HTML tags to put the result into an unordered list:

    io@COMP:C> perl -MLWP::UserAgent -e "print qq(<li>[id://$ARGV[0]|).LWP::UserAgent->new->get('http://www.perlmonks.org/index.pl?node_id='.$ARGV[0])->title,']</li>'" 1193449
    <li>[id://1193449|Web Crawling using Perl]</li>
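
Written out as a small script, the one-liner above might look like the sketch below. The helper name `node_li`, the file name in the comment and the error check are illustrative additions, not part of the original command; it also assumes HTML::HeadParser is installed, so that LWP can fill in the title.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

# node_li(): hypothetical helper, formats a PerlMonks [id://...] link
# as an HTML list item, exactly as the one-liner prints it.
sub node_li {
    my ($node_id, $title) = @_;
    return "<li>[id://$node_id|$title]</li>";
}

# Fetch only when a node id is given on the command line,
# e.g.: perl node_li.pl 1193449   (node_li.pl is a made-up file name)
if (defined(my $node_id = shift @ARGV)) {
    my $url      = 'http://www.perlmonks.org/index.pl?node_id=' . $node_id;
    my $response = LWP::UserAgent->new->get($url);
    die $response->status_line, "\n" unless $response->is_success;

    # title() is taken from the HTML <head>, which LWP parses by default
    print node_li($node_id, $response->title // 'untitled'), "\n";
}
```

Keeping the formatting in its own function means it can be checked without touching the network.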

One step further, you can get the content in a few lines:

    use strict;
    use warnings;
    use LWP::UserAgent ();

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get('http://www.perlmonks.org/?node_id=1193449');

    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        die $response->status_line;
    }

Also, scraping is not just "copying content": you can extract data from the response or examine it. I've done this in my webtimeload023.pl:

    # just monitoring --verbosity 0 --count 4 --sleep 5
    perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 0 -c 4 -s 5
    Sat Jun 24 15:34:09 2017 http://www.perlmonks.org/?node_id=1193449 200 99562 2.126046 45.7321 Kb/s
    Sat Jun 24 15:34:16 2017 http://www.perlmonks.org/?node_id=1193449 200 99599 1.986645 48.9592 Kb/s
    Sat Jun 24 15:34:23 2017 http://www.perlmonks.org/?node_id=1193449 200 99192 2.064141 46.9286 Kb/s
    Sat Jun 24 15:34:30 2017 http://www.perlmonks.org/?node_id=1193449 200 98852 1.972459 48.9415 Kb/s

    # some detail more with --verbosity 4
    perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 4
    ====================================================================
    http://www.perlmonks.org/?node_id=1193449 Sat Jun 24 15:34:40 2017
    --------------------------------------------------------------------
    Response code: 200
    Response message: OK
    Response server: Apache/2.4.26
    Response declared length: UNDEF
    Response title: Web Crawling using Perl
    --------------------------------------------------------------------
    main page content (1): 31.8506 Kb in 1.248003 seconds @ 25.5212 Kb/s)
    --------------------------------------------------------------------
    detail of loaded pages (url):
    --------------------------------------------------------------------
    http://www.perlmonks.org/?node_id=1193449
    --------------------------------------------------------------------
    no included content found.
    external content (1): 64.6660 Kb in 0.723429 seconds @ 89.3882 Kb/s)
    no broken links found.
    --------------------------------------------------------------------
    detail of loaded content (url bytes seconds):
    --------------------------------------------------------------------
    http://promote.pair.com/i/pair-banner-current.gif 66218 0.723429
    --------------------------------------------------------------------
    downloaded 96.5166 Kb (98833 bytes) in 1.971432 seconds (48.9576 Kb/s)
    ====================================================================
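
A minimal sketch of how figures like those above could be collected with LWP::UserAgent and Time::HiRes. This is not the code of webtimeload023.pl: the helper name `summarize_response` is hypothetical, and the Kb figure assumes 1 Kb = 1024 bytes, which is consistent with the "96.5166 Kb (98833 bytes)" line in the output.

```perl
use strict;
use warnings;
use LWP::UserAgent ();
use Time::HiRes qw(time);

# summarize_response(): hypothetical helper, not taken from webtimeload023.pl.
# Takes an HTTP::Response and the elapsed seconds, returns a hash ref of facts.
sub summarize_response {
    my ($response, $seconds) = @_;
    my $bytes = length( $response->content );   # raw body size in bytes
    my $kb    = $bytes / 1024;                  # assumption: 1 Kb = 1024 bytes
    return {
        code    => $response->code,
        message => $response->message,
        server  => $response->header('Server') // 'UNDEF',
        length  => $response->content_length   // 'UNDEF',   # declared length, if any
        title   => $response->title            // 'UNDEF',
        bytes   => $bytes,
        kb      => $kb,
        seconds => $seconds,
        rate    => $seconds > 0 ? $kb / $seconds : 0,         # Kb/s
    };
}

# Only touch the network when a URL is given on the command line.
if (defined(my $url = shift @ARGV)) {
    my $ua       = LWP::UserAgent->new( timeout => 30 );
    my $start    = time();
    my $response = $ua->get($url);
    my $info     = summarize_response( $response, time() - $start );

    printf "Response %s: %s\n", $_, $info->{$_}
        for qw(code message server length title);
    printf "downloaded %.4f Kb (%d bytes) in %.6f seconds (%.4f Kb/s)\n",
        @{$info}{qw(kb bytes seconds rate)};
}
```

It only times the main page; unlike the verbose output above, it does not follow included or external content and does not check for broken links.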

As you can see, I have only experimented with LWP::UserAgent. Many more good options are available on CPAN, both for scraping and for parsing the results. In mixed order:

You may also be interested in reading the following wise threads:

The threads above link to a lot of other useful information; other links are on my homenode.

When, next year (;=), you have tried them all, I'll be very glad to hear your opinion.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; maybe one day you reinvent one of THE WHEELS.

In reply to Re: Web Crawling using Perl -- TIMTOWTDT by Discipulus
in thread Web Crawling using Perl by ckj
