Here, I'm going to discuss the steps to do web crawling, in any language or technology. Crawler, scraper, spider, bot: all synonyms for the same thing, a program that is basically meant to copy content from a site. A crawler needs to be configured according to the target website. Please share your feedback; if it's good, I can post this on a larger platform as well, where I'm planning to discuss crawling in Java etc.:

1. Static Content Crawling: Static content is generated once and doesn't keep changing on its own; it needs manual intervention and an update to push any change. Such pages are easy to create, but they also offer little protection, so their content can be crawled easily. In this case your crawler will be very efficient and return accurate results, since it simply has to fetch a page and extract the details.

2. Dynamic Content Crawling: Dynamic content keeps changing; the page contains server-side code that lets the server generate unique content every time the page is loaded. Since this content is generated dynamically, it is much better protected and much more difficult to crawl. Building the crawler is harder and getting accurate data may not even be possible, as class names etc. are generated on load and the HTML you need is not available in the page source. Hence you may have to use extra functionality here, such as routing requests through a proxy or honoring the site's robots.txt (a sketch of both follows this list).
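A minimal sketch of those last two points, using LWP::RobotUA from the libwww-perl distribution, which fetches and obeys each site's robots.txt automatically; the proxy address and target URL below are placeholders, not real endpoints:

use strict;
use warnings;
use LWP::RobotUA;

# identify the bot honestly; both fields are required by LWP::RobotUA
my $ua = LWP::RobotUA->new(
    agent => 'my-crawler/0.1',
    from  => 'crawler-admin@example.com',
);
$ua->delay(1);    # wait 1 minute between requests to the same host

# optionally route traffic through a proxy (placeholder address)
$ua->proxy( [ 'http', 'https' ], 'http://proxy.example.com:8080' );

my $response = $ua->get('http://example.com/');
if ( $response->is_success ) {
    print $response->decoded_content;
}
else {
    # also reports requests refused because robots.txt forbids them
    die $response->status_line;
}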

Static Content Crawling
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so a failed fetch reaches the else branch below
# instead of making get() die on its own
my $mech     = WWW::Mechanize->new( autocheck => 0 );
my $response = $mech->get($url);    # $url holds the target address

if ( $response->is_success ) {
    print $mech->content;
}
else {
    die $response->status_line;
}
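Getting the "details" usually means parsing rather than just printing, and WWW::Mechanize already exposes the parsed links, so a possible follow-up (a sketch only, reusing the $mech object from above) could be:

# list every link on the fetched page
for my $link ( $mech->links ) {
    printf "%s => %s\n", ( $link->text // '' ), $link->url;
}

# or narrow the search, e.g. to links whose text looks like a US state code
my @state_links = $mech->find_all_links( text_regex => qr/^[A-Z]{2}$/ );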
Dynamic Content Crawling
use strict;
use warnings;
use WWW::Mechanize::Firefox;    # drives a real Firefox instance

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);    # $url holds the target address

# institution IDs keyed by US state code
my %arr_ref = (
    AL => [ 1795, 1276, 795, 1719, 1363, 1145, 961, 17, 18, 1995,
            977, 1910, 1691, 21, 1660, 1768 ],
    AK => [ 1145, 961, 1995, 977, 1781, 1704 ],
    AZ => [ 1873, 872, 1145, 690, 1162, 961, 918, 528, 811, 704,
            529, 1983, 931, 40, 1995, 977, 597, 1157, 530, 598,
            886, 782, 42, 691, 1945 ],
);

foreach my $key ( sort keys %arr_ref ) {
    print "$key :: @{ $arr_ref{$key} }\n";
    $mech->field( stateUSAId => $key );
    foreach ( @{ $arr_ref{$key} } ) {
        $mech->field( institutionUSAId => $_ );
        # a form submit would normally go here before reading the result
        print $mech->content;
    }
}
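In practice you will usually want to persist each fetched page rather than print it; a minimal sketch, assuming the %arr_ref structure and $mech object from the block above (the output directory name is a placeholder):

use File::Path qw(make_path);

make_path('crawl_output');
foreach my $state ( sort keys %arr_ref ) {
    $mech->field( stateUSAId => $state );
    foreach my $inst_id ( @{ $arr_ref{$state} } ) {
        $mech->field( institutionUSAId => $inst_id );
        # one file per state/institution pair
        open my $fh, '>:encoding(UTF-8)',
            "crawl_output/${state}_${inst_id}.html"
            or die "cannot write: $!";
        print {$fh} $mech->content;
        close $fh;
    }
}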

Re: Web Crawling using Perl
by stevieb (Canon) on Jun 24, 2017 at 13:32 UTC

    One way to tidy things up a bit would be to stuff all of your if conditions into a hash, then do a single if check. First, I put the statement with all the states into a temporary script:

    use warnings;
    use strict;
    use Data::Dumper;

    # slurp the statement below and pull out every quoted two-letter state code
    my $str = <DATA>;
    my %states = map { $_ => 1 } $str =~ /'([A-Z]{2})'/g;

    print Dumper \%states;

    # note: the statement must stay on one line, since <DATA> is read once
    __DATA__
    if($link->text ne 'AK' && $link->text ne 'KY' && $link->text ne 'AS' && $link->text ne 'MA' && $link->text ne 'MI' && $link->text ne 'CO' && $link->text ne 'DC' && $link->text ne 'GA' && $link->text ne 'IN' && $link->text ne 'MD' && $link->text ne 'CT' && $link->text ne 'AR' && $link->text ne 'ID' && $link->text ne 'IL' && $link->text ne 'CA' && $link->text ne 'AL' && $link->text ne 'ME' && $link->text ne 'DE' && $link->text ne 'GU' && $link->text ne 'FL' && $link->text ne 'IA' && $link->text ne 'LA' && $link->text ne 'HI' && $link->text ne 'KS' && $link->text ne 'AZ')

    From the output, copy the following part and paste it back into the original script in the form of a hash:

    'IN' => 1, 'FL' => 1, 'MD' => 1, 'MA' => 1, 'GU' => 1,
    'DE' => 1, 'ID' => 1, 'KS' => 1, 'IA' => 1, 'LA' => 1,
    'KY' => 1, 'ME' => 1, 'AR' => 1, 'HI' => 1, 'AK' => 1,
    'GA' => 1, 'MI' => 1, 'AZ' => 1, 'CO' => 1, 'DC' => 1,
    'AS' => 1, 'CA' => 1, 'IL' => 1, 'AL' => 1, 'CT' => 1

    Original script:

    ...
    my %states = (
        'IN' => 1, 'FL' => 1, 'MD' => 1, 'MA' => 1, 'GU' => 1,
        'DE' => 1, 'ID' => 1, 'KS' => 1, 'IA' => 1, 'LA' => 1,
        'KY' => 1, 'ME' => 1, 'AR' => 1, 'HI' => 1, 'AK' => 1,
        'GA' => 1, 'MI' => 1, 'AZ' => 1, 'CO' => 1, 'DC' => 1,
        'AS' => 1, 'CA' => 1, 'IL' => 1, 'AL' => 1, 'CT' => 1,
    );

    Now your if statement can be reduced to the following, you can see at a glance which states you have, and you can add or remove them without scrolling through one great big statement:

    if (! $states{$link->text}){ ... }

    With all that said, I'd probably put the list of states into an array so it takes up less vertical space, and create the hash from the array instead:

    my @state_abbrs = qw(
        IL MI AZ CT ...
        CO DC AS GA ...
    );
    my %states = map { $_ => 1 } @state_abbrs;

    if ( !$states{ $link->text } ) {
        ...
    }
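    To show how the lookup hash fits into the crawl loop itself, here is a hedged sketch assuming a WWW::Mechanize-style $mech as in the root node; the surrounding flow is illustrative only:

    my %states = map { $_ => 1 } qw(
        AK KY AS MA MI CO DC GA IN MD CT AR
        ID IL CA AL ME DE GU FL IA LA HI KS AZ
    );

    for my $link ( $mech->links ) {
        next unless defined $link->text;
        next if $states{ $link->text };    # skip the states already handled
        $mech->get( $link->url_abs );      # follow the link
        print $mech->content;
        $mech->back;                       # return to the listing page
    }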
Re: Web Crawling using Perl -- TIMTOWTDT
by Discipulus (Canon) on Jun 24, 2017 at 13:49 UTC
    Hello ckj, the subject is very interesting,

    it's just that I do not understand when you say:

    > if it's good then I can post it on a larger platform

    Is there something larger than PerlMonks? (Kidding, but perhaps you should ask authors for permission to repost their code on another platform.. for example, I'd prefer that you just link to my present post from wherever you want.)

    Now, I'm not an expert at scraping the web, but there is something much simpler to use when starting to scrape; consider what I use to extract titles from nodes I want to bookmark, where I also add some HTML tags to put the result into an unordered list:

    io@COMP:C> perl -MLWP::UserAgent -e "print qq(<li>[id://$ARGV[0]|).LWP::UserAgent->new->get('http://www.perlmonks.org/index.pl?node_id='.$ARGV[0])->title,']</li>'" 1193449
    <li>[id://1193449|Web Crawling using Perl]</li>
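    The same one-liner, unrolled into a readable script for reference (the node id is taken from the command line exactly as above):

    use strict;
    use warnings;
    use LWP::UserAgent ();

    my $node_id  = $ARGV[0];
    my $response = LWP::UserAgent->new->get(
        'http://www.perlmonks.org/index.pl?node_id=' . $node_id
    );

    # ->title is the parsed <title> of the fetched page
    print "<li>[id://$node_id|", $response->title, "]</li>\n";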

    Just one step further, you can get the whole content in a few lines:

    use strict;
    use warnings;
    use LWP::UserAgent ();

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get('http://www.perlmonks.org/?node_id=1193449');

    if ( $response->is_success ) {
        print $response->decoded_content;
    }
    else {
        die $response->status_line;
    }

    Also, scraping is not just "to copy content": you can extract from, or examine, the response. I've done this in my webtimeload023.pl:

    # just monitoring --verbosity 0 --count 4 --sleep 5
    perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 0 -c 4 -s 5
    Sat Jun 24 15:34:09 2017 http://www.perlmonks.org/?node_id=1193449 200 99562 2.126046 45.7321 Kb/s
    Sat Jun 24 15:34:16 2017 http://www.perlmonks.org/?node_id=1193449 200 99599 1.986645 48.9592 Kb/s
    Sat Jun 24 15:34:23 2017 http://www.perlmonks.org/?node_id=1193449 200 99192 2.064141 46.9286 Kb/s
    Sat Jun 24 15:34:30 2017 http://www.perlmonks.org/?node_id=1193449 200 98852 1.972459 48.9415 Kb/s

    # some detail more with --verbosity 4
    perl webtimeload023.pl -u http://www.perlmonks.org/?node_id=1193449 -v 4
    ====================================================================
    http://www.perlmonks.org/?node_id=1193449          Sat Jun 24 15:34:40 2017
    --------------------------------------------------------------------
    Response code: 200
    Response message: OK
    Response server: Apache/2.4.26
    Response declared length: UNDEF
    Response title: Web Crawling using Perl
    --------------------------------------------------------------------
    main page content (1): 31.8506 Kb in 1.248003 seconds @ 25.5212 Kb/s
    --------------------------------------------------------------------
    detail of loaded pages (url):
    --------------------------------------------------------------------
    http://www.perlmonks.org/?node_id=1193449
    --------------------------------------------------------------------
    no included content found.
    external content (1): 64.6660 Kb in 0.723429 seconds @ 89.3882 Kb/s
    no broken links found.
    --------------------------------------------------------------------
    detail of loaded content (url bytes seconds):
    --------------------------------------------------------------------
    http://promote.pair.com/i/pair-banner-current.gif    66218    0.723429
    --------------------------------------------------------------------
    downloaded 96.5166 Kb (98833 bytes) in 1.971432 seconds (48.9576 Kb/s)
    ====================================================================
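    For reference, a minimal sketch of the kind of measurement webtimeload023.pl prints above: time a single fetch with Time::HiRes and report the size and the speed (an illustrative reduction, not my actual script):

    use strict;
    use warnings;
    use LWP::UserAgent ();
    use Time::HiRes qw(gettimeofday tv_interval);

    my $url = 'http://www.perlmonks.org/?node_id=1193449';
    my $ua  = LWP::UserAgent->new;

    my $t0       = [gettimeofday];
    my $response = $ua->get($url);
    my $elapsed  = tv_interval($t0);

    my $bytes = length $response->content;
    printf "%s %s %d %d %.6f %.4f Kb/s\n",
        scalar localtime, $url, $response->code,
        $bytes, $elapsed, ( $bytes / 1024 ) / $elapsed;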

    As you can see, I have just experimented with LWP::UserAgent.. Many more good possibilities are present on CPAN, both for scraping and for parsing the results.

    You may also be interested in reading the following wise threads:

    The above threads link to much other useful information; other links are in my homenode.

    When next year (;=) you have tried them all, I'll be very glad to hear your opinion.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks for your valuable feedback, Discipulus. I have already started going through the information you shared and will get back with my opinion (next year is way too far). The code I pasted here was written by me a long time ago, and yes, simpler code would make more sense to post rather than displaying all the functionality. Lastly, regarding "> if it's good then I can post it on a larger platform": what I actually meant was putting this article inside a bigger article where crawling in multiple languages is explained :) . Thanks again for your feedback; I can probably work on my article in a better way now.