1. Static Content Crawling: Static content is generated once and does not change on its own; someone has to update it manually to push any change. Such pages are easy to create, but they are more exposed to security issues and their content can be crawled easily. In this case your crawler will be efficient and give accurate results, since it only has to request a page and read the details from the returned HTML.
2. Dynamic Content Crawling: Dynamic content keeps changing. Here the page is backed by server-side code that lets the server generate unique content each time the page is loaded. Because this content is generated dynamically with a variety of technologies, such pages are more secure and also much harder to crawl. Writing the crawler is considerably more difficult, and accurate data is hard to get, because class names and similar hooks are generated on load and the markup you need is not in the page source. You may therefore need extra measures, such as routing requests through a proxy or consulting the site's robots.txt.
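The robots.txt point above can be made concrete. The sketch below is only an illustration and not part of the original scripts: it assumes the WWW::RobotRules module from CPAN is installed, and the agent name `MyCrawler/1.0`, the example.com URLs, and the robots.txt rules are all made up for the example. It parses a robots.txt body and asks whether two paths may be fetched.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical crawler name; it is matched against User-agent lines
my $rules = WWW::RobotRules->new('MyCrawler/1.0');

# Made-up robots.txt content. A real crawler would download this from
# http://example.com/robots.txt before requesting any other page.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Register the rules under the host they were (notionally) fetched from
$rules->parse( 'http://example.com/robots.txt', $robots_txt );

for my $url ( 'http://example.com/index.html',
              'http://example.com/private/data.html' ) {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url: $verdict\n";
}
```

Parsing from a string keeps the example offline; in a crawler the same `allowed` check would sit in front of every `get` call.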
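Static Content Crawling:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = 'URL';    # placeholder: the page to fetch

    # autocheck => 0 so that a failed request reaches the else branch
    # instead of dying inside get()
    my $mech     = WWW::Mechanize->new( autocheck => 0 );
    my $response = $mech->get($url);

    if ( $response->is_success ) {
        print $mech->content;
    }
    else {
        die $response->status_line;
    }

Dynamic Content Crawling:

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    my $url = 'URL';    # placeholder: the page to fetch

    # WWW::Mechanize::Firefox drives a real Firefox browser,
    # so scripts on the page are run before the content is read
    my $mech = WWW::Mechanize::Firefox->new();
    $mech->get($url);

    # State code => institution IDs to select for that state
    my %arr_ref = (
        AL => [ 1795, 1276, 795, 1719, 1363, 1145, 961, 17, 18, 1995,
                977, 1910, 1691, 21, 1660, 1768 ],
        AK => [ 1145, 961, 1995, 977, 1781, 1704 ],
        AZ => [ 1873, 872, 1145, 690, 1162, 961, 918, 528, 811, 704, 529,
                1983, 931, 40, 1995, 977, 597, 1157, 530, 598, 886, 782,
                42, 691, 1945 ],
    );

    foreach my $key ( sort keys %arr_ref ) {
        print "$key :: @{$arr_ref{$key}} \n";
        $mech->field( stateUSAId => $key );

        # Select each institution in turn and dump the resulting page
        foreach ( @{ $arr_ref{$key} } ) {
            $mech->field( institutionUSAId => $_ );
            print $mech->content;
        }
    }
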
Re: Web Crawling using Perl
by stevieb (Canon) on Jun 24, 2017 at 13:32 UTC

Re: Web Crawling using Perl -- TIMTOWTDT
by Discipulus (Canon) on Jun 24, 2017 at 13:49 UTC
by ckj (Chaplain) on Jun 24, 2017 at 14:24 UTC