1. Static Content Crawling: Static content is generated once and does not change on its own; it needs manual intervention to push any update. Such pages are easy to create, but by the same token their content can be crawled easily. In this case your crawler can be very efficient and return accurate results, since it simply has to fetch the page and read its HTML.
2. Dynamic Content Crawling: Dynamic content keeps changing; the page contains server-side code that lets the server generate unique content every time the page is loaded. Because the content is generated on the fly by different technologies, it is harder to scrape and much more difficult to crawl. Building a crawler here is difficult, and getting accurate data is not always possible, because class names and the like are generated on load and there is no stable HTML on the page to target. You may therefore need extra machinery, such as routing requests through a proxy, checking the site's robots.txt, or driving a real browser.
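As a small illustration of that extra machinery, the sketch below routes WWW::Mechanize through a proxy and consults a robots.txt before crawling. The proxy address is a placeholder, and the robots.txt content is inlined here so the example runs offline; WWW::RobotRules ships with libwww-perl.

```perl
use strict;
use warnings;
use WWW::Mechanize;
use WWW::RobotRules;

my $mech = WWW::Mechanize->new( agent => 'MyCrawler/1.0' );

# Route HTTP and HTTPS requests through a proxy
# (hypothetical address -- replace with your own)
$mech->proxy( [ 'http', 'https' ], 'http://127.0.0.1:8080' );

# Parse a site's robots.txt before crawling it; a literal string
# stands in for the fetched file so this runs without a network
my $rules = WWW::RobotRules->new('MyCrawler/1.0');
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END
$rules->parse( 'http://example.com/robots.txt', $robots_txt );

# Only fetch pages the rules allow
print $rules->allowed('http://example.com/index.html')
    ? "allowed\n"
    : "blocked\n";
```

In a real crawler you would fetch robots.txt once per host and call `$rules->allowed($url)` before every `$mech->get($url)`.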
Static Content Crawling:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $response = $mech->get($url);    # $url holds the page address
if ( $response->is_success ) {
    print $mech->content;
}
else {
    die $response->status_line;
}

Dynamic Content Crawling:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);    # $url holds the search page address

# State code => list of institution IDs to query
my %arr_ref = (
    AL => [ 1795, 1276, 795, 1719, 1363, 1145, 961, 17, 18, 1995, 977, 1910, 1691, 21, 1660, 1768 ],
    AK => [ 1145, 961, 1995, 977, 1781, 1704 ],
    AZ => [ 1873, 872, 1145, 690, 1162, 961, 918, 528, 811, 704, 529, 1983, 931, 40, 1995, 977, 597, 1157, 530, 598, 886, 782, 42, 691, 1945 ],
);

foreach my $key ( sort keys %arr_ref ) {
    print "$key :: @{ $arr_ref{$key} }\n";
    $mech->field( stateUSAId => $key );
    foreach ( @{ $arr_ref{$key} } ) {
        $mech->field( institutionUSAId => $_ );
        print $mech->content;
    }
}
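Once `$mech->content` returns the page HTML, a proper parser such as HTML::TreeBuilder (from the HTML::Tree distribution) is usually more reliable than raw regexes for pulling values out. A minimal sketch, with a literal string standing in for `$mech->content` and hypothetical markup:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A literal string stands in for $mech->content here
my $html = '<html><body><table>'
         . '<tr><td>Alabama</td><td>1795</td></tr>'
         . '</table></body></html>';

# Build a parse tree from the fetched page
my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the text of every table cell
my @cells = map { $_->as_text } $tree->look_down( _tag => 'td' );
print join( ', ', @cells ), "\n";    # Alabama, 1795

$tree->delete;    # free the parse tree
```

`look_down` can also match on attributes (e.g. `class => 'result'`), which helps when the markup is stable even if the surrounding layout is not.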
In reply to Web Crawling using Perl by ckj