Re: need help in scraping asp site

Please check the code:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my @urls = ('http://www.folkeferie.dk/da/ferier/Aktuelle-chartertilbud---afbudsrejser/');
foreach my $url (@urls) {
    $mech->get($url);
    my $hsh={};
    $links = $mech->find_all_links(url_regex=>qr/templates\/textPage\.aspx\?id/i, text_regex=>qr/Afbudsrejser/i);
    foreach my $link (@$links) {
        $url = $link->url_abs();
        print "\n\n\n".$url."\n\n";
        $mech->get($url);
        my $content = $mech->content();
        print $content;
        while ($content=~/tr class="bgrow1"><td>(.*?)<\/td><td class="countryValue">(.*?)<\/td><td class="destnameValue">(.*?)<\/td><td class="hotelNameValue">(.*?)<\/td><td class="durationValue">(.*?)<\/td><td align="RIGHT" class="priceValue"><a target="_blank" href="(.*?)">(.*?)<\/a><\/td>/gisxm) {
            $hsh->{'url'} = $6;
            $hsh->{'crap_id'} = '';
            $hsh->{'date'} = $1;
            $hsh->{'country'} = $2;
            $hsh->{'destination'} = $3;
            $hsh->{'trip_type'} = $4;
            $hsh->{'trip_length'} = $5;
            $hsh->{'price'}=$7;
            print "$hsh->{'date'}, $hsh->{'country'}, $hsh->{'destination'}, $hsh->{'trip_type'}, $hsh->{'trip_length'}, $hsh->{'price'}, $hsh->{'crap_id'}, $hsh->{'url'}, $airport\n\n";
        }
    }
}

The site is built with ASP, so the page source is not clean HTML. That is why I am having a lot of trouble fetching data from this site.

Re^2: need help in scraping asp site by Athanasius (Cardinal) on Sep 06, 2012 at 07:31 UTC
When added to a regex, the `x` modifier tells the regex engine to ignore whitespace in the pattern; that is, the spaces, etc., in the regex are omitted from the pattern to be matched. So, if you are trying to match something like:

    <td class="countryValue">
    #  ^ note the space

and your regex has an `x` modifier, you must specify the space(s) to be matched explicitly. For example:

    <td \s+ class="countryValue">

That said, when I run your code with this fix applied:

    while ($content =~ m!
        tr \s+ class="bgrow1">
        <td> (.*?)                                      # $1
        </td>
        <td \s+ class="countryValue"> (.*?)             # $2 country
        </td>
        <td \s+ class="destnameValue"> (.*?)            # $3 destination
        </td>
        <td \s+ class="hotelNameValue"> (.*?)           # $4
        </td>
        <td \s+ class="durationValue"> (.*?)            # $5 trip_length
        </td>
        <td \s+ align="RIGHT" \s+ class="priceValue">
        <a \s+ target="_blank" \s+ href="(.*?)">        # $6 url
        (.*?)                                           # $7
        </a> </td>
    !gisxm)

the regex still gets no matches, so there is more wrong than just the missing whitespace. (Or, there is more whitespace lurking in the target webpages than I have allowed for.)

For further help from the monks, please follow the advice given above by davido, and reduce your problem to a minimal code snippet demonstrating the problem and complete with representative data.

BTW, the variable `$airport` is accessed in the final `print` statement, but never initialized. You would have seen this if you had begun the script with

    use strict;
    use warnings;

as Gangabass advised in Re: How to scraper ASP websites.

Athanasius <°(((>< contra mundum
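The effect of the `x` modifier described above can be reproduced without touching the website. The following is only a minimal sketch: the sample row in `$content` is made up for illustration (it is not taken from the real page), and the pattern is a shortened version of the one in the original script.

```perl
use strict;
use warnings;

# Hypothetical sample row, modelled on the markup the original regex targets.
my $content = '<tr class="bgrow1"><td>06-09-2012</td>'
            . '<td class="countryValue">Spain</td></tr>';

# Under /x the literal space between "td" and "class" is discarded, so this
# pattern effectively looks for <tdclass="countryValue"> and cannot match.
my @broken = $content =~ m{<td class="countryValue">(.*?)</td>}isx;

# Here the whitespace is matched explicitly with \s+, so /x does no harm.
my @fixed = $content =~ m{<td \s+ class="countryValue"> (.*?) </td>}isx;

printf "without \\s+: %d match(es)\n", scalar @broken;
printf "with \\s+:    %d match(es)\n", scalar @fixed;
print  "country: $fixed[0]\n" if @fixed;
```

A snippet of this size, with its data inline, is also the kind of minimal test case that makes it easy for others to reproduce a problem.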
Re^3: need help in scraping asp site by Anonymous Monk on Sep 06, 2012 at 07:44 UTC
Thanks for your reply, but my concern is that `$content` does not hold the page contents in a proper format, so the regex will not work either, since the source contains ASP and JavaScript syntax. Please try to run this program and let me know if you are able to produce the output.
Re^4: need help in scraping asp site by Corion (Patriarch) on Sep 06, 2012 at 07:57 UTC
WWW::Mechanize does not handle JavaScript. The WWW::Mechanize documentation clearly states that.
Re^5: need help in scraping asp site by Anonymous Monk on Sep 06, 2012 at 08:05 UTC
Re^6: need help in scraping asp site by Corion (Patriarch) on Sep 06, 2012 at 08:12 UTC
Re^4: need help in scraping asp site by marto (Cardinal) on Sep 06, 2012 at 08:42 UTC
As you've been told several times, WWW::Mechanize does not work with JavaScript. Also, regarding "the source code are having asp": Active Server Pages is a server-side technology, so you won't see ASP code in the page contents. There is no need to open a new thread since you already have one: How to scraper ASP websites.
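Since WWW::Mechanize only sees what the server sends, one quick check is whether the wanted rows are present in that raw HTML at all. The following is a rough sketch, not a definitive test: `has_static_rows` is a hypothetical helper, and both sample strings are invented stand-ins for what `$mech->content()` might return.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the rows we want are already present
# in the HTML sent by the server, i.e. before any JavaScript has run.
sub has_static_rows {
    my ($html) = @_;
    return $html =~ /<tr \s+ class="bgrow1"/ix ? 1 : 0;
}

# Made-up stand-ins for the string returned by $mech->content().
my $server_rendered = '<table><tr class="bgrow1"><td>06-09-2012</td></tr></table>';
my $script_rendered = '<div id="offers"></div><script>loadOffers();</script>';

print "server-rendered page: ", has_static_rows($server_rendered), "\n";
print "script-rendered page: ", has_static_rows($script_rendered), "\n";
```

If the check fails on the real content, the rows are probably filled in by script after the page loads, and no regex applied to `$mech->content()` will find them.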
Re^5: need help in scraping asp site by Anonymous Monk on Sep 06, 2012 at 18:53 UTC
Re^6: need help in scraping asp site by Corion (Patriarch) on Sep 06, 2012 at 18:58 UTC