ronstudio has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

I have been working on a personal project that scrapes price information from low-cost airline websites. PerlMonks has provided lots of useful information, and I learnt the following during my research:

1) using WWW::Mechanize::Firefox to handle JavaScript

2) using XPath to wait until a certain element has been generated

3) using the LiveHTTPHeaders Firefox plugin to study the messages being sent


The following is the problem I encountered; I would appreciate it if anyone could give me some direction:

1) I am trying to retrieve a month of price data from a low-cost airline:

http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS


2) After loading the page, there is a form to construct your query. Say I choose the following:

"One way", From Osaka - Kansai to Hong Kong, then further tick "Low Fare Calendar" to show the data of the whole month


3) After clicking on "Search", I noticed that the URL remains unchanged. So, after some research, I learnt to use LiveHTTPHeaders to study what is going on behind the scenes. The extracted info is as follows:

http://book.flypeach.com/WebService/B2cService.asmx/GetLowFareFinderMonth

POST /WebService/B2cService.asmx/GetLowFareFinderMonth HTTP/1.1
Host: book.flypeach.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
Referer: http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS
Content-Length: 296
Cookie: __utma=134592366.821879741.1427710303.1428247408.1428251315.9; __utmz=134592366.1427710303.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _ga=GA1.2.821879741.1427710303; SERVERID=book6; ASP.NET_SessionId=ciq5gzqapgonocmls3oghw45; __utmc=134592366; __utmb=134592366.1.10.1428251315; __utmt=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}

HTTP/1.1 200 OK

So now I know the following should be the key info to solve my problem:

1) The link for querying the data is http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth

2) This website makes use of JSON:

{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}
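Rather than pasting that JSON string by hand, the body could be generated from a Perl hash, which makes it easier to vary the month or the airports. The following is only a minimal sketch using the core JSON::PP module. The helper name build_fare_payload and its named arguments are hypothetical, the field names are copied from the capture above, and the meaning of each field (for example that departMonth takes a YYYYMMDD date) is an assumption based on the captured values.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: builds the JSON body seen in the captured request.
# Field names come from the capture; their exact semantics are assumed.
sub build_fare_payload {
    my (%arg) = @_;
    my %payload = (
        strFromAirport => $arg{from},
        strToAirport   => $arg{to},
        departMonth    => $arg{month},        # e.g. "20150606", as captured
        returnMonth    => $arg{month},
        iOneWay        => 'true',             # a string in the capture, not a boolean
        iAdult         => 0 + $arg{adults},   # forced numeric so it is encoded unquoted
        iChild         => 0,
        iInfant        => 0,
        BoardingClass  => '',
        CurrencyCode   => 'JPY',
        strPromoCode   => '',
        SearchType     => 'FARE',
        iOther         => 0,
        otherType      => '',
        strIpAddress   => '',
        strCurrentDate => $arg{today},
    );
    # canonical() sorts the keys so the output is reproducible
    return JSON::PP->new->canonical->encode(\%payload);
}

my $json = build_fare_payload(
    from   => 'KIX',
    to     => 'HKG',
    month  => '20150606',
    adults => 2,
    today  => '20150406',
);
print "$json\n";
```

The key order differs from the capture because the keys are sorted; JSON objects are unordered, so this normally does not matter, but whether this particular service cares is untested.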

So I tried to write my Perl script as follows:

#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    tab => 'current',
);
$mech->post('http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth');

# Poll up to 10 times, one second apart, until the search element is visible
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@id="dvAvailabilitySearch"]' )) {
    sleep 1;
};
die "Timeout" if 0 > $retries;

# Now the element exists
#$mech->click({xpath => '//*[@id="ctl00_dvOutwardResult"]'});
print $mech->content;

Sorry for taking so long to state my questions:

1) Could I simulate the HTTP headers that I captured in order to retrieve the data?

2) I tried to use $mech->add_header; however, instead of "application/json" the Content-Type becomes "application/x-www-form-urlencoded" and ends up as a bad HTTP header.

3) I searched the help, cookbook and examples for WWW::Mechanize and WWW::Mechanize::Firefox, as well as Stack Overflow, but I am still not sure how to add the JSON data to my request. Can anyone teach me how to do this?

I believe I am pretty close to getting what I need, but I cannot work out how to write the Perl script with WWW::Mechanize::Firefox for the required JSON/AJAX request.

Thanks in advance for the help.

Replies are listed 'Best First'.
Re: Web page scraping from AJAX page with POST JSON data
by Corion (Patriarch) on Apr 05, 2015 at 17:35 UTC

    I'm not sure how to change the Content-Type header with HTTP::Request::Common (for plain WWW::Mechanize), but it should at least give you an HTTP::Request whose Content-Type you can change later on:

    use strict;
    use HTTP::Request::Common;

    # Build (but do not send) a POST request carrying the captured JSON body
    my $request = POST 'http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth',
        Content_Type => 'application/json',
        Content      => '{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}',
    ;
    print $request->as_string;
    __END__
    POST http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth
    Content-Length: 296
    Content-Type: application/json

    {"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}

    If you want to keep using WWW::Mechanize::Firefox, note that its support for adding custom headers is somewhat limited, but the same approach could still work. But if you already use WWW::Mechanize::Firefox, why not just keep automating the complete website and programmatically click your way to the data you want?
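For the plain LWP route, such a request object could then be handed to a user agent. The following is a minimal sketch, not a tested solution for this site: the helper names json_post_request and send_json are hypothetical, the JSON body is shortened for illustration, and the endpoint is the one quoted in the question (the capture shows a different method name, so the URL is an assumption).

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# Endpoint as quoted in the question; treat it as an assumption.
my $url = 'http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth';

# Hypothetical helper: wraps a ready-made JSON string in a POST request.
# Giving Content_Type explicitly prevents the form-urlencoded default.
sub json_post_request {
    my ($endpoint, $json) = @_;
    return POST(
        $endpoint,
        Content_Type => 'application/json; charset=utf-8',
        Referer      => 'http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS',
        Content      => $json,
    );
}

# Hypothetical helper: sends the request and returns the response body.
sub send_json {
    my ($ua, $request) = @_;
    my $response = $ua->request($request);
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Shortened body for illustration; the real one is in the capture.
my $request = json_post_request($url, '{"strFromAirport":"KIX","strToAirport":"HKG"}');
print $request->as_string;

# To actually contact the server (not run here):
# my $ua = LWP::UserAgent->new(cookie_jar => {});
# print send_json($ua, $request);
```

Because plain WWW::Mechanize is a subclass of LWP::UserAgent, a WWW::Mechanize object can be passed to send_json in place of $ua. WWW::Mechanize::Firefox drives a real browser instead, so this sketch does not apply to it directly.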

      Hi Corion, thanks for your reply!

      I just didn't realize I could navigate the website further with WWW::Mechanize::Firefox. I kept wondering whether there was any way to extract the price data table directly.

      Let me dig further into this, thanks very much!!!

      btw, are you the author of http://corion.net/talks/web-scraping-with-perl/web-scraping-with-perl.en.html ? I have read some of the examples there. Thanks very much for your help and for the info on that page.

        Hi Corion,

        After a good sleep and your advice, I found that I can use the key "data" to pass the JSON values, and I am now successfully sending the HTTP request exactly as in manual browsing.

        In order to get the price table I need, I have to browse through the site's pages in sequence (first the form dialogue, then sending the JSON request). At least I can get what I need by following your advice to simulate what the browser does manually!

        Thanks~~
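The page sequence described above (form page first, then the JSON request) could also be tried without a browser. This is only a sketch under assumptions: that the server needs the session cookie it sets on the form page (an ASP.NET_SessionId cookie appears in the capture), and that nothing else generated by the page's JavaScript is required. The helper name fetch_with_session is hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# Hypothetical helper: load the search form first so the server can set its
# session cookie, then post the JSON query with the same user agent.
sub fetch_with_session {
    my ($ua, $form_url, $service_url, $json) = @_;

    # Step 1: visit the form page; cookies land in the agent's cookie jar
    my $form = $ua->get($form_url);
    die 'Form page failed: ' . $form->status_line . "\n"
        unless $form->is_success;

    # Step 2: send the JSON body, reusing the cookies from step 1
    my $reply = $ua->request(
        POST(
            $service_url,
            Content_Type => 'application/json; charset=utf-8',
            Referer      => $form_url,
            Content      => $json,
        )
    );
    die 'Query failed: ' . $reply->status_line . "\n"
        unless $reply->is_success;

    return $reply->decoded_content;
}

# Usage (not run here, since it contacts the live site):
# my $ua = LWP::UserAgent->new(cookie_jar => {});   # in-memory cookie jar
# print fetch_with_session(
#     $ua,
#     'http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS',
#     'http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth',
#     $json,    # the JSON body from the capture
# );
```

If the service still rejects the request, driving the real browser with WWW::Mechanize::Firefox, as was done in this thread, remains the safer option.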