Dear All,
I have been working on a personal project to scrape price information from low-cost airline websites. PerlMonks has provided a lot of useful information, and I learnt the following during my research:
1) use of WWW::Mechanize::Firefox to handle JavaScript
2) use of XPath to wait until a certain element has been generated
3) use of the LiveHTTPHeaders Firefox plugin to study the messages being sent
The following is the problem I encountered; I would appreciate it if anyone could give me some direction:
1) I am trying to retrieve a month of price data from a low-cost airline:
http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS
2) After loading the page, there is a form to construct your query. Say I choose the following:
"One way", From Osaka - Kansai to Hong Kong, then further tick "Low Fare Calendar" to show the data of the whole month
3) After clicking on "Search", I noticed that the URL remains unchanged. So, after some research, I learnt to use LiveHTTPHeaders to study what is going on behind the scenes. The extracted info is as follows:
http://book.flypeach.com/WebService/B2cService.asmx/GetLowFareFinderMonth

POST /WebService/B2cService.asmx/GetLowFareFinderMonth HTTP/1.1
Host: book.flypeach.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
Referer: http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS
Content-Length: 296
Cookie: __utma=134592366.821879741.1427710303.1428247408.1428251315.9; __utmz=134592366.1427710303.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _ga=GA1.2.821879741.1427710303; SERVERID=book6; ASP.NET_SessionId=ciq5gzqapgonocmls3oghw45; __utmc=134592366; __utmb=134592366.1.10.1428251315; __utmt=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}

HTTP/1.1 200 OK
So now I know the following should be the key info to solve my problem:
1) The link for querying the data is http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth
2) This website makes use of JSON:
{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}
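To make the payload easier to vary (different airports or months), the same JSON can be generated from a Perl hash. This is only a sketch: the field names and values are copied from the capture above, while build_payload and its argument names (from, to, month, today) are hypothetical helpers introduced for illustration. It uses JSON::PP, which ships with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;    # core module, no extra install needed

# Hypothetical helper: returns the JSON body seen in the captured request.
# Only the airports and the two dates are parameters; everything else is
# hard-coded to the values from the capture.
sub build_payload {
    my (%args) = @_;
    my %query = (
        strFromAirport => $args{from},
        strToAirport   => $args{to},
        departMonth    => $args{month},
        returnMonth    => $args{month},
        iOneWay        => 'true',        # sent as a string in the capture
        iAdult         => 2,             # numeric in the capture
        iChild         => 0,
        iInfant        => 0,
        BoardingClass  => '',
        CurrencyCode   => 'JPY',
        strPromoCode   => '',
        SearchType     => 'FARE',
        iOther         => 0,
        otherType      => '',
        strIpAddress   => '',
        strCurrentDate => $args{today},
    );

    # canonical() sorts the keys so the output is stable between runs.
    return JSON::PP->new->canonical->encode( \%query );
}

print build_payload(
    from  => 'KIX',
    to    => 'HKG',
    month => '20150606',
    today => '20150406',
), "\n";
```

The key order differs from the capture because of the sorting; I am assuming the server does not care about key order, which I have not verified.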
So I tried to write my Perl script as follows:
#!/usr/bin/perl
#
use warnings;
use strict;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new( tab => 'current', );
$mech->post('http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth');

# Poll once per second, up to 10 times, until the result element is visible
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@id="dvAvailabilitySearch"]' )) {
    sleep 1;
};
die "Timeout" if 0 > $retries;

# Now the element exists
#$mech->click({xpath => '//*[@id="ctl00_dvOutwardResult"]'});
print $mech->content;
Sorry for taking so long to get to my questions:
1) Can I reproduce the HTTP headers I captured in order to retrieve the data?
2) I tried to use $mech->add_header; however, the Content-Type, instead of being "application/json", becomes "application/x-www-form-urlencoded", and the request ends up with a bad HTTP header.
3) I searched the WWW::Mechanize and WWW::Mechanize::Firefox documentation, cookbook and examples, as well as Stack Overflow, but I am still not sure how to add the JSON data to my request. Can anyone show me how to do this?
I believe I am pretty close to getting what I need, but I cannot work out how to write the Perl script with WWW::Mechanize::Firefox for the required JSON/AJAX request.
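For comparison, here is a minimal sketch of the request I am trying to reproduce, written without the browser. It swaps WWW::Mechanize::Firefox for the core module HTTP::Tiny, so it is not a Mechanize solution. It assumes the web service answers a plain POST without the captured cookies or any JavaScript-generated state, which I have not confirmed. build_json_post is a hypothetical helper; the URL, Referer and body are taken from the capture above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;    # core module

# Hypothetical helper: bundles the options for HTTP::Tiny->request so that
# Content-Type is set explicitly instead of falling back to form encoding.
sub build_json_post {
    my ( $json_body, $referer ) = @_;
    return {
        headers => {
            'Content-Type' => 'application/json; charset=utf-8',
            'Referer'      => $referer,
        },
        content => $json_body,    # raw JSON string, sent as the request body
    };
}

# URL and Referer as seen in the LiveHTTPHeaders capture.
my $url     = 'http://book.flypeach.com/WebService/B2cService.asmx/GetLowFareFinderMonth';
my $referer = 'http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS';

# JSON body copied from the capture, split only to keep the lines short.
my $body = join '',
    '{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606",',
    '"returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,',
    '"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"",',
    '"SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"",',
    '"strCurrentDate":"20150406"}';

# Only touch the network when asked to, e.g. `perl fares.pl --send`.
if ( @ARGV && $ARGV[0] eq '--send' ) {
    my $http     = HTTP::Tiny->new( timeout => 30 );
    my $response = $http->request( 'POST', $url, build_json_post( $body, $referer ) );

    die "Request failed: $response->{status} $response->{reason}\n"
        unless $response->{success};

    print $response->{content}, "\n";
}
```

If the server turns out to require the ASP.NET session cookie, a 'Cookie' entry could be added to the headers hash in the same way; whether that is needed is exactly what I am unsure about.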
Thanks in advance for the help
In reply to Web page scraping from AJAX page with POST JSON data by ronstudio