Dear All,

I have been working on a personal project that scrapes price information from low-cost airline websites. PerlMonks has provided lots of useful information, and I learnt the following during my research:

1) Using WWW::Mechanize::Firefox to handle JavaScript

2) Using XPath to wait until a certain element has been generated

3) Using the LiveHTTPHeaders Firefox plugin to study the messages being sent


The following is the problem I encountered, and I would appreciate it if anyone could give me some direction:

1) I am trying to retrieve a month of price data from a low-cost airline:

http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS


2) After loading the page, there is a form to construct your query. Say I choose the following:

"One way", from Osaka - Kansai to Hong Kong, and then also tick "Low Fare Calendar" to show the data for the whole month


3) After clicking on "Search", I noticed that the URL remains unchanged. So, after some research, I learnt to use LiveHTTPHeaders to see what is going on behind the scenes. The extracted info is as follows:

http://book.flypeach.com/WebService/B2cService.asmx/GetLowFareFinderMonth

POST /WebService/B2cService.asmx/GetLowFareFinderMonth HTTP/1.1
Host: book.flypeach.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
Referer: http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS
Content-Length: 296
Cookie: __utma=134592366.821879741.1427710303.1428247408.1428251315.9; __utmz=134592366.1427710303.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _ga=GA1.2.821879741.1427710303; SERVERID=book6; ASP.NET_SessionId=ciq5gzqapgonocmls3oghw45; __utmc=134592366; __utmb=134592366.1.10.1428251315; __utmt=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}

HTTP/1.1 200 OK

So now I know the following should be the key info to solve my problem:

1) The link for querying the data is http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth

2) This website makes use of JSON:

{"strFromAirport":"KIX","strToAirport":"HKG","departMonth":"20150606","returnMonth":"20150606","iOneWay":"true","iAdult":2,"iChild":0,"iInfant":0,"BoardingClass":"","CurrencyCode":"JPY","strPromoCode":"","SearchType":"FARE","iOther":0,"otherType":"","strIpAddress":"","strCurrentDate":"20150406"}
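
A body like the one above does not have to be typed by hand. As a minimal sketch, it could be generated from a Perl hash with the core JSON::PP module. The field names and values are copied from the captured request; the helper name build_fare_query and its arguments (from, to, month, today) are hypothetical, and the meaning I attach to each field is only my guess.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: returns the query as a hash reference.
# Field names come from the captured request; which fields the
# server really requires is an assumption.
sub build_fare_query {
    my (%args) = @_;
    return {
        strFromAirport => $args{from},
        strToAirport   => $args{to},
        departMonth    => $args{month},
        returnMonth    => $args{month},
        iOneWay        => 'true',        # a string in the capture, not a JSON boolean
        iAdult         => 2,
        iChild         => 0,
        iInfant        => 0,
        BoardingClass  => '',
        CurrencyCode   => 'JPY',
        strPromoCode   => '',
        SearchType     => 'FARE',
        iOther         => 0,
        otherType      => '',
        strIpAddress   => '',
        strCurrentDate => $args{today},
    };
}

# canonical() sorts the keys, so the key order differs from the capture;
# JSON objects are unordered, so this should not matter to the server.
my $json = JSON::PP->new->canonical->encode(
    build_fare_query(
        from  => 'KIX',
        to    => 'HKG',
        month => '20150606',
        today => '20150406',
    )
);

print "$json\n";
```

Dates are passed as strings so that they are encoded with quotes, as in the capture, while the passenger counts stay numeric.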

So I tried to write my Perl script as follows:

#!/usr/bin/perl
#
use warnings;
use strict;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    tab => 'current',
);
$mech->post('http://book.flypeach.com/WebService/B2cService.asmx/SearchLowFareSingleMonth');

my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@id="dvAvailabilitySearch"]' )) {
    sleep 1;
};
die "Timeout" if 0 > $retries;

# Now the element exists
#$mech->click({xpath => '//*[@id="ctl00_dvOutwardResult"]'});
print $mech->content;

Sorry for taking so long to get to my questions:

1) Can I simulate the HTTP headers that I captured in order to retrieve the data?

2) I tried to use $mech->add_header. However, the Content-Type becomes "application/x-www-form-urlencoded" instead of "application/json", and the request ends up with a bad HTTP header.

3) I searched the WWW::Mechanize and WWW::Mechanize::Firefox documentation, cookbook and examples, as well as Stack Overflow, but I am still not sure how to add the JSON data to my request. Can anyone show me how to do this?

I believe I am pretty close to getting what I need, but I cannot work out how to write the Perl script with WWW::Mechanize::Firefox for the required JSON/AJAX request.

Thanks in advance for the help
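
For illustration only, here is a rough sketch of how the captured request might be replayed without a browser. It swaps WWW::Mechanize::Firefox for the core HTTP::Tiny module, so it is not an answer to the add_header question. Several things are assumptions: the endpoint is taken from the captured request line (the text above also mentions SearchLowFareSingleMonth), the helper json_post_options is hypothetical, and the server may additionally insist on the session cookie from the capture, which this sketch does not send. The script only contacts the server when run with --send; otherwise it prints what it would post.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP;

# Assumption: endpoint as shown in the captured request line.
my $url     = 'http://book.flypeach.com/WebService/B2cService.asmx/GetLowFareFinderMonth';
my $referer = 'http://book.flypeach.com/default.aspx?langculture=en-US&ao=B2CENUS';

# Payload copied from the captured request body.
my %query = (
    strFromAirport => 'KIX',      strToAirport   => 'HKG',
    departMonth    => '20150606', returnMonth    => '20150606',
    iOneWay        => 'true',     iAdult         => 2,
    iChild         => 0,          iInfant        => 0,
    BoardingClass  => '',         CurrencyCode   => 'JPY',
    strPromoCode   => '',         SearchType     => 'FARE',
    iOther         => 0,          otherType      => '',
    strIpAddress   => '',         strCurrentDate => '20150406',
);

# Hypothetical helper: bundles headers and an encoded JSON body in the
# hash layout that HTTP::Tiny's request methods accept.
sub json_post_options {
    my ($data, $ref) = @_;
    return {
        headers => {
            'Content-Type' => 'application/json; charset=utf-8',
            'Referer'      => $ref,
        },
        # utf8() yields bytes, which is what goes on the wire.
        content => JSON::PP->new->utf8->canonical->encode($data),
    };
}

my $options = json_post_options(\%query, $referer);

if (@ARGV && $ARGV[0] eq '--send') {
    # HTTP::Tiny adds Content-Length itself for a scalar body.
    my $res = HTTP::Tiny->new(timeout => 30)->post($url, $options);
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    print $res->{content}, "\n";
}
else {
    print "Would POST to $url:\n$options->{content}\n";
}
```

Because the Content-Type is passed explicitly together with a ready-made body, nothing in this sketch re-encodes the data as a form submission.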


In reply to Web page scraping from AJAX page with POST JSON data by ronstudio
