Hello, gentlemen. I work for a company that scrapes websites for specific data. Our primary search engines are Perl scripts. For the most part I understand and can write Perl; however, I am fairly new to the language (SEP2012). My current dilemma involves (I believe) cookies.

The original site that I visited was www.profilhotels.se, and I discovered a shortcut that gets me to the page with the data using a single URL. For example, http://online.techotel.dk/domino.aspx?hotelid=48640&lang=en&p_arr=20130322_0000&p_dep=20130323_0000&p_pax=2_0_0 gets me directly to the data for the Hotel Riddargatan in Stockholm. I can clearly pick out the variable information (i.e. hotelid, lang, p_arr, p_dep, p_pax), which allows one to select any hotel, date, language, and occupancy as long as you know the hotelid. This is what most of our scripts do: we have one URL that we populate with specific parameters, and then we open that URL and begin scraping.

I have tried this manually in my browser, and even in a fresh browser (no possibility of cookies), and when I view the source, the data is present. However, when I attempt the same feat using the Perl script below, I do not get the same HTML page content; the data is not present, so the code never finds anything. In brief, H-E-L-P....

use strict;
use WWW::Mechanize;
use Date::Calc qw(Add_Delta_Days);
use HTTP::Cookies;
use Encode;
#---------------------------------------------------------------#
my $SiteID = 1475;
my $commandline ='';
while (@ARGV)
{
    $commandline = "$commandline"." ".shift(@ARGV);
}
my @commargs;
@commargs = split /\+/, $commandline;
my $requestqid = $commargs[0];
my $arrive = $commargs[1];
my $hotelid = $commargs[2];
my $los = $commargs[3];
my $debug = $commargs[4];
my $city = $commargs[6];
my $resultName = $commargs[9];
my $propertyid = $commargs[10];
my $currency = $commargs[16];
my $htmlpath = $commargs[20];
my $ratespath = $commargs[21];
my $occupancy = $commargs[23];
#---------------------------------------------------------------------------#
#-------------------######## SET DEFAULTS ########--------------------#
#---------------------------------------------------------------------------#
my $outStr ='';   ## This will contain the data for output file
my $htmlPage =''; ## This will keep the response from the websites
my $depart ='';
my (undef,undef,undef,$day,$mon,$year,undef,undef,undef) = localtime(time);
$year+=1900;
$mon+=1;
$day+=1;
my $tomorrow =$day."/".$mon."/".$year;
my $out ="$ratespath/OUT"."$requestqid".".txt";
my $outHTML ="$htmlpath/HTML"."$requestqid".".html";
$requestqid =~ s/ //g;
if($occupancy eq undef || $occupancy eq '') { $occupancy = '2'; }
if($currency eq undef || $currency eq '') { $currency = 'USD'; }
if($arrive eq undef || $arrive eq '') { $arrive = $tomorrow; }
formatDates ('YYYYMMDD');
#---------------------------------------------------------------#
my $agent = WWW::Mechanize->new();
$agent->timeout(240);
#$agent->cookie_jar(HTTP::Cookies->new(file => "$htmlpath/$requestqid"."_lwpcookies.txt",autosave => 1));
#print "Setting browser alias...\n";
#my $browser_al = setbrowser();
#print "(1) Using brower Alias: $browser_al\n";
#$agent->agent_alias($browser_al);
my $url="http://online.techotel.dk/";
my $response='';
#---------------------------------------------------------------#
if(validate())
{
    if(getHomePage($url))
    {
        parseData($htmlPage);
    }
}
$outStr.="+++EOF+++";
writeToFile($out,$outStr);
undef($agent);
exit;
#---------------------------------------------------------------#
#                                                               #
#                          Functions                            #
#                                                               #
#---------------------------------------------------------------#
sub validate
{
    $propertyid=~s/^\s+|\s+$//isg;
    if(dateCompare($arrive,$tomorrow,"YYYYMMDD") < 0)
    {
        throwError("Start date ($arrive) is less than current date ($tomorrow)- ABORTED.",'EMSG_MI2');
        return 0;
    }
    if($propertyid eq '')
    {
        throwError('Property ID Field Missing - ABORTED.','XSTP_PRP');
        return 0;
    }
    if ($los eq undef || $los eq '')
    {
        throwError('LOS Field Missing - ABORTED.','EMSG_MI2');
        return 0;
    }
    return 1;
}
#---------------------------------------------------------------#
sub formatDates
{
    my ($FMT) = @_;
    my($amon, $aday, $ayear) = ($arrive =~ /(\d+)\/(\d+)\/(\d+)/);
    my($dyear, $dmon, $dday) = Add_Delta_Days($ayear,$amon,$aday, $los);
    if($FMT eq "MM/DD/YYYY")
    {
        $arrive = substr('00'.int($amon),-2).'/'.substr('00'.int($aday),-2).'/'.substr('0000'.int($ayear),-4);
        $depart = substr('00'.int($dmon),-2).'/'.substr('00'.int($dday),-2).'/'.substr('0000'.int($dyear),-4);
    }
    elsif($FMT eq "DD/MM/YYYY")
    {
        $arrive = substr('00'.int($aday),-2).'/'.substr('00'.int($amon),-2).'/'.substr('0000'.int($ayear),-4);
        $depart = substr('00'.int($dday),-2).'/'.substr('00'.int($dmon),-2).'/'.substr('0000'.int($dyear),-4);
    }
    elsif($FMT eq "YYYYMMDD")
    {
        $arrive = substr('0000'.int($ayear),-4).substr('00'.int($amon),-2).substr('00'.int($aday),-2);
        $depart = substr('0000'.int($dyear),-4).substr('00'.int($dmon),-2).substr('00'.int($dday),-2);
    }
    print "FORMAT: $FMT\n";
    print "ARRIVE: $arrive\n";
    print "DEPART: $depart\n";
}
#---------------------------------------------------------------#
sub dateCompare
{
    my ($d1,$d2,$FMT) = @_;
    $d1=~s/\s//g;
    $d2=~s/\s//g;
    if($FMT = "YYYYMMDD")
    {
        if(int($d1)<int($d2)) { return -1; }
    }
    return 0;
}
#---------------------------------------------------------------#
sub getHomePage
{
    my ($url)=@_;
    $url = $url.'domino.aspx?'.'hotelid='.$propertyid.'&lang=en&p_arr='.$arrive.'&p_dep='.$depart.'&p_pax='.$occupancy.'_0_0';
    print "\nGetting home page--->URL: $url \n";
    $response=$agent->get($url);
    sleep(15);
    if($response->is_success)
    {
        $htmlPage = $agent->{content};
        writeToFile($outHTML,$htmlPage,1);
        return 1;
    }
    else
    {
        throwError('Could not get correct response to get home page.','EMSG_MI5');
        return 0;
    }
}
#---------------------------------------------------------------#
sub finalMessage
{
    my ($descript,$rate,$cur,$status)=@_;
    $outStr.="$requestqid\::$SiteID\::$hotelid\::$arrive\::$los\::$descript\::$rate\::$cur\::$status\n";
}
#---------------------------------------------------------------#
sub throwError
{
    my ($errorStr,$statusType)=@_;
    if($statusType=~/EMSG_MI5/is)
    {
        print " Navigation to site failed.\n";
        print $url.' not responding... ABORT SCRIPT!\n';
        $outStr.="$requestqid\::$SiteID\::$hotelid\::$arrive\::$los\::$errorStr\::0.00\::XXX\::$statusType\n";
    }
    elsif($statusType=~/SOLD_OUT/is)
    {
        print " Sold out!\n";
        $outStr.="$requestqid\::$SiteID\::$hotelid\::$arrive\::$los\::$errorStr\::\::XXX\::$statusType\n";
    }
    else
    {
        print "\n $errorStr \n\n";
        $outStr.="$requestqid\::$SiteID\::$hotelid\::$arrive\::$los\::$errorStr\::0.00\::XXX\::$statusType\n";
    }
    return 0;
}
#---------------------------------------------------------------#
sub writeToFile
{
    my ($fileName,$content,$htmlFlag)=@_;
    if($htmlFlag==1)
    {
        open OUT, ">:utf8", $fileName or die "Cannot open $fileName for write :$!";
        print OUT "$content";
        close OUT;
    }
    else
    {
        open OUT, ">$fileName" or die "Cannot open $fileName for write :$!";
        print OUT "$content";
        close OUT;
    }
}
#---------------------------------------------------------------#
sub parseData
{
    my ($data) = @_;
    my $flg = 0;
    #print "$data\n";
    while(length($data)>0)
    {
        $flg = 0;
        print length($data);
        print "\n";
        my ($room,$curr,$rate,$descript) = ("","","","");
        if($data=~/dominoroomtypeprice(.*)/is)#[.][\>]([.*]):\<br /is)
        {
            print "found room\n";
            $room = $1;
            $data = $';
            $flg = 1;
            if($data=~/roomtypeprice[.][\>]([.]{3}).([.*])\</is)
            {
                print "found rate\n";
                $curr = $1;
                $rate = $2;
                $data = $';
                $flg = 2;
                if($data=~/RoomtypedesctextLbl_1[.][\>]([.*])\</is)
                {
                    print "found desc\n";
                    $descript = $1;
                    $data = $';
                    $flg = 4;
                }
            }
        }
        if($flg>0)
        {
            print "$room,$curr,$rate,$descript\n";
            finalMessage ($room,$curr,$rate,$descript);
        }
    }
}
sub setbrowser
{
    my $range = 2;
    my @browser_alias = ('Windows Mozilla','Mac Safari');
    my $random_number = int(rand($range + 1));
    my $browser = $browser_alias[$random_number-1];
    {return $browser};
}
#----------------------------------++EOF++----------------------------------#
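
For comparison, the URL pattern described in the post can be sketched as a small standalone helper. The name build_techotel_url and its argument names are hypothetical, not part of the posted script, and the parameter meanings are only inferred from the example URL. One observable difference: the example URL carries a _0000 suffix on p_arr and p_dep, while the posted getHomePage concatenates the bare YYYYMMDD dates. Nothing in the post confirms whether the site depends on that suffix, so treat it as something to test rather than a diagnosis.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the posted script: assembles the
# single "short-cut" URL from the parameters named in the post.
sub build_techotel_url {
    my (%arg) = @_;
    my $lang = defined $arg{lang} ? $arg{lang} : 'en';

    # Assumption: dates are YYYYMMDD and the "_0000" suffix seen in the
    # example URL is copied verbatim; its meaning is not documented here.
    my @query = (
        "hotelid=$arg{hotelid}",
        "lang=$lang",
        "p_arr=$arg{arrive}_0000",
        "p_dep=$arg{depart}_0000",
        "p_pax=$arg{occupancy}_0_0",
    );
    return 'http://online.techotel.dk/domino.aspx?' . join('&', @query);
}

# Usage: rebuilds the Hotel Riddargatan example from the post.
print build_techotel_url(
    hotelid   => 48640,
    arrive    => '20130322',
    depart    => '20130323',
    occupancy => 2,
), "\n";
```

Printing the result and comparing it character by character with the URL that works in the browser is a quick way to rule out a URL mismatch before looking at cookies or request headers.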

In reply to Cookie Help by prosetto@msn.com
