danielson has asked for the wisdom of the Perl Monks concerning the following question:

Hey folks, for the past few weeks I have been trying to extract a specific link from a webpage:
http://search.live.com/results.aspx?&mkt=en-us&scope=books#t=Ym0yV_awWwwdJ0lMJwJx5g&page=1&sq=
Right above the book page image there is a "next page" button that points to:
http://search.live.com/results.aspx?&mkt=en-us&scope=books#t=M-BkRr2dwOhxBBOAGBtqhg&page=1&sq=
This link is the only thing I'm interested in.

I tried to fetch the page with LWP::Simple and with LWP::UserAgent, and I always get something totally different from what Firefox or IE produce. Crucially, the one link I'm interested in is absent...
I know that JavaScript is not supported, but that doesn't seem to be the problem, as the link doesn't even appear in the JavaScript code (it's not there if I fetch all related pages with wget).
BUT, if I do "save webpage, complete" with Firefox, the link shows up both in the JavaScript file (livebooks2.js) and in the HTML page.
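
For reference, the fetch itself is basically just this (a simplified sketch; the user agent string is only an example):

  use LWP::UserAgent;

  my $url = 'http://search.live.com/results.aspx?&mkt=en-us&scope=books'
          . '#t=Ym0yV_awWwwdJ0lMJwJx5g&page=1&sq=';

  my $ua = LWP::UserAgent->new;
  $ua->agent('Mozilla/5.0');    # pretend to be a browser

  my $response = $ua->get($url);
  die $response->status_line unless $response->is_success;

  my $html = $response->decoded_content;

  # this never matches: the PAGENEXT anchor is simply not in $html
  print "found the next-page link\n" if $html =~ /id="PAGENEXT"/;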

Now I don't even know where to start... Any ideas on which way to go? How are the page's "frames" built? I suspect I'm downloading the page before they are even "built". Is there a way to let them appear before saving the page? Duh... this is confusing.

Daniel

PS: please note that I have permission to do this!

Replies are listed 'Best First'.
Re: Fetching a link of a webpage (harder than it seems)
by Erez (Priest) on May 10, 2008 at 11:23 UTC

    If you really want a good answer, we need to know what you are getting and what you expect to get.
    At this point I'll risk a guess: since you say you get different results than you get from a browser, it might be that you need to set your user agent to a browser's string:

    my $userAgent = LWP::UserAgent->new();
    $userAgent->agent('Mozilla/5.0');

    The default agent string used by LWP:: is sometimes blocked or redirected by some sites.

    Stop saying 'script'. Stop saying 'line-noise'.
    We have nothing to lose but our metaphors.

      I don't dare send you the fetched file, as it is 26k (but I can!), but it is similar to what you see in Firefox, except that the two frames are missing (the one on the left-hand side with the book details, and the one on the right-hand side with the page content and the "next page" button whose link I need). Hence there is actually nothing between the search box and the bottom line "* © 2008 Microsoft | * Privacy | * Legal" etc.

      The specific bit I need is this:
      <a id="PAGENEXT" href="#t=M-BkRr2dwOhxBBOAGBtqhg&amp;page=1&amp;sq=" title="Go to next page" class="button"><img src="http://books.live.com/s/books/page_next_normal.gif" class="icon navButtonIcon"></a>

      Thanks for your suggestion, but I had already set the user agent.
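
      Once the anchor actually is in the HTML (e.g. in the Firefox-saved copy), pulling out the href is the easy part; a minimal sketch (the file name is just a placeholder):

      use HTML::TreeBuilder;

      my $tree = HTML::TreeBuilder->new_from_file('saved_page.html');
      my $a    = $tree->look_down( _tag => 'a', id => 'PAGENEXT' );
      print $a->attr('href'), "\n" if $a;
      $tree->delete;
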
Re: Fetching a link of a webpage (harder than it seems)
by jethro (Monsignor) on May 10, 2008 at 12:18 UTC
    I would use WWW::Mechanize for something like this; it's a module made specifically for navigating web sites.
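
    Something along these lines, for example (just a sketch; since the link apparently gets injected by JavaScript it may still be missing from the fetched HTML, but Mechanize at least makes looking for it easy):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0' );
    $mech->get('http://search.live.com/results.aspx?&mkt=en-us&scope=books');

    # the "next page" anchor has an href of the form "#t=...&page=...&sq="
    my $next = $mech->find_link( url_regex => qr/^#t=.+&page=/ );

    if ($next) {
        print "next page link: ", $next->url, "\n";
    }
    else {
        print "the PAGENEXT link is not in the fetched HTML\n";
    }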

    Also, you might use a packet sniffer to find out the difference between what Firefox sends to the web server and what your script sends. There are probably also tools higher up in the protocol stack that can log the communication.

Re: Fetching a link of a webpage (harder than it seems)
by Gangabass (Vicar) on May 10, 2008 at 14:09 UTC

    This is Javascript magic.

    You can install the Firefox extension Live HTTP Headers to see what requests your browser is sending (and the responses, too).

    And of course you can emulate this JavaScript request in Perl.

      Exactly. But be prepared for the website to change the JavaScript slightly on each invocation (especially in the form of keys for a sort of challenge-response protocol), which you would then have to extract from the code.
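
      A skeleton of such an emulation might look like this (the AJAX endpoint, the extra headers and the token-extraction regex below are all placeholders; you have to read the real ones off Live HTTP Headers and the page source):

      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

      # 1. fetch the normal results page and pull the t=... token out of
      #    the HTML or the .js it references (this pattern is a guess)
      my $page = $ua->get('http://search.live.com/results.aspx?&mkt=en-us&scope=books');
      my ($token) = $page->decoded_content =~ /t=([\w-]+)/;

      # 2. replay the request the JavaScript makes, exactly as Live HTTP
      #    Headers shows it (URL and headers here are placeholders)
      my $xhr = $ua->get(
          "http://search.live.com/SOME/AJAX/ENDPOINT?t=$token&page=1",
          'Referer'          => 'http://search.live.com/results.aspx',
          'X-Requested-With' => 'XMLHttpRequest',
      );

      print $xhr->decoded_content;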