cliff_t has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I am very new to scraping, and I am writing my first script to get some data from a website. What I am trying to do is navigate through all of a site's result pages (Page 1, Page 2, and so on). My problem is that I don't know how to find the number of the last page, so that I can do all the parsing in a loop. Any help or ideas would be much appreciated! I can tell you that I have counted 13 pages by hand. The data is information about people in a company. What happens is that every time I click the next page, a POST request is generated, and I noticed that only the POST parameters change; the URL remains the same for all the pages. So I was thinking of calling the POST method from a loop, passing all the POST arguments and changing only the value of __EVENTARGUMENT, which carries the number of the page.
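Under those assumptions, the postback loop the OP describes could be sketched with WWW::Mechanize like this. The URL, the control name (`GridView1`), and the `Page$N` argument format are all guesses; substitute the real values captured from the browser's network inspector:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Placeholder URL -- substitute the real one.
my $url = 'http://example.com/people.aspx';

my $mech = WWW::Mechanize->new();
$mech->get($url);    # page 1; this also picks up __VIEWSTATE etc.

for my $page (2 .. 13) {    # 13 pages counted by hand
    my $form = $mech->form_number(1);

    # ASP.NET renders the postback fields as hidden (readonly) inputs,
    # so unlock them before setting new values.
    $form->find_input($_)->readonly(0)
        for qw(__EVENTTARGET __EVENTARGUMENT);

    $mech->field( __EVENTTARGET   => 'GridView1' );     # assumed control name
    $mech->field( __EVENTARGUMENT => "Page\$$page" );   # common ASP.NET pager format
    $mech->submit;

    # ... parse $mech->content() for this page's rows here ...
}
```

Resubmitting the page's own form (rather than building the POST by hand) keeps __VIEWSTATE and the other hidden fields in sync automatically.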

Replies are listed 'Best First'.
Re: Navigating through pages using WWW::Mechanize
by LanX (Saint) on Sep 19, 2014 at 10:50 UTC
    > I think you know what I mean, 

    Nope... :)

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

Re: Navigating through pages using WWW::Mechanize
by GotToBTru (Prior) on Sep 19, 2014 at 13:04 UTC

    There is no "standard" for indicating page numbers that all websites must adhere to. A particular website might indicate it clearly via a header or something, but there is no guarantee. You will have to figure it out on your own by following links.

    1 Peter 4:10
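One heuristic for "figuring it out on your own" in the ASP.NET case: pagers usually render their links as javascript:__doPostBack('...','Page$N') hrefs, so the highest N in the first page's HTML is the last page. A sketch, assuming that href format (verify it against the actual page source):

```perl
use strict;
use warnings;

# Scan the HTML of page 1 for ASP.NET pager postback links and return
# the highest page number mentioned (assumes the 'Page$N' convention).
sub last_page_number {
    my ($html) = @_;
    my $max = 1;
    while ( $html =~ /__doPostBack\([^)]*'Page\$(\d+)'/g ) {
        $max = $1 if $1 > $max;
    }
    return $max;
}
```

With `$mech->content()` as input, the result bounds the paging loop instead of a hard-coded 13.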
Re: Navigating through pages using WWW::Mechanize
by Corion (Patriarch) on Sep 19, 2014 at 15:54 UTC

Also see HTML::AutoPagerize for a library that has some knowledge/heuristics about how to find the "current" page and the total number of pages.

Re: Navigating through pages using WWW::Mechanize
by Anonymous Monk on Sep 19, 2014 at 10:52 UTC

That depends very much on where you're getting the page numbers from. Are they in the HTML of the page, are they in the filename, etc.? If you could provide a small but representative sample of input so we can get an idea, that would be very helpful.

    Please also see here for information on how to help us help you.

Re: Navigating through pages using WWW::Mechanize
by locked_user sundialsvc4 (Abbot) on Sep 19, 2014 at 11:56 UTC

    Generally, web-pages of interest will contain one or more hyperlinks ... text strings of the form:

    <a href="somewhere_of_interest">some visible tag</a>

... which will appear somewhere in the HTML text that you retrieve in response to each request that you make. Your program’s task, then, is to retrieve a page and scan its content for hyperlinks “of interest to you” whose targets (“somewhere of interest ...”) you realize you haven’t visited yet. You add these targets to your program’s to-do list and continue until that list is finally exhausted.
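The to-do-list idea above can be sketched as a small queue-driven crawl with WWW::Mechanize. The start URL and the on-site filter are placeholders:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Queue-driven crawl: fetch a page, queue every same-site link not yet
# seen, and stop when the queue is empty.
my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on errors
my @todo = ('http://example.com/page1.html');       # placeholder start URL
my %seen;

while ( my $url = shift @todo ) {
    next if $seen{$url}++;          # skip anything already visited
    $mech->get($url);
    next unless $mech->success;

    # ... extract this page's data from $mech->content() here ...

    for my $link ( $mech->links ) {
        my $abs = $link->url_abs->as_string;
        push @todo, $abs
            if !$seen{$abs}
            && $abs =~ m{^http://example\.com/};    # stay on-site (assumed prefix)
    }
}
```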

On the other hand, sometimes a web-page will bury its logic in JavaScript: an onClick handler, say, creates the URL and sends the browser to it. In that case, the simplest approach might be to pick apart what the URLs look like and to loop over what they could be, until you get 404s.
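That "loop until 404" approach might look like this; the query-string pattern is purely hypothetical:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Guess the URL pattern and increment the page number until the server
# reports the page missing.
my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on 404
my $page = 1;
while (1) {
    $mech->get("http://example.com/list?page=$page");   # assumed pattern
    last if $mech->status == 404 || !$mech->success;

    # ... scrape this page ...
    $page++;
}
```

Note that some sites return an empty page with status 200 past the last page, so checking for an empty result set is a useful fallback to the 404 test.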

Actual source code is left as an exercise for the reader, or for another Monk with more time on his hands than I.