parser has asked for the wisdom of the Perl Monks concerning the following question:

I have been screen scraping for a few years with WWW::Mechanize and HTML::TokeParser and they have served me well. However, I recently encountered a set of pages which use complex table structures and numerous tab divs. I need a module (or methodology) which will allow me to search for sections of HTML in a more jQuery find()-like manner rather than simply consuming tokens from a stream of HTML.

I read through the post The State of Web spidering in Perl and, while helpful, its focus is more on spidering than scraping. I am interested in recommendations from the Monks on higher-order methods of finding constructs in HTML with Perl, beyond regular expressions and token parsing.

I read Mahmoud's jquery module on CPAN with interest, but it appears not to have been maintained since 2013 and I am uncertain it can query table structures. To be fair, jQuery itself is limited when querying unlabeled table structures as well.

Here is a small example of what I am trying to accomplish:
1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.
2) Slurp in every row in a named table and parse out the name value pairs.

Cheers!

Re: Screen scraping complex tables and divs (updated)
by LanX (Saint) on Oct 13, 2017 at 19:18 UTC
    I'm confused because the thread you linked to is already very good.

    You mostly use XPath or CSS selectors in live inspections (i.e. when you need a browser for JS), and as far as I remember WWW::Mechanize::Firefox and its various siblings support both.

    The alternative is mirroring the DOM into a Perl/XML data structure and using its query API (mostly XPath-like).
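
    As an illustration only, a minimal sketch of that approach using HTML::TreeBuilder::XPath (one possible backend among many; the URL and the table id below are invented):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::TreeBuilder::XPath;

        # fetch the page with any HTTP client
        my $resp = LWP::UserAgent->new->get('http://example.com/report.html');
        die $resp->status_line unless $resp->is_success;

        # mirror the HTML into a Perl tree and query it with XPath
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse_content($resp->decoded_content);

        # e.g. the 6th row of the table with id="results"
        my ($row) = $tree->findnodes('//table[@id="results"]//tr[6]')
            or die "row not found\n";
        my ($name, $value) = map { $_->as_trimmed_text }
                             $row->look_down(_tag => 'td');
        print "$name => $value\n";

        $tree->delete;    # free the tree when done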

    Maybe you should ask more precisely and show what you tried?

    update

    > 1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.

    > 2) Slurp in every row in a named table and parse out the name value pairs.

    See

    • $mech->xpath( $query, %options )
    and alternatively
    • $mech->selector( $css_selector, %options )
    Both methods support querying child elements of a given ID; a rough sketch follows below.
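
    Roughly, and untested here (it assumes a running Firefox with the MozRepl add-on; the URL and table id are invented):

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;

        my $mech = WWW::Mechanize::Firefox->new();
        $mech->get('http://example.com/report.html');

        # all rows below the table with the given id, via XPath ...
        my @rows = $mech->xpath('//table[@id="results"]//tr', all => 1);

        # ... or the same thing via a CSS selector
        my @same = $mech->selector('table#results tr', all => 1);

        for my $row (@rows) {
            # DOM properties are exposed through the remote-object proxy
            print $row->{textContent}, "\n";
        }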

    Query syntax is not a Perl question, but there are plenty of good tutorials online.

    Look out for browser features/addons that let you play around with queries.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Rolf,

      I am confused now too. Are you saying WWW::Mechanize supports CSS selectors and XPath? Or that WWW::Mechanize::Firefox does? If the latter, I have also read that it is very difficult to build.

      Query syntax is not a Perl question, but there are plenty of good tutorials online.

      I agree. However, determining how best to query HTML source via Perl is.

      The option of mirroring the DOM into a Perl/XML data structure and using the query API sounds quite good. I'll give that a go and see how it works. Anything is better than parsing table tags with TokeParser.
        WWW::Mechanize::Firefox does, and I took it as one example out of many because I have worked with it in the past.

        But it really depends on whether you need JS or not, so I don't want to go into details.

        Querying HTML was your question; something like XPath or CSS selectors is mostly the solution.

        Regarding the Perl backend: it depends.

        Sorry, there is no generic answer: TIMTOWTDI.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        PS:

        > > Look out for browser features/addons that let you play around with queries.

        I have had very good experiences using Firepath to find the right CSS selectors / XPath expressions inside Firefox.

        You can copy an auto-generated explicit expression by right-clicking on a DOM element and change it interactively.

        Simply copy the final path and/or selector into your Perl code afterwards.

        HTH! :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Screen scraping complex tables and divs
by marto (Cardinal) on Oct 13, 2017 at 20:56 UTC

    Mojolicious provides Mojo::DOM, which makes life much simpler if you can use CSS selectors. In this example I use Mojolicious to parse a page and download associated links. If you run into problems, post what you've tried and an example of the HTML you have to work with.
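
    For the two examples in the question, a minimal sketch with Mojo::UserAgent and Mojo::DOM might look like this (the URL and the table id are placeholders, and it assumes the name sits in the first cell of a row and the value in the second):

        use Mojo::Base -strict;
        use Mojo::UserAgent;

        my $ua  = Mojo::UserAgent->new;
        my $dom = $ua->get('http://example.com/report.html')->result->dom;

        # 1) the 6th and 9th rows of the table with id="results"
        for my $n (6, 9) {
            my $row = $dom->at("table#results tr:nth-child($n)") or next;
            my ($name, $value) = map { $_->all_text } @{ $row->find('td')->to_array };
            print "$name => $value\n";
        }

        # 2) every row of the same table
        $dom->find('table#results tr')->each(sub {
            my ($name, $value) = map { $_->all_text } @{ $_->find('td')->to_array };
            print "$name => $value\n" if defined $value;
        });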

      Thank you Marto,

      I will check out Mojolicious.

      Update: Mojo::DOM is perfect! It combines CSS selectors with HTML/XML DOM parsing and has eliminated about 60% of my existing code.

Re: Screen scraping complex tables and divs
by Anonymous Monk on Oct 13, 2017 at 22:33 UTC