parser has asked for the wisdom of the Perl Monks concerning the following question:

I have been screen scraping for a few years with WWW::Mechanize and HTML::TokeParser and they have served me well. However, I recently encountered a set of pages which use complex table structures and numerous tab divs. I need a module (or methodology) which will allow me to search for sections of HTML in a more jQuery find()-like manner rather than simply consuming tokens from a stream of HTML.

I read through the post The State of Web spidering in Perl and, while helpful, its focus is more on spidering than scraping. I am interested in recommendations from the Monks on higher-order methods of finding constructs in HTML with Perl, beyond regular expressions and token parsing.

I read Mahmoud's jquery module on CPAN with interest, but it appears not to have been maintained since 2013 and I am uncertain it can query table structures. To be fair, jQuery itself is limited when querying unlabeled table structures as well.

Here is a small example of what I am trying to accomplish:
1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.
2) Slurp in every row in a named table and parse out the name value pairs.

Cheers!

Re: Screen scraping complex tables and divs (updated)
by LanX (Saint) on Oct 13, 2017 at 19:18 UTC
    I'm confused because the thread you linked to is already very good.

    You mostly use XPath or CSS selectors in live inspections (i.e. when you need a browser for JS), and as far as I remember WWW::Mechanize::Firefox and its various siblings support both.

    The alternative is mirroring the DOM into a Perl/XML data structure and using its query API (mostly XPath-like).
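
    As an illustration only, a minimal sketch of that approach using HTML::TreeBuilder::XPath (one possible backend among many; the URL and the table id below are invented):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::TreeBuilder::XPath;

        # fetch the page with any HTTP client
        my $resp = LWP::UserAgent->new->get('http://example.com/report.html');
        die $resp->status_line unless $resp->is_success;

        # mirror the HTML into a Perl tree and query it with XPath
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse_content($resp->decoded_content);

        # e.g. the 6th row of the table with id="results"
        my ($row) = $tree->findnodes('//table[@id="results"]//tr[6]')
            or die "row not found\n";
        my ($name, $value) = map { $_->as_trimmed_text }
                             $row->look_down(_tag => 'td');
        print "$name => $value\n";

        $tree->delete;    # free the tree when done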

    Maybe you should ask more precisely and show what you tried?

    update

    > 1) Find the 6th and 9th rows in a named table (given an id) and pull out the name and value pairs.

    > 2) Slurp in every row in a named table and parse out the name value pairs.

    See

    • $mech->xpath( $query, %options )
    and alternatively
    • $mech->selector( $css_selector, %options )
    Both methods support querying child elements of a given ID; a rough sketch follows below.
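
    Roughly, and untested here (it assumes a running Firefox with the MozRepl add-on; the URL and table id are invented):

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;

        my $mech = WWW::Mechanize::Firefox->new();
        $mech->get('http://example.com/report.html');

        # all rows below the table with the given id, via XPath ...
        my @rows = $mech->xpath('//table[@id="results"]//tr', all => 1);

        # ... or the same thing via a CSS selector
        my @same = $mech->selector('table#results tr', all => 1);

        for my $row (@rows) {
            # DOM properties are exposed through the remote-object proxy
            print $row->{textContent}, "\n";
        }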

    Query syntax is not a Perl question, but there are plenty of good tutorials online.

    Look out for browser features/addons that let you play around with queries.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Rolf,

      I am confused now too. Are you saying WWW::Mechanize supports CSS selectors and XPath? Or that WWW::Mechanize::Firefox does? If the latter, I have also read that it is very difficult to build.

      Query syntax is not a Perl question, but there are plenty of good tutorials online.

      I agree. However, determining how best to query HTML source via Perl is.

      The option of mirroring the DOM into a Perl/XML data structure and using the query API sounds quite good. I'll give that a go and see how it works. Anything is better than parsing table tags with TokeParser.
        WWW::Mechanize::Firefox does, and I took it as one example out of many because I have worked with it in the past.

        But it really depends on whether you need JS or not, so I don't want to go into details.

        Querying HTML was your question; something like XPath or CSS selectors is mostly the solution.

        Regarding the Perl backend: it depends.

        Sorry, there is no generic answer: TIMTOWTDI.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        PS:

        > > Look out for browser features/addons that let you play around with queries.

        I have had very good experiences using Firepath to find the right CSS selectors / XPath expressions inside Firefox.

        You can copy an auto-generated explicit expression by right-clicking on a DOM element and change it interactively.

        Simply copy the final path and/or selector into your Perl code afterwards.

        HTH! :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Screen scraping complex tables and divs
by marto (Cardinal) on Oct 13, 2017 at 20:56 UTC

    Mojolicious provides Mojo::DOM, which makes life much simpler if you can use CSS selectors. In this example I use Mojolicious to parse a page and download associated links. If you run into problems, post what you've tried and an example of the HTML you have to work with.
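
    For the two examples in the question, a minimal sketch with Mojo::UserAgent and Mojo::DOM might look like this (the URL and the table id are placeholders, and it assumes the name sits in the first cell of a row and the value in the second):

        use Mojo::Base -strict;
        use Mojo::UserAgent;

        my $ua  = Mojo::UserAgent->new;
        my $dom = $ua->get('http://example.com/report.html')->result->dom;

        # 1) the 6th and 9th rows of the table with id="results"
        for my $n (6, 9) {
            my $row = $dom->at("table#results tr:nth-child($n)") or next;
            my ($name, $value) = map { $_->all_text } @{ $row->find('td')->to_array };
            print "$name => $value\n";
        }

        # 2) every row of the same table
        $dom->find('table#results tr')->each(sub {
            my ($name, $value) = map { $_->all_text } @{ $_->find('td')->to_array };
            print "$name => $value\n" if defined $value;
        });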

      Thank you Marto,

      I will check out Mojolicious.

      Update: Mojo::DOM is perfect! It combines CSS selectors with HTML/XML DOM parsing and has eliminated about 60% of my existing code.

Re: Screen scraping complex tables and divs
by Anonymous Monk on Oct 13, 2017 at 22:33 UTC