PerlMonks  

WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR

by ait (Hermit)
on Nov 25, 2022 at 12:43 UTC (#11148377, perlquestion)

ait has asked for the wisdom of the Perl Monks concerning the following question:

Hello ye monks!

UPDATE: the time seems to be proportional to the number of rows. The 2-3 second timing is for a table with 450 rows and 12 columns.
On a table with 120 rows, it's less than a second. Speculating: it would seem that relative to node xpath has some bug and may be scanning the whole table each time.

Anybody out there have any clue why this takes 2 to 3 seconds:

my @cells = $mech->xpath('.//td', node => $rows[$row_index]);

I posted the issue here: https://github.com/libwww-perl/WWW-Mechanize/issues/362

But answers and wisdom usually come faster in the Monastery ;-)

If anyone knows a cheaper way to get the TDs of a TR using WWW::Mechanize::Chrome, or has any other suggestion, please do tell!

TIA!

--
Alex

Replies are listed 'Best First'.
Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by marto (Cardinal) on Nov 25, 2022 at 13:08 UTC

    "Speculating: it would seem that relative to node xpath has some bug and may be scanning the whole table each time." Did you test this hypothesis? Do you have an example URL? If you don't need JavaScript you could benchmark alternatives such as Mojo::UserAgent.

      Here is a simple timing code to replicate the issue.

      I couldn't find any large tables in public websites but I found one in Wikipedia with 162 rows that illustrates the problem.
      If you find one with 400+ you'll see it takes 3-4 seconds for obtaining the TDs of a TR.

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use feature qw(say);
      no warnings qw(experimental);
      use Data::Dumper;    # needed for the Dumper() call at the end
      use Log::Log4perl qw(:easy);
      use WWW::Mechanize::Chrome;
      use Time::HiRes qw( gettimeofday tv_interval );

      my $debug = 0;
      my ($t0, $elapsed);

      Log::Log4perl->easy_init($ERROR);
      my $mech = WWW::Mechanize::Chrome->new(
          headless  => 0,
          autodie   => 0,
          autoclose => 0
      );

      $mech->get('https://meta.wikimedia.org/wiki/Wikipedia_article_depth');
      sleep(2);

      my @nodes = $mech->xpath('//table');

      # time the lookup of all rows of the fourth table
      $t0 = [gettimeofday];
      my @rows = $mech->xpath('.//tr', node => $nodes[3]);
      say 'xpath for TR took:' . tv_interval($t0);

      my @cell_keys  = ();
      my @table_data = ();

      say "Timing for $#rows rows.";
      foreach my $row_index (0 .. $#rows) {
          my %row_data = ();
          # column names
          if ($row_index == 0) {
              $t0 = [gettimeofday];
              my @cells = $mech->xpath('.//th', node => $rows[$row_index]);
              say 'xpath for TH took:' . tv_interval($t0);
              foreach (0 .. $#cells) {
                  say "HEADER CELL: $_, VALUE:" . $cells[$_]->get_text() if $debug;
                  push @cell_keys, $cells[$_]->get_text();
              }
              if ($debug) {
                  say 'Column Names:';
                  say $_ foreach @cell_keys;
              }
          }
          # data row
          else {
              # this is the slow call: one relative xpath per row
              $t0 = [gettimeofday];
              my @cells = $mech->xpath('.//td', node => $rows[$row_index]);
              say 'xpath for TD took:' . tv_interval($t0);
              say "DATA ROW: $row_index" if $debug;
              foreach (0 .. $#cells) {
                  say "DATA CELL: $_, VALUE:" . $cells[$_]->get_text() if $debug;
                  $row_data{ $cell_keys[$_] } = $cells[$_]->get_text();
              }
              push @table_data, \%row_data;
              if ($debug) {
                  say 'Column Data:';
                  say $row_data{$_} foreach @cell_keys;
              }
          }
      }
      say Dumper(@table_data) if $debug;

      Here are the results:

      > If you don't need JavaScript

      Even if ...

      supposing communication overhead or an implementation loop are causing a bottleneck ...

      ... he could also try to fetch the whole table as html once using WWW::Mechanize::Chrome and do the parsing with Mojo::UserAgent

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        " ... he could also try to fetch the whole table as html once using WWW::Mechanize::Chrome and do the parsing with Mojo::UserAgent"

        I've used this workaround in the past for sites that need a special sign-in, or that bounce back clients they don't detect as a 'real' browser, purely so I don't have to make a lot of code changes :) As the location of the bottleneck is not yet understood, this may not resolve the performance issue.

        I will try this, thank you!

        I noticed that fetching the TRs of the table seems pretty fast with WWW::Mechanize::Chrome and xpath. What seems absurd is that fetching the TDs relative to a single TR takes so long, and that the time is proportional to the total number of TRs. That doesn't make any sense unless there's a bug somewhere in the WWW::Mechanize::Chrome xpath implementation.

      No, I haven't done more than simple measurements to pinpoint the delays in my own code. But because the delays in fetching the TDs in the context of a specific TR node are proportional (or maybe exponential) to the number of TRs, it seems obvious that there's either a bug or some intrinsic limitation in the way that xpath is implemented (e.g. re-parsing the whole page every time).

        Benchmarking within Chrome developer tools should point you in the right direction. If that is way faster than the Perl code, running your script in the debugger should quickly tell you whether this is a limitation or a bug in the Perl module.

Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by LanX (Sage) on Nov 26, 2022 at 13:58 UTC
    I had a quick glimpse into the docs of ->xpath

    and found these passages (I emphasized two parts):

      $mech->xpath( $query, %options )

      • my $link = $mech->xpath('//a[id="clickme"]', one => 1);
        # croaks if there is no link or more than one link found
      • my @para = $mech->xpath('//p');
        # Collects all paragraphs
      • my @para_text = $mech->xpath('//p/text()', type => $mech->xpathResult('STRING_TYPE'));
        # Collects all paragraphs as text
      ...
      • node - node relative to which the query is to be executed. Note that you will have to use a relative XPath expression as well. Use

        .//foo

        instead of

        //foo

        Querying relative to a node only works for restricting to children of the node, not for anything else. This is because we need to do the ancestor filtering ourselves instead of having a Chrome API for it.

    So, two insights into potential bottlenecks:

    • the module has to identify the parent itself, instead of assembling an xpath. Putting all into one path by yourself might be far more efficient (and probably your identifier is not as unambiguous as you thought)
    • you might get expensive wrapper objects for each result, unless you specify a type of text

    Of course this is all speculation as long as you can't provide an SSCCE ... :)

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

      After adding HTML::Tree and parsing some stuff in pure Perl land I think that IS actually the right approach:

      1. Use W::M::Chrome for JS rendering, JS interactions and high-level xpath
      2. Slurp HTML chunks and process in the Perl side as much as possible
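A minimal sketch of step 2, assuming the table's HTML has already been pulled out of the browser as a string. The helper name `table_to_records` and the sample HTML are hypothetical, and the sketch assumes the first row containing TH cells holds the column names and that there are no nested tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: turn one <table> HTML string into an array ref of
# hashes keyed by column name. Assumes the first row that contains <th>
# cells is the header row, and that the table has no nested tables.
sub table_to_records {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my (@colnames, @records);

    for my $tr ($tree->look_down(_tag => 'tr')) {
        my @th = $tr->look_down(_tag => 'th');
        if (@th && !@colnames) {
            @colnames = map { $_->as_trimmed_text } @th;
            next;
        }
        my @td = $tr->look_down(_tag => 'td');
        next unless @td;

        my %record;
        for my $i (0 .. $#colnames) {
            # a short row yields empty strings for the missing cells
            $record{ $colnames[$i] } =
                defined $td[$i] ? $td[$i]->as_trimmed_text : '';
        }
        push @records, \%record;
    }
    $tree->delete;    # free the parse tree explicitly
    return \@records;
}

# Made-up sample input, standing in for the slurped table HTML
my $sample = '<table><tr><th>Lang</th><th>Depth</th></tr>'
           . '<tr><td>en</td><td>1</td></tr>'
           . '<tr><td>de</td><td></td></tr></table>';
my $records = table_to_records($sample);
print scalar(@$records), " data rows parsed\n";
```

All parsing happens in the Perl process, so the cost per row no longer depends on round trips to the browser.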

        That's one approach.

        But as I said I think putting the logic into a more elaborate xpath to do the heavy lifting inside the browser would fix your performance issue without needing HTML::Tree

        IMHO your code will force the Perl part of W:M:C to do a lot of its own filtering and create thousands of proxy objects. These Perl objects will also tunnel requests back and forth to the browser for most method calls.

        Hence many potential bottlenecks.

        update

        as an illustration, this xpath in chrome's dev console for https://meta.wikimedia.org/wiki/Wikipedia_article_depth returns 1016 strings at once

        //table[3]//tr//td//text()

        Disclaimer: I don't have W:M:C installed and my xpath foo is rusted, so I'm pretty sure there are even better ways to do it.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery
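A hedged sketch of the "do the heavy lifting inside the browser" idea. It swaps the XPath query for plain DOM calls and returns every cell of one table in a single round trip. The helper names `table_js` and `fetch_table_rows` are hypothetical. It relies on `eval_in_page`, which WWW::Mechanize::Chrome documents as returning a value/type pair, and it assumes the JSON string comes back to Perl unchanged.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: build a JavaScript expression that serialises one
# table into a JSON string (array of rows, each an array of cell texts).
# $table_index is zero-based into document.querySelectorAll('table').
sub table_js {
    my ($table_index) = @_;
    $table_index = int $table_index;
    return <<"JS";
(function () {
  var table = document.querySelectorAll('table')[${table_index}];
  if (!table) { return '[]'; }
  return JSON.stringify(Array.from(table.rows, function (tr) {
    return Array.from(tr.cells, function (cell) {
      return cell.textContent.trim();
    });
  }));
})()
JS
}

# Hypothetical helper: one browser round trip for the whole table.
# $mech is any object with an eval_in_page method, such as a
# WWW::Mechanize::Chrome instance after ->get(...).
sub fetch_table_rows {
    my ($mech, $table_index) = @_;
    my ($json, $type) = $mech->eval_in_page( table_js($table_index) );
    return [] unless defined $json;
    return JSON::PP->new->decode($json);
}

# Usage, assuming $mech already holds the loaded page:
#   my $rows   = fetch_table_rows($mech, 3);
#   my $header = shift @$rows;    # first row holds the TH texts
```

Header and data cells both arrive as plain strings, so no per-cell proxy objects are created on the Perl side.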

Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by ait (Hermit) on Nov 25, 2022 at 21:18 UTC

    Thank you all, as always, for your valuable input and ideas! Ye monks are a smart bunch.

    As much as I'd love to help debug W::M::Chrome, I have a short deadline so I decided to use LanX's idea to use xpath to get the table node and the HTML content and then parse that in Perl land. I decided to use HTML::Tree which is simple and tried.

    For anyone having a similar issue, here is the code I wrote for this (assuming it has thead, th, and tbody, YMMV):

    use feature qw(signatures);
    no warnings qw(experimental::signatures);
    use HTML::TreeBuilder;

    my @nodes = $mech->xpath('//table');
    my $data  = parse_table($nodes[0]);    # array ref of row hashes

    sub parse_table ($table_node) {
        my $root = HTML::TreeBuilder->new_from_content(
            $table_node->get_attribute('outerHTML'));
        my @tparts   = $root->find_by_tag_name('table')->content_list;
        my @colnames = ();
        my @data;
        foreach my $tpart (@tparts) {
            if ($tpart->tag eq 'thead') {
                my @rows = $tpart->content_list;
                foreach my $row (@rows) {
                    if ($row->tag eq 'tr') {
                        my @cells = $row->content_list;
                        # assumes no TH is empty (see below safeguard for data cells)
                        foreach (@cells) {
                            push @colnames, $_->content->[0];
                        }
                    }
                }
            }
            elsif ($tpart->tag eq 'tbody') {
                my @rows = $tpart->content_list;
                foreach my $row (@rows) {
                    my %row_data = ();
                    if ($row->tag eq 'tr') {
                        my @cells = $row->content_list;
                        foreach my $cell (0 .. $#cells) {
                            # HTML::Element's content method weirdness:
                            # it returns undef for an element with no content
                            if ($cells[$cell]->content && scalar(@{ $cells[$cell]->content })) {
                                $row_data{ $colnames[$cell] } = $cells[$cell]->content->[0];
                            }
                            else {
                                $row_data{ $colnames[$cell] } = '';
                            }
                        }
                    }
                    push @data, \%row_data;
                }
            }
        }
        return \@data;
    }

    Thanks again y'all !
    --
    Alex

Node Type: perlquestion [id://11148377]
Approved by marto
Front-paged by Corion