Re: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR

by marto (Cardinal)
on Nov 25, 2022 at 13:08 UTC


in reply to WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR

"Speculating: it would seem that relative to node xpath has some bug and may be scanning the whole table each time." Did you test this hypothesis? Do you have an example URL? If you don't need JavaScript you could benchmark alternatives such as Mojo::UserAgent.

Re^2: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by LanX (Saint) on Nov 25, 2022 at 13:32 UTC
    > If you don't need JavaScript

    Even if ...

    supposing communication overhead or a loop in the implementation is causing the bottleneck ...

    ... he could also try to fetch the whole table as HTML once using WWW::Mechanize::Chrome and do the parsing with Mojo::DOM (the parser Mojo::UserAgent uses under the hood)
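
    A minimal sketch of that combination, assuming ait's already-connected $mech and the same fourth table: grab the rendered HTML once from the browser, then let Mojo::DOM do every lookup in-process:

        use Mojo::DOM;

        # One round-trip: pull the rendered HTML out of Chrome ...
        my $html = $mech->content;

        # ... then parse it locally; every cell lookup stays in-process.
        my $dom   = Mojo::DOM->new($html);
        my $table = $dom->find('table')->[3];
        for my $tr ($table->find('tr')->each) {
            my @cells = $tr->find('td')->map('all_text')->each;
            # process @cells without touching the browser again
        }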

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

      " ... he could also try to fetch the whole table as html once using WWW::Mechanize::Chrome and do the parsing with Mojo::UserAgent"

      I've used this workaround in the past for sites that need a special sign-in, or that bounce requests they don't detect as coming from a 'real' browser, purely so I don't have to make a lot of code changes :) As the location of the bottleneck is not yet understood, this may not resolve the performance issue.

        > As the location of the bottleneck is not yet understood this may not resolve the issue of performance.

        but it may help narrow down the underlying problem.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

      I will try this, thank you!

      I noticed that fetching the TRs of the table seems pretty fast with WWW::Mechanize::Chrome and XPath. What seems absurd is that fetching the TDs relative to a single TR takes so long, and that the time is proportional to the total number of TRs. That doesn't make any sense unless there's a bug somewhere in WWW::Mechanize::Chrome's XPath implementation.

        I can't look into it now, so here's some general advice:

        • Try the XPath inside the browser's dev console.
        • Try logging what Mechanize does under the hood.

        Back when I used W:M:FF, I was able (and sometimes needed) to send JS to the browser, eval it there, and fetch the result as JSON.
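
        A minimal sketch of that route with W:M:C, assuming ait's already-connected $mech and the fourth table from his script below, and assuming eval_in_page returns the value/type pair the way it did in W:M:FF:

            use JSON::PP qw(decode_json);

            # Run the whole XPath loop inside the page and ship the cell
            # texts back in a single call, instead of paying one
            # Perl<->Chrome round-trip per node.
            my $js = q{
                (function () {
                    var out  = [];
                    var rows = document.evaluate('//table[4]//tr', document, null,
                                   XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
                    for (var i = 0; i < rows.snapshotLength; i++) {
                        var cells = rows.snapshotItem(i).querySelectorAll('td');
                        out.push(Array.prototype.map.call(cells, function (c) {
                            return c.textContent;
                        }));
                    }
                    return JSON.stringify(out);
                })()
            };

            my ($json, $type) = $mech->eval_in_page($js);
            my $table_data    = decode_json($json);    # array of arrays of cell texts

        Collapsing the whole loop into one in-page call sidesteps the per-node round-trips entirely, which should also tell you whether the transport is the bottleneck.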

        All this will help you specify a feature request (if needed) for W:M:C.

        HTH :)

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

Re^2: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by ait (Hermit) on Nov 26, 2022 at 14:54 UTC

    Here is some simple timing code to replicate the issue.

    I couldn't find any large tables on public websites, but I found one on Wikimedia with 162 rows that illustrates the problem.
    If you find one with 400+ rows, you'll see it takes 3-4 seconds to obtain the TDs of a single TR.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature qw(say);
    no warnings qw(experimental);

    use Data::Dumper;    # needed for the Dumper() call at the end
    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;
    use Time::HiRes qw( gettimeofday tv_interval );

    my $debug = 0;
    my ($t0, $elapsed);

    Log::Log4perl->easy_init($ERROR);

    my $mech = WWW::Mechanize::Chrome->new(
        headless  => 0,
        autodie   => 0,
        autoclose => 0,
    );

    $mech->get('https://meta.wikimedia.org/wiki/Wikipedia_article_depth');
    sleep(2);

    my @nodes = $mech->xpath('//table');

    $t0 = [gettimeofday];
    my @rows = $mech->xpath('.//tr', node => $nodes[3]);
    say 'xpath for TR took:' . tv_interval($t0);

    my @cell_keys  = ();
    my @table_data = ();

    say "Timing for $#rows rows.";
    foreach my $row_index (0 .. $#rows) {
        my %row_data = ();

        # header row: collect the column names from the TH cells
        if ($row_index == 0) {
            $t0 = [gettimeofday];
            my @cells = $mech->xpath('.//th', node => $rows[$row_index]);
            say 'xpath for TH took:' . tv_interval($t0);
            foreach (0 .. $#cells) {
                say "HEADER CELL: $_, VALUE:" . $cells[$_]->get_text() if $debug;
                push @cell_keys, $cells[$_]->get_text();
            }
            if ($debug) {
                say 'Column Names:';
                say $_ foreach @cell_keys;
            }
        }
        # data row: time the TD lookup relative to this TR node
        else {
            $t0 = [gettimeofday];
            my @cells = $mech->xpath('.//td', node => $rows[$row_index]);
            say 'xpath for TD took:' . tv_interval($t0);
            say "DATA ROW: $row_index" if $debug;
            foreach (0 .. $#cells) {
                say "DATA CELL: $_, VALUE:" . $cells[$_]->get_text() if $debug;
                $row_data{ $cell_keys[$_] } = $cells[$_]->get_text();
            }
            push @table_data, \%row_data;
            if ($debug) {
                say 'Column Data:';
                say $row_data{$_} foreach @cell_keys;
            }
        }
    }

    say Dumper(@table_data) if $debug;

    Here are the results:

Re^2: WWW::Mechanize::Chrome VERY slow on xpath obtaining TDs of a TR
by ait (Hermit) on Nov 25, 2022 at 15:46 UTC

    No, I haven't done more than simple measurements to pinpoint the delays in my own code. But because the delay in fetching the TDs in the context of a specific TR node grows in proportion (or maybe exponentially) to the number of TRs, it seems obvious that there's either a bug or some intrinsic limitation in the way XPath is implemented (e.g. re-parsing the whole page every time).

      Benchmarking within Chrome's developer tools should point you in the right direction. If that is much faster than the Perl code, running your script in the debugger should quickly let you know whether this is a limitation or a bug in the Perl module.
