Timing web page download.

by Eyck (Priest)
on Jul 11, 2012 at 08:07 UTC ( [id://981041] )

Eyck has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to imitate a web browser downloading a page. I have an array containing all of the page's components, and then use WWW::Mechanize to do the downloading:
use WWW::Mechanize;
use Time::HiRes qw(time);

my $mech  = WWW::Mechanize->new;   # missing in the original snippet
my $links = [
    "http://web-page.to.download.to/",
    "http://static.to.download.to/background.jpg",
    "http://static.to.download.to/first.css",
    "http://www.google-analytics.com/ga.js",
    "http://static.ak.fbcdn.net/rsrc.php/v2/yl/r/6KM-54hh6R2.css",
];

my $start = time;
foreach (@$links) {
    $mech->get($_);
}
my $stop = time;

This works more or less the way I intended, but there are two problems. Since the list of links is dynamic, and partly created by JavaScript, I had to use a browser to create that list.

I need a way of parsing a web page and getting a list of all its components; this is my first problem.
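
For what it's worth, WWW::Mechanize alone can produce a partial list from the static HTML. A minimal sketch (using the same page URL as above) picks up the <link> and <img> resources, but it will never see anything that JavaScript injects later, and <script src=...> tags would still need a separate HTML parse:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get("http://web-page.to.download.to/");

# <link href=...> resources (stylesheets etc.); Mechanize's link list covers
# a, area, frame, iframe, link and meta refresh, but not <script src=...>
my @css = map { $_->url_abs } $mech->find_all_links( tag => 'link' );

# <img src=...> and <input type="image"> resources
my @img = map { $_->url_abs } $mech->images;

my @components = ( @css, @img );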

The other problem is that I'm serializing all of the downloads here. I should be doing something closer to what browsers do; maybe use 4 concurrent downloaders?

How can I emulate 4 concurrent downloading threads?

Replies are listed 'Best First'.
Re: Timing web page download.
by Anonymous Monk on Jul 11, 2012 at 08:32 UTC
Re: Timing web page download.
by tospo (Hermit) on Jul 11, 2012 at 08:25 UTC
    would it not be much easier to just use Firebug for that (provided you use Firefox)?
      I'm running this on ARM, MIPS, and small VMs on Intel/AMD and Power; it seems silly to me that I would be required to install X11 and Firefox just to be able to download a page and its components.

        it seems silly to me that I would be required to install X11 and Firefox just to be able to download a page and its components.

        wget, lwp-rget, dot dot dot

Re: Timing web page download.
by phatWares (Initiate) on Jul 11, 2012 at 12:07 UTC

    First off, I am NO monk. But I did run into this problem a few years ago when I was building a fusker. What I ultimately did was use Parallel::Simple to branch into 4 threads, with each thread using WWW::Mechanize to do the work. Since I'm on Win32 (no real pipes, so to speak), I hacked up some thread control using temp files. The temp files can pass URLs to the WWW::Mechanize objects, and they can even be used to pass signals. You get into some elegant issues with concurrency and file locks, but it CAN work.
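
    Something like the following untested sketch is roughly what I mean for the Parallel::Simple part (the URL list and the round-robin split are made up for the example; my real version fed each worker its URLs from a temp file instead):

    use WWW::Mechanize;
    use Parallel::Simple qw(prun);

    # Made-up URL list; substitute the real component list here
    my @links = map { "http://static.to.download.to/file$_.css" } 1 .. 20;

    # Deal the URLs out to 4 workers, round-robin
    my @buckets;
    push @{ $buckets[ $_ % 4 ] }, $links[$_] for 0 .. $#links;

    # prun() forks one child per code ref and waits for them all
    # (on Win32 the fork is emulated with threads)
    prun( map {
        my $urls = $_;
        sub {
            my $mech = WWW::Mechanize->new;   # each worker gets its own UA
            $mech->get($_) for @$urls;
        };
    } @buckets ) or die "one or more workers failed\n";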

    Anyway, that was my two cents. Good luck on your project!

      Thanks, it looks like there are no good solutions for this.
Re: Timing web page download.
by Sinistral (Monsignor) on Jul 11, 2012 at 13:19 UTC

    Based on your requirement of getting all resources for a page, including those created by JavaScript, I'd use something that handles JavaScript well on a headless server: Node.js. That engine has the capabilities you need, and NPM means someone has probably already written what you seek (especially since Node is used for testing JavaScript web apps). However, your alternative processor types mean you'll have to figure out whether you can compile it from source instead of using a Windows or Mac OS binary.

      Thanks for the suggestion, but can you point out the NPM package you have in mind that does something similar?

      I'm thinking that if I'm going to parse the web page, then it doesn't matter whether I write the parser in Perl, C, or JS; in fact it would be harder to do in JS, unless you're suggesting compiling it with Node.js and then running that foreign code in a server context.

      In what way is JS better for parsing HTML than Perl?

        The most likely NPM candidate seems like it might be jscrape, which combines jsdom, request, and jquery. The reason I recommended JavaScript / Node as an option is your own language:

        This works more or less the way I intended, but there are two problems. Since the list of links is dynamic, and partly created by JavaScript, I had to use a browser to create that list.

        I need a way of parsing a web page and getting a list of all its components; this is my first problem.

        If you are dealing with pages that use Javascript to dynamically load resources, then you have to have something that can interpret that Javascript as a browser would.

        As something completely different, you might want to check out Selenium.
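
        If you do drive Selenium from Perl, Selenium::Remote::Driver is one way in. A rough sketch (it assumes a Selenium server is already listening on localhost:4444; the small JavaScript snippet just collects every src/href the live DOM knows about after scripts have run):

        use Selenium::Remote::Driver;

        # Assumes a running Selenium server with a browser it can drive
        my $driver = Selenium::Remote::Driver->new(
            remote_server_addr => 'localhost',
            port               => 4444,
        );

        $driver->get('http://web-page.to.download.to/');

        # Ask the live DOM, after JavaScript has run, for every resource URL
        my $urls = $driver->execute_script(q{
            var nodes = document.querySelectorAll('script[src], link[href], img[src]');
            return Array.prototype.map.call(nodes, function (n) {
                return n.src || n.href;
            });
        });

        print "$_\n" for @$urls;
        $driver->quit;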

Re: Timing web page download.
by mrguy123 (Hermit) on Jul 11, 2012 at 15:24 UTC
    For concurrent downloading, I recommend:
    Parallel::ForkManager
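
    An untested sketch of how that could look for the 4-downloader case in the question (the URL list is shortened from the original post):

    use WWW::Mechanize;
    use Parallel::ForkManager;
    use Time::HiRes qw(time);

    my @links = (
        "http://web-page.to.download.to/",
        "http://static.to.download.to/background.jpg",
        "http://static.to.download.to/first.css",
    );

    my $pm    = Parallel::ForkManager->new(4);   # at most 4 children at once
    my $start = time;

    for my $url (@links) {
        $pm->start and next;              # parent: queue the next URL
        my $mech = WWW::Mechanize->new;   # each child needs its own UA
        $mech->get($url);
        $pm->finish;                      # child exits here
    }
    $pm->wait_all_children;

    printf "wall-clock time: %.3fs\n", time - $start;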

    Good luck
    Mister Guy

    Everybody seems to think I'm lazy
    I don't mind, I think they're crazy
Re: Timing web page download.
by sundialsvc4 (Abbot) on Jul 12, 2012 at 12:48 UTC

    I think that the best way to estimate the page-load time of a web page is through simulation, or even deduction, not experiment. Experimental results are too heavily biased by the properties of the network you're on ... which, in the case of in-house testing, is much too fast. (Developers always have the fastest machines.) The behavior of modern-day AJAX-driven web sites makes it difficult to produce truly useful experimental results.

    There are two aspects to the problem: transmission time (as heavily affected by caching), and on-client processing time as done by JavaScript and as perceived by the user who is looking only at the screen display. Since the latter is probably going to be more or less the same on any machine, transmission time is your single biggest wait-time component, and browser caching behavior is your single biggest determinant of it: the size of the files, the number of files (line turnarounds), and the probability of a cache hit (I/O avoidance).

    One of the best and cheapest performance improvements I managed to pull off, one that really made a difference, was to observe (with Firebug) that a lot of pages in one site had originally been generated using Microsoft Word, and that this particular version of Word had generated a separate (but identical) image for every bullet and even for horizontal lines. Even though the image content was identical, a separate file name had been generated for each, so there were many dozens of downloads of the same information ... and many duplicate copies of this data in the database(!) that served them. It also served to “flood out” the client-side cache, which wound up being full of copies of these images (which would never be referred to again). A Perl script to locate and consolidate the identical images, then pass through the HTML to substitute file names, had an enormous positive impact on the whole shebang.
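
    A rough sketch of that consolidation idea (the directory and file names here are made up for illustration):

    use File::Find;
    use Digest::MD5;

    my %seen;      # content digest  => surviving file name
    my %rename;    # duplicate name  => surviving file name

    # Pass 1: find images with byte-identical content and keep one of each
    find( sub {
        return unless /\.(?:gif|png|jpe?g)$/i;
        open my $fh, '<:raw', $_ or return;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        if ( exists $seen{$digest} ) {
            $rename{$_} = $seen{$digest};   # duplicate: remember the mapping
        }
        else {
            $seen{$digest} = $_;            # first copy of this content wins
        }
    }, 'images' );                          # made-up image directory

    # Pass 2: rewrite the HTML to point at the surviving copies
    for my $html ( glob 'pages/*.html' ) {  # made-up HTML location
        local ( $^I, @ARGV ) = ( '.bak', $html );   # in-place edit, keep .bak
        while ( my $line = <> ) {
            $line =~ s/\Q$_\E/$rename{$_}/g for keys %rename;
            print $line;
        }
    }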
