PerlMonks  

Question: Fast way to validate 600K websites

by lihao (Monk)
on May 12, 2008 at 16:40 UTC

lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks:

I have a list of 600K websites that need to be validated. Currently I am using LWP::Simple's head method, like:

perl -MLWP::Simple -lne' my $ret = head("http://$_"); if (not $ret) { print "NOT\t$_" } else { print "OK\t$_"; } ' list.dat | tee results.txt

which works but is very slow. Are there better ways to handle this (perl/wget/lynx/curl)? I've also noticed that LWP::Simple::head returns "0.00" and "00.00" as valid addresses, although I get nothing from my browser with these inputs. Are they treated as localhost? What are the rules for this?

Many thanks

lihao

Replies are listed 'Best First'.
Re: Question: Fast way to validate 600K websites
by grinder (Bishop) on May 12, 2008 at 17:34 UTC

    I've encountered dynamic websites that only implement GET (or rather, they neglect to implement HEAD). Also, what do you mean by "validated"? Someone's listening on port 80? The server returns a 2xx return code? The server is running HTTP 1.1?

    No matter what technique you adopt, you'll find most of your time is spent waiting for the network socket to be established. You will gain a lot by setting up a farm of workers to handle connections in parallel. Parallel::ForkManager is one way, but you'll probably get more mileage from LWP::Parallel.
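    For illustration, here's a rough (untested) sketch of the LWP::Parallel route, using its register()/wait() interface with HEAD requests; the timeout and the idea of batching are arbitrary choices, not recommendations:

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::Parallel::UserAgent;

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->timeout(10);    # seconds per connection
    $pua->redirect(1);    # follow redirects

    # Register a batch of HEAD requests. With 600K URLs you would loop over
    # the list in chunks rather than registering everything at once.
    for my $url ( qw( http://example.com http://example.org ) ) {
        if ( my $err = $pua->register( HTTP::Request->new( HEAD => $url ) ) ) {
            warn $err->error_as_HTML;    # could not even queue the request
        }
    }

    my $entries = $pua->wait();          # run the registered requests in parallel
    for my $entry ( values %$entries ) {
        my $res = $entry->response;
        printf "%s\t%s\n", $res->is_success ? 'OK' : 'NOT', $res->request->url;
    }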

    Oh, and I don't understand your question about 0.00. Can you post a snippet demonstrating the problem?

    Update: I just thought of another thing; this reminds me of something I once wrote. The first step is to see whether the host itself is still around. Extract the host name from the URI and see if you can resolve its address (you'll have to do this anyway, and with a bit of luck you'll warm up your DNS cache in the process). If you can't even resolve the address to an A or CNAME record, there's no point in trying to fetch the page.

    Normally you'll get a negative response back from a DNS server much faster than from a putative web server that's just not there any more.
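
    A rough sketch of that first DNS pass, using only core functions (the output labels are just an example):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Socket qw( inet_ntoa );

    while ( my $line = <> ) {
        chomp $line;
        ( my $host = $line ) =~ s{^https?://}{};    # strip the scheme, if any
        $host =~ s{[/:].*\z}{};                     # keep only the host part

        if ( my $packed = gethostbyname($host) ) {
            printf "RESOLVES\t%s\t%s\n", $host, inet_ntoa($packed);
        }
        else {
            print "NO-DNS\t$host\n";    # no A/CNAME record -- skip the HEAD request
        }
    }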

    • another intruder with the mooring in the heart of the Perl

Re: Question: Fast way to validate 600K websites
by BrowserUk (Patriarch) on May 12, 2008 at 18:35 UTC

    Vary -N=nn to suit your bandwidth:

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;
    use LWP::Simple;

    our $N ||= 10;

    my $Q = new Thread::Queue;
    my @pool = map async {
        print "$_ :", head( $_ ) ? 'ok' : 'not ok'
            while $_ = $Q->dequeue;
    }, 1 .. $N;

    while( <> ) {
        chomp;
        $Q->enqueue( $_ );
    }

    $Q->enqueue( (undef) x $N );
    $_->join for @pool;

    __END__
    C:\test>headUrls.pl -N=20 urls.txt
    http://www.shops-gifts.shopiwon.com/ :not ok
    http://1ezbiz.leadsomatic.com :ok
    http://Indserve.com/kids :not ok
    http://16066.profitmatic.com :ok
    http://1-family.com/office/web/tp514/Boats.shtml :ok
    http://1mboard.proboards28.com/index.cgi :ok
    http://1plus-longdistance.com/domain/ :ok
    http://1stopsquare.com/101xyron.html :ok
    http://1world.leadsomatic.com :ok
    http://1stphoenix.veretekk.com/index.html :ok
    http://1stphoenix.veretekk.com :ok
    http://1bernard.veremail.com/index.html :ok

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi, guys:

      Thank you all for the helpful suggestions :-) I am actually trying to check whether 600K listed domain names are reachable. Many of them are just garbage like 0.00 or hotmailll.com, so I need to discard them (like 000.0.com) or correct them (i.e. from 'hotmaillll.com' to 'hotmail.com'). Right now I have not yet considered sites which disable the 'HEAD' method; at this stage, I will just filter the 'NOT valid' sites into a list and then do more research on that smaller list. :) Most of the information I got from this thread is very helpful, thanks again :)

      lihao

        If you want to know whether the URI is actually reachable, would a simple POSIX 'ping' help you?

      merlyn pointed out years ago that the quickest way to do the actual fetch is to connect a socket on port 80, print a simple "GET / HTTP/1.0\n\n" to the socket, then just read the first x bytes (enough to check for a 200 OK) and disconnect. This saves the data/time overhead of fetching the full page, and also prevents issues with sites that don't give HEAD :-)
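
      Roughly, that approach looks something like this (my own sketch of it; the host, timeout and byte count are arbitrary choices):

      use strict;
      use warnings;
      use IO::Socket::INET;

      sub looks_alive {
          my ($host) = @_;
          my $sock = IO::Socket::INET->new(
              PeerAddr => $host,
              PeerPort => 80,
              Timeout  => 5,
          ) or return 0;
          print $sock "GET / HTTP/1.0\r\nHost: $host\r\n\r\n";
          read $sock, my $status, 32;    # just enough to see "HTTP/1.x 200 OK"
          close $sock;
          return defined $status && $status =~ m{^HTTP/1\.\d\s+2\d\d};
      }

      print looks_alive('www.example.com') ? "OK\n" : "NOT OK\n";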

        Sounds plausible at first, but the time taken to read (most) HEAD request contents pales into insignificance beside the time taken to make the connection and transmit the data in the first place. That is, all you are saving by stopping reading early is the transfer of data from the local TCP/IP buffers into your own process memory.

        The full content has already been transmitted. Your local system has already had to respond to the device interrupts, and the local TCP/IP buffers have already been allocated to accommodate it. Even if the remote server actually wrote the 200 OK as a separate write to the outgoing socket, the TCP/IP layer at that end will probably delay its transmission until it has enough to fill a standard transmission buffer (1536 bytes or some such?).

        So no, I seriously doubt that you'd save much time doing it this way, except for the rare instances where the HTTP server is running on the same box, or the content of the HEAD request is on the order of hundreds of kilobytes.

        Besides which, the major delays when doing this task serially come when the DNS lookup fails, or when the server doesn't exist and you fall back on TCP timeouts before moving on. Saving a few bytes of reading will be neither here nor there in comparison with network delays and timeouts.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Question: Fast way to validate 600K websites
by starbolin (Hermit) on May 12, 2008 at 17:41 UTC

    It's not the tool; it's the structure. If you have to wait for each site to respond before doing a GET on the next site, it's going to take forever. You need a way to issue a block of GETs, forking a child to handle each one, then process the ones that respond and issue new GETs as processes are freed up. Perhaps read perlipc. There are some good tools out there to make this kind of thing more robust (if not less painful); see: POE

    Update: I'm wrong, the module you want is LWP::Parallel as grinder points out. The module documentation even provides the code you want.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
Re: Question: Fast way to validate 600K websites
by pc88mxer (Vicar) on May 12, 2008 at 17:26 UTC
    I would try parallelizing the process. Even running two processes on the same machine might help. Just divide the list.dat file into two, and try this:
    perl -MLWP::Simple -lne '...' list-part1.dat > results-1.txt &
    perl -MLWP::Simple -lne '...' list-part2.dat > results-2.txt &
    I have a feeling a lot of the running time is being spent waiting for the remote sites to accept the connection and respond.
Re: Question: Fast way to validate 600K websites
by derby (Abbot) on May 12, 2008 at 17:56 UTC

    Wow! 600K ... this is going to saturate someone's bandwidth. By validate, I'm assuming you just want to ensure the URLs are still active. There are a lot of reasons why a URL that doesn't resolve right now may resolve 5 minutes from now, so there's no real way to do this 100% (but hey, that's the nature of TCP/IP). One way to do it is to look at Parallel::ForkManager - it actually uses this scenario in its documentation (although you may want to just do a get instead of a getstore).
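
    For what it's worth, a minimal sketch along those lines (using head() rather than getstore(); the limit of 30 children is an arbitrary figure):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(30);    # at most 30 concurrent children

    while ( my $site = <> ) {
        chomp $site;
        $pm->start and next;    # parent: move straight on to the next site
        print head("http://$site") ? "OK\t$site\n" : "NOT\t$site\n";
        $pm->finish;            # child exits here
    }
    $pm->wait_all_children;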

    -derby
Re: Question: Fast way to validate 600K websites
by dragonchild (Archbishop) on May 12, 2008 at 17:20 UTC
    What part is slow? Is it the code within LWP::Simple or the fact that you're going out over the net and doing something? Is this something you can fork?

    As for the quickest way to see what's on port 80, it's telnet using the HEAD command.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
