in reply to Question: Fast way to validate 600K websites

Vary -N=nn to suit your bandwidth:

#! perl -slw
use strict;
use threads;
use Thread::Queue;
use LWP::Simple;

our $N ||= 10;

my $Q = new Thread::Queue;

my @pool = map async{
    print "$_ :", head( $_ ) ? 'ok' : 'not ok'
        while $_ = $Q->dequeue;
}, 1 .. $N;

while( <> ) {
    chomp;
    $Q->enqueue( $_ );
}

$Q->enqueue( (undef) x $N );
$_->join for @pool;

__END__
C:\test>headUrls.pl -N=20 urls.txt
http://www.shops-gifts.shopiwon.com/ :not ok
http://1ezbiz.leadsomatic.com :ok
http://Indserve.com/kids :not ok
http://16066.profitmatic.com :ok
http://1-family.com/office/web/tp514/Boats.shtml :ok
http://1mboard.proboards28.com/index.cgi :ok
http://1plus-longdistance.com/domain/ :ok
http://1stopsquare.com/101xyron.html :ok
http://1world.leadsomatic.com :ok
http://1stphoenix.veretekk.com/index.html :ok
http://1stphoenix.veretekk.com :ok
http://1bernard.veremail.com/index.html :ok

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Question: Fast way to validate 600K websites
by lihao (Monk) on May 12, 2008 at 19:35 UTC

    Hi, guys:

    Thank you all for the helpful suggestions :-) I am actually trying to check whether 600K listed domain names are reachable. Many of them are just garbage, like 0.00 or hotmailll.com, so I need to discard them (like 000.0.com) or correct them (e.g. from 'hotmaillll.com' to 'hotmail.com'). Right now I have not yet considered sites which disable the 'HEAD' method. At this stage, I will just filter the 'NOT valid' sites out into a list and then do more searching on that smaller list. :) Most of the information I have got so far from this thread is very helpful, thanks again :)

    lihao
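
Some of the garbage entries described above can be discarded before any network traffic is generated. The following is a minimal sketch, assuming plain ASCII host names; the helper name looks_like_hostname is hypothetical and not part of any module. It checks syntax only, so '0.00' is rejected, while 'hotmailll.com' and '000.0.com' pass and still need a DNS lookup or a request to be ruled out.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $host is syntactically plausible as a DNS name.
# This is a syntax check only; it does not prove that the name resolves.
sub looks_like_hostname {
    my( $host ) = @_;
    return 0 unless defined $host;
    return 0 unless length( $host ) >= 1 and length( $host ) <= 253;

    # Keep trailing empty fields so that 'example.com.' yields an empty label.
    my @labels = split /\./, $host, -1;
    return 0 if @labels < 2;    # require at least name.tld

    for my $label ( @labels ) {
        # 1 to 63 characters: letters, digits, hyphens; no leading or trailing hyphen.
        return 0
            unless $label =~ /\A[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\z/;
    }

    # Assumption: the last label must contain a letter, which rejects '0.00'.
    return 0 unless $labels[-1] =~ /[A-Za-z]/;
    return 1;
}
```

Names that survive this filter could then be fed to the threaded checker, one per line.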

      If you want to know whether the URI is actually reachable, would a simple posix 'ping' help you?
Re^2: Question: Fast way to validate 600K websites
by tachyon-II (Chaplain) on May 13, 2008 at 16:40 UTC

    merlyn pointed out years ago that the quickest way to do the actual fetch is to connect a socket on port 80, print a simple "GET / HTTP/1.0\n\n" to the socket, then just read the first x bytes (enough to check for a 200 OK) and disconnect. This saves the data/time overhead of fetching the full page, and also prevents issues with sites that don't give HEAD :-)
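
The socket approach described above might look roughly like the sketch below. The helper names quick_status and status_from_response are hypothetical. Two details differ from the quoted request and are assumptions of this sketch: it uses CRLF line endings, as HTTP specifies, and it adds a Host header, because many servers host several sites on one address. Note that the Timeout option of IO::Socket::INET covers the connect only, so the read can still block on a slow server.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: pull the three-digit status code out of the start
# of an HTTP response, or return undef if there is no status line.
sub status_from_response {
    my( $buf ) = @_;
    return $buf =~ m{\AHTTP/\d\.\d\s+(\d{3})\b} ? $1 : undef;
}

# Hypothetical helper: connect to port 80, send a minimal GET, and read
# only the first few bytes of the reply before disconnecting.
sub quick_status {
    my( $host, $timeout ) = @_;
    $timeout ||= 10;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => $timeout,    # applies to the connect, not to the read
    ) or return;

    print $sock "GET / HTTP/1.0\r\nHost: $host\r\n\r\n";

    # A single read is normally enough to capture the status line.
    my $buf = '';
    $sock->sysread( $buf, 64 );
    close $sock;

    return status_from_response( $buf );
}
```

A caller would treat a returned '200' as reachable, and would have to decide separately how to count redirects such as 301 or 302.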

      Sounds plausible at first, but the time taken to read (most) head request contents pales into insignificance next to the time taken to make the connection and transmit it in the first place. That is, all you are saving by stopping reading early is the transfer of data from the local tcpip buffer stack into your own process memory.

      The full content has already been transmitted. Your local system has already had to respond to the device interrupts. And the local tcpip buffers have already been allocated to accommodate it. Even if the remote server actually wrote the 200 OK as a separate write to the outgoing socket, the tcpip layer at that end will probably delay its transmission until it has enough to fill a standard transmission buffer (1536 bytes or some such?).

      So no, I seriously doubt that you'd save much time doing it this way except for the rare instances where the http server is running in the same box, or the content of the head request was in the order of 100s of kbytes.

      Besides which, the major delays when doing this task serially are when the DNS lookup fails, or the server doesn't exist and you fall back on tcp timeouts before moving on. Saving reading a few bytes will be neither here nor there in comparison with network delays and timeouts.
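
Since failed lookups and dead servers dominate the run time, one knob worth trying is a shorter timeout than the LWP::UserAgent default of 180 seconds. This is a minimal sketch, not the code posted above: make_agent and check_url are hypothetical helpers, and the 5-second value is an arbitrary assumption to tune. The timeout is an inactivity timeout on the connection, so a whole request can still take longer than that, and it may not bound the DNS lookup itself.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that gives up on idle
# connections after $timeout seconds (assumed default here: 5).
sub make_agent {
    my( $timeout ) = @_;
    return LWP::UserAgent->new( timeout => $timeout || 5 );
}

# Hypothetical helper: issue a HEAD request and report the result in the
# same 'ok' / 'not ok' form used by the threaded script.
sub check_url {
    my( $ua, $url ) = @_;
    return $ua->head( $url )->is_success ? 'ok' : 'not ok';
}
```

In a threaded checker, each worker would create its own agent once with make_agent and reuse it for every URL it dequeues.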

