ttlgreen has asked for the wisdom of the Perl Monks concerning the following question:

Greetings perl monks!

I'm in the process of writing my first script that makes use of sockets, and it's SLOW. All it does is connect to a web server, log in if necessary, and fetch the requested page.

The point was to stop using $somevar = system('lynx -dump http://thesite.com'); because of the unnecessary overhead of calling lynx to make a simple GET request. Unfortunately, it seems like lynx is actually faster! Am I doing something dumb here that's causing this?

use Socket;   # provides inet_aton, sockaddr_in, PF_INET, SOCK_STREAM

my $server = "theSite.com";
my $proto  = getprotobyname("tcp") || 6;
my $port   = getservbyname("http", "tcp") || 80;

my $packed_remote_ip = inet_aton($server);
my $s_server         = sockaddr_in($port, $packed_remote_ip);

$req1  = "GET /board/ HTTP/1.0\r\n";
$req1 .= "Host: $server\r\n";
$req1 .= "Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01\r\n";
$req1 .= "Accept-Language: en\r\n";
$req1 .= "User-Agent: Lynx/2.8.6rel.4 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.8i\r\n";
$req1 .= 'Cookie2: $Version="1"' . "\r\n";  # I have no idea what this is for but lynx sends it so.. so am I!
$req1 .= "\r\n";

socket(SK_CLIENT, PF_INET, SOCK_STREAM, $proto) || die("Socket Error req1: $! \n");
connect(SK_CLIENT, $s_server) || die("Connect Error: $! \n");
send(SK_CLIENT, $req1, 0) || print "send error: $! \n";

$data = 1;
while ($data) {
    recv(SK_CLIENT, $data, 100, 0);
    $thePage .= $data;
}

close SK_CLIENT || die("Close error: $!");
print "$thePage\n";

Re: My first socket program is SLOW?
by BrowserUk (Patriarch) on Jan 14, 2009 at 23:23 UTC

    If the page you're fetching is largish, reading it in 100-byte chunks and building it up by repeated concatenation:

    while ($data) {
        recv(SK_CLIENT, $data, 100, 0);
        $thePage .= $data;
    }

    could be a part of your problem. Something like

    1 while read( SK_CLIENT, $thePage, 4096, length( $thePage ) );

    might run a little more quickly.

    Better still might be to read the first few lines of the response line by line and look for the Content-length header, and then read the rest in one go.
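
    Untested, but a minimal sketch of that header-then-body idea, used in place of your recv loop and reusing the SK_CLIENT handle from your code (and assuming the server actually sends a Content-length header), might look like:

    my $content_length;
    {
        local $/ = "\r\n";               # HTTP header lines end in CRLF
        while ( my $line = <SK_CLIENT> ) {
            last if $line eq "\r\n";     # a blank line ends the headers
            $content_length = $1 if $line =~ /^Content-length:\s*(\d+)/i;
            $thePage .= $line;
        }
    }
    if ( defined $content_length ) {     # size known: read the body in one go
        read( SK_CLIENT, my $body, $content_length );
        $thePage .= $body;
    }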


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: My first socket program is SLOW?
by marto (Cardinal) on Jan 14, 2009 at 23:17 UTC

    How slow is 'SLOW'? Is the only purpose of your program to print the requested page?

    If you are not married to using sockets, the following example uses the LWP::Simple module to do a similar thing:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::Simple;

    my $doc = get 'http://www.perl.org';
    print $doc;

    Of course, this is just a very basic example to get you started. In addition to the LWP documentation, see LWP Cookbook for further information and examples.

    Other popular modules along these lines include WWW::Mechanize, which may be of interest, depending on what exactly you want to do.
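
    Since you mention logging in, here is a rough WWW::Mechanize sketch; the form number and field names below are invented, so substitute whatever your site's login form actually uses:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://thesite.com/board/');

    # Hypothetical login form -- the form number and the
    # field names depend entirely on the site in question.
    $mech->submit_form(
        form_number => 1,
        fields      => {
            username => 'ttlgreen',
            password => 'secret',
        },
    );

    print $mech->content;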

    Hope this helps,

    Martin

Re: My first socket program is SLOW?
by gone2015 (Deacon) on Jan 15, 2009 at 00:55 UTC

    As brother marto says... how slow is it in comparison? And what total time is involved here?

    I'm trying to think how the request/response you are doing by hand would differ from what Lynx would be doing... lookup DNS, open TCP connection, send request, collect reply, close TCP connection. It's hard to see what could be different -- assuming the request is the same. I would expect the time to be dominated by network operations and server load, unless the server is very local (on the same machine or in the LAN, perhaps).

    So, are you sure that the Perl code and Lynx are performing markedly differently under roughly the same network conditions, and with the server roughly equally loaded? I'd test both methods in quick succession, a number of times, to try to eliminate that sort of variation. If the DNS lookup is a significant part of the time, a ping or something just before the test should cause any cache to be filled, minimising the variability caused by DNS.

    If there's still a definite speed difference, it's possible that Lynx is being clever. I would tcpdump what Lynx does and what the piece of Perl does, and see if Lynx is getting the TCP Window to open larger or faster. Or try strace to see how the sockets are being driven differently.
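
    By way of illustration, a crude timing harness along those lines might look like this -- fetch_via_socket here is a hypothetical wrapper around however you've packaged your socket code:

    use strict;
    use warnings;
    use Time::HiRes qw( gettimeofday tv_interval );

    my $url = 'http://thesite.com/board/';

    for my $run ( 1 .. 5 ) {
        my $t0 = [gettimeofday];
        my $via_lynx = `lynx -dump $url`;
        my $t1 = [gettimeofday];
        my $via_perl = fetch_via_socket($url);   # hypothetical: your socket code wrapped in a sub
        my $t2 = [gettimeofday];
        printf "run %d: lynx %.3fs, sockets %.3fs\n",
            $run, tv_interval( $t0, $t1 ), tv_interval( $t1, $t2 );
    }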

      Thanks for the replies everyone... (I wish there was a button on here to reply to everyone at once ;)

      It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.

      I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regexes) would be to perform the requests simultaneously rather than one after another. If that can be done it would be plenty fast enough for me.

      I've been looking all over for a Perl equivalent to putting "&" after a command in bash, but I can't find anything other than the fork function.

      Basically I've got a function in a module I wrote called getPage($url) that fetches the page. Is there any way to background each getPage() call so they all run at once?

      I would just do "lynx -dump $url &" but I'm trying to make this a portable script that doesn't require a bunch of Linux programs to function properly. I'm pretty new to Perl still, so I'm sure there must be some way I just don't know of.

        I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regexes) would be to perform the requests simultaneously rather than one after another.

        It's simple with threads:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use LWP::Simple;

        my @urls = qw[
            http://news.bbc.co.uk/1/hi/default.stm
            http://q-lang.sourceforge.net/qdoc.html
            file://localhost/c:/perl/html/index.html
            http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
            http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
        ];

        my @threads;
        my %results :shared;

        for my $url ( @urls ) {
            push @threads, async {
                printf "%d : fetching $url\n", threads->tid;
                $results{ $url } = get $url
                    and printf "%d : got %d bytes\n",
                        threads->tid, length $results{ $url }
                    or warn "failed to get $url";
            };
        }

        print "Waiting for threads";
        $_->join for @threads;
        print "threads completed";

        ### process the content here.
        print "$_ : ", length $results{ $_ } for keys %results;

        Output:

        >t-get.pl
        1 : fetching http://news.bbc.co.uk/1/hi/default.stm
        2 : fetching http://q-lang.sourceforge.net/qdoc.html
        3 : fetching file://localhost/c:/perl/html/index.html
        Waiting for threads
        4 : fetching http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
        5 : fetching http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
        3 : got 378 bytes
        5 : got 56912 bytes
        4 : got 65372 bytes
        1 : got 76380 bytes
        2 : got 1244518 bytes
        threads completed
        http://news.bbc.co.uk/1/hi/default.stm : 76380
        http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html : 56912
        http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm : 65372
        file://localhost/c:/perl/html/index.html : 378
        http://q-lang.sourceforge.net/qdoc.html : 1244518

        Though if the list of urls to fetch grows much beyond a dozen or so, you'd want a slightly more sophisticated version that metered the number of concurrent gets and re-used the threads (a sketch follows below). That's also quite simple to write, but requires a little more thought.
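
        For what it's worth, one shape that pooled version might take, using Thread::Queue to feed a fixed set of workers (the pool size of 4 and the undef end-markers are arbitrary choices for this sketch):

        use strict;
        use threads;
        use threads::shared;
        use Thread::Queue;
        use LWP::Simple;

        use constant POOL_SIZE => 4;   # arbitrary; tune to taste

        my @urls = qw[
            http://news.bbc.co.uk/1/hi/default.stm
            http://q-lang.sourceforge.net/qdoc.html
        ];

        my $queue = Thread::Queue->new();
        my %results :shared;

        my @workers = map {
            threads->create( sub {
                # each worker pulls urls until it sees the undef end-marker
                while ( defined( my $url = $queue->dequeue ) ) {
                    my $content = get $url;
                    $results{ $url } = defined $content ? $content : '';
                }
            } );
        } 1 .. POOL_SIZE;

        $queue->enqueue( @urls );
        $queue->enqueue( ( undef ) x POOL_SIZE );   # one end-marker per worker
        $_->join for @workers;

        print "$_ : ", length $results{ $_ } for keys %results;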


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Well, you have multiple options. First, there is the possibility of forking (a sketch follows at the end of this reply). Secondly, you can run a bunch of system commands with &s, then proceed when wait says that they are done. Thirdly, you can use asynchronous IO, which basically means having multiple handles open and using select to read from whichever of them has data ready.

        If you want to follow the last approach, there is a discussion of how to do it in raw code in perlipc. Life may become easier (at least once you've learned the library) if you use a library that is meant to support asynchronous IO. CPAN has many such libraries, including POE, Event::Lib and, the new kid on the block, IO::Lambda. If you want to look at the last of those, you probably want to read the discussion at IO::Lambda: call for participation, and particularly my response at Re^2: regarding 1.02 (was Re: IO::Lambda: call for participation).
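
        As for the forking option, a minimal sketch -- using LWP::Simple rather than raw sockets, and a pipe per child to get each page back to the parent -- might look like:

        use strict;
        use warnings;
        use LWP::Simple;

        my @urls = ( 'http://www.perl.org', 'http://www.cpan.org' );
        my %fh_for;    # url => read end of that child's pipe

        for my $url ( @urls ) {
            pipe( my $reader, my $writer ) or die "pipe: $!";
            my $pid = fork();
            die "fork: $!" unless defined $pid;
            if ( $pid == 0 ) {                  # child: fetch, write back, exit
                close $reader;
                my $content = get $url;
                print {$writer} defined $content ? $content : '';
                close $writer;
                exit 0;
            }
            close $writer;                      # parent keeps only the read end
            $fh_for{ $url } = $reader;
        }

        for my $url ( keys %fh_for ) {
            my $fh = $fh_for{ $url };
            my $page = do { local $/; <$fh> };  # slurp the whole page
            close $fh;
            print "$url : ", length( $page ), " bytes\n";
        }
        wait() for @urls;                       # reap the children

        (Note that on Windows fork is emulated with interpreter threads, so the threaded example above may be the more natural route there.)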

        It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.

        The biggest win will be from fetching things in parallel, so you may be leaving this as an unsolved puzzle...

        ...in case it's still of interest: apart from the timing of packets, the key thing to look at is the Window Size. If the receiver window size is small, then the sender may delay sending packets.

        This is what Wireshark shows for a trivial example TCP session:

          No Time     Src  Dest   P Info
           1 00.884 ..130 ..141 TCP ..4 > ..1 [SYN] Seq=0 Len=0 MSS=1460 WS=6
           2 00.884 ..141 ..130 TCP ..1 > ..4 [SYN, ACK] Seq=0 Ack=1 Win=60984 Len=0 MSS=1452 WS=0
           3 00.884 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=1 Ack=1 Win=5888 Len=0
           4 00.892 ..130 ..141 TCP ..4 > ..1 [PSH, ACK] Seq=1 Ack=1 Win=5888 Len=87
           5 00.893 ..130 ..141 TCP ..4 > ..1 [FIN, PSH, ACK] Seq=88 Ack=1 Win=5888 Len=90
           6 00.893 ..141 ..130 TCP ..1 > ..4 [ACK] Seq=1 Ack=179 Win=60807 Len=0
           7 01.000 ..141 ..130 TCP ..1 > ..4 [FIN, ACK] Seq=1 Ack=179 Win=60807 Len=0
           8 01.000 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=179 Ack=2 Win=5888 Len=0
        
        where I've edited the time, IP addresses and port numbers in the interests of conciseness. This shows:

        1. ..130 opening a TCP conversation with ..141; note the "Window Scaling" WS=6.

        2. ..141 acknowledging the TCP open, returning a "Window Size" of Win=60984 and specifying no "Window Scaling" WS=0.

        3. ..130 completing the TCP open, and returning a "Window Size" of Win=5888 -- which, given ..130's declared scaling of WS=6, means its Window Size is 5888 * 2**6 = 376832 (!)

        4. etc.: the rest of the TCP conversation -- noting that all Window Sizes returned by ..130 must be multiplied up by its declared Window Scaling.

        The thing to look for is the Window Size being advertised by your machine. If this reduces to something small when using Perl, but not when using Lynx, then that may be the problem. Mind you, these days Window Sizes are pretty big!

        The other thing to look for would be your machine being slower to acknowledge stuff when using Perl, or acknowledging smaller amounts each time than when using Lynx.