ttlgreen has asked for the wisdom of the Perl Monks concerning the following question:

Greetings perl monks!

I'm in the process of writing my first script that makes use of sockets, and it's SLOW. All it does is connect to a web server, log in if necessary, and fetch the requested page.

The point was to stop using $somevar = system('lynx -dump http://thesite.com'); because of the unnecessary overhead of calling lynx to make a simple GET request. Unfortunately, it seems like lynx is actually faster! Am I doing something dumb here that's causing this?

use Socket;   # provides inet_aton, sockaddr_in, PF_INET, SOCK_STREAM

my $server = "theSite.com";
my $proto  = getprotobyname("tcp") || 6;
my $port   = getservbyname("http", "tcp") || 80;

my $packed_remote_ip = inet_aton($server);
my $s_server         = sockaddr_in($port, $packed_remote_ip);

$req1  = "GET /board/ HTTP/1.0\r\n";
$req1 .= "Host: $server\r\n";
$req1 .= "Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01\r\n";
$req1 .= "Accept-Language: en\r\n";
$req1 .= "User-Agent: Lynx/2.8.6rel.4 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.8i\r\n";
$req1 .= 'Cookie2: $Version="1"' . "\r\n";  # I have no idea what this is for but lynx sends it so.. so am I!
$req1 .= "\r\n";

socket(SK_CLIENT, PF_INET, SOCK_STREAM, $proto) || die("Socket Error req1: $! \n");
connect(SK_CLIENT, $s_server) || die("Connect Error: $! \n");
send(SK_CLIENT, $req1, 0) || print "send error: $! \n";

$data = 1;
while ($data) {
    recv(SK_CLIENT, $data, 100, 0);
    $thePage .= $data;
}

close SK_CLIENT || die("Close error: $!");
print "$thePage\n";

Re: My first socket program is SLOW?
by BrowserUk (Patriarch) on Jan 14, 2009 at 23:23 UTC

    If the page you're fetching is largish, reading it in 100-byte chunks and building it up by repeated concatenation:

    while ($data) {
        recv(SK_CLIENT, $data, 100, 0);
        $thePage .= $data;
    }

    could be a part of your problem. Something like

    1 while read( SK_CLIENT, $thePage, 4096, length( $thePage ) );

    might run a little more quickly.

    Better still might be to read the first few lines of the response line by line and look for the Content-length header, and then read the rest in one go.
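
    Untested, but a minimal sketch of that header-then-body idea, used in place of your recv loop and reusing the SK_CLIENT handle from your code (and assuming the server actually sends a Content-length header), might look like:

    my $content_length;
    {
        local $/ = "\r\n";               # HTTP header lines end in CRLF
        while ( my $line = <SK_CLIENT> ) {
            last if $line eq "\r\n";     # a blank line ends the headers
            $content_length = $1 if $line =~ /^Content-length:\s*(\d+)/i;
            $thePage .= $line;
        }
    }
    if ( defined $content_length ) {     # size known: read the body in one go
        read( SK_CLIENT, my $body, $content_length );
        $thePage .= $body;
    }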


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: My first socket program is SLOW?
by marto (Cardinal) on Jan 14, 2009 at 23:17 UTC

    How slow is 'SLOW'? Is the only purpose of your program to print the requested page?

    If you are not married to using sockets, the following example uses the LWP::Simple module to do a similar thing:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::Simple;

    my $doc = get 'http://www.perl.org';
    print $doc;

    Of course, this is just a very basic example to get you started. In addition to the LWP documentation, see LWP Cookbook for further information and examples.

    Other popular modules along these lines include WWW::Mechanize, which may be of interest, depending on what exactly you want to do.
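
    Since you mention logging in, here is a rough WWW::Mechanize sketch; the form number and field names below are invented, so substitute whatever your site's login form actually uses:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://thesite.com/board/');

    # Hypothetical login form -- the form number and the
    # field names depend entirely on the site in question.
    $mech->submit_form(
        form_number => 1,
        fields      => {
            username => 'ttlgreen',
            password => 'secret',
        },
    );

    print $mech->content;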

    Hope this helps,

    Martin

Re: My first socket program is SLOW?
by gone2015 (Deacon) on Jan 15, 2009 at 00:55 UTC

    As brother marto says... how slow is it in comparison? And what total time is involved here?

    I'm trying to think how the request/response you are doing by hand would differ from what Lynx would be doing... lookup DNS, open TCP connection, send request, collect reply, close TCP connection. It's hard to see what could be different -- assuming the request is the same. I would expect the time to be dominated by network operations and server load, unless the server is very local (on the same machine or in the LAN, perhaps).

    So, are you sure that the Perl code and Lynx are performing markedly differently under roughly the same network conditions, and with the server roughly equally loaded? I'd test both methods in quick succession, a number of times, to try to eliminate that sort of variation. If the DNS lookup is a significant part of the time, a ping or something just before the test should cause any cache to be filled, minimising the variability caused by DNS.

    If there's still a definite speed difference, it's possible that Lynx is being clever. I would tcpdump what Lynx does and what the piece of Perl does, and see if Lynx is getting the TCP Window to open larger or faster. Or try strace to see how the sockets are being driven differently.
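
    By way of illustration, a crude timing harness along those lines might look like this -- fetch_via_socket here is a hypothetical wrapper around however you've packaged your socket code:

    use strict;
    use warnings;
    use Time::HiRes qw( gettimeofday tv_interval );

    my $url = 'http://thesite.com/board/';

    for my $run ( 1 .. 5 ) {
        my $t0 = [gettimeofday];
        my $via_lynx = `lynx -dump $url`;
        my $t1 = [gettimeofday];
        my $via_perl = fetch_via_socket($url);   # hypothetical: your socket code wrapped in a sub
        my $t2 = [gettimeofday];
        printf "run %d: lynx %.3fs, sockets %.3fs\n",
            $run, tv_interval( $t0, $t1 ), tv_interval( $t1, $t2 );
    }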

      Thanks for the replies everyone... (I wish there was a button on here to reply to everyone at once ;)

      It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.

      I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regexes) would be to perform the requests simultaneously rather than one after another. If that can be done it would be plenty fast enough for me.

      I've been looking all over for a Perl equivalent to putting "&" after a command in bash, but I can't find anything other than the fork function.

      Basically I've got a function in a module I wrote called getPage($url) that fetches the page. Is there any way to background each getPage() call so they all run at once?

      I would just do "lynx -dump $url &" but I'm trying to make this a portable script that doesn't require a bunch of Linux programs to function properly. I'm pretty new to Perl still, so I'm sure there must be some way I just don't know of.

        I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regexes) would be to perform the requests simultaneously rather than one after another.

        It's simple with threads:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use LWP::Simple;

        my @urls = qw[
            http://news.bbc.co.uk/1/hi/default.stm
            http://q-lang.sourceforge.net/qdoc.html
            file://localhost/c:/perl/html/index.html
            http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
            http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
        ];

        my @threads;
        my %results :shared;

        for my $url ( @urls ) {
            push @threads, async {
                printf "%d : fetching $url\n", threads->tid;
                $results{ $url } = get $url
                    and printf "%d : got %d bytes\n",
                        threads->tid, length $results{ $url }
                    or warn "failed to get $url";
            };
        }

        print "Waiting for threads";
        $_->join for @threads;
        print "threads completed";

        ### process the content here.
        print "$_ : ", length $results{ $_ } for keys %results;

        Output:

        >t-get.pl
        1 : fetching http://news.bbc.co.uk/1/hi/default.stm
        2 : fetching http://q-lang.sourceforge.net/qdoc.html
        3 : fetching file://localhost/c:/perl/html/index.html
        Waiting for threads
        4 : fetching http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
        5 : fetching http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
        3 : got 378 bytes
        5 : got 56912 bytes
        4 : got 65372 bytes
        1 : got 76380 bytes
        2 : got 1244518 bytes
        threads completed
        http://news.bbc.co.uk/1/hi/default.stm : 76380
        http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html : 56912
        http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm : 65372
        file://localhost/c:/perl/html/index.html : 378
        http://q-lang.sourceforge.net/qdoc.html : 1244518

        Though if the list of urls to fetch grows much beyond a dozen or so, you'd want a slightly more sophisticated version that metered the number of concurrent gets and re-used the threads (a sketch follows below). That's also quite simple to write, but requires a little more thought.
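
        For what it's worth, one shape that pooled version might take, using Thread::Queue to feed a fixed set of workers (the pool size of 4 and the undef end-markers are arbitrary choices for this sketch):

        use strict;
        use threads;
        use threads::shared;
        use Thread::Queue;
        use LWP::Simple;

        use constant POOL_SIZE => 4;   # arbitrary; tune to taste

        my @urls = qw[
            http://news.bbc.co.uk/1/hi/default.stm
            http://q-lang.sourceforge.net/qdoc.html
        ];

        my $queue = Thread::Queue->new();
        my %results :shared;

        my @workers = map {
            threads->create( sub {
                # each worker pulls urls until it sees the undef end-marker
                while ( defined( my $url = $queue->dequeue ) ) {
                    my $content = get $url;
                    $results{ $url } = defined $content ? $content : '';
                }
            } );
        } 1 .. POOL_SIZE;

        $queue->enqueue( @urls );
        $queue->enqueue( ( undef ) x POOL_SIZE );   # one end-marker per worker
        $_->join for @workers;

        print "$_ : ", length $results{ $_ } for keys %results;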


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Well, you have multiple options. First, there is the possibility of forking (a sketch follows at the end of this reply). Secondly, you can run a bunch of system commands with &s, then proceed when wait says that they are done. Thirdly, you can use asynchronous IO, which basically means having multiple handles open and using select to read from whichever of them has data ready.

        If you want to follow the last approach, there is a discussion of how to do it in raw code in perlipc. Life may become easier (at least once you've learned the library) if you use a library that is meant to support asynchronous IO. CPAN has many such libraries, including POE, Event::Lib and, the new kid on the block, IO::Lambda. If you want to look at the last of those, you probably want to read the discussion at IO::Lambda: call for participation, and particularly my response at Re^2: regarding 1.02 (was Re: IO::Lambda: call for participation).
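
        As for the forking option, a minimal sketch -- using LWP::Simple rather than raw sockets, and a pipe per child to get each page back to the parent -- might look like:

        use strict;
        use warnings;
        use LWP::Simple;

        my @urls = ( 'http://www.perl.org', 'http://www.cpan.org' );
        my %fh_for;    # url => read end of that child's pipe

        for my $url ( @urls ) {
            pipe( my $reader, my $writer ) or die "pipe: $!";
            my $pid = fork();
            die "fork: $!" unless defined $pid;
            if ( $pid == 0 ) {                  # child: fetch, write back, exit
                close $reader;
                my $content = get $url;
                print {$writer} defined $content ? $content : '';
                close $writer;
                exit 0;
            }
            close $writer;                      # parent keeps only the read end
            $fh_for{ $url } = $reader;
        }

        for my $url ( keys %fh_for ) {
            my $fh = $fh_for{ $url };
            my $page = do { local $/; <$fh> };  # slurp the whole page
            close $fh;
            print "$url : ", length( $page ), " bytes\n";
        }
        wait() for @urls;                       # reap the children

        (Note that on Windows fork is emulated with interpreter threads, so the threaded example above may be the more natural route there.)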

        It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.

        The biggest win will be from fetching things in parallel, so you may be leaving this as an unsolved puzzle...

        ...in case it's still of interest: apart from the timing of packets, the key thing to look at is the Window Size. If the receiver window size is small, then the sender may delay sending packets.

        This is what Wireshark shows for a trivial example TCP session:

          No Time     Src  Dest   P Info
           1 00.884 ..130 ..141 TCP ..4 > ..1 [SYN] Seq=0 Len=0 MSS=1460 WS=6
           2 00.884 ..141 ..130 TCP ..1 > ..4 [SYN, ACK] Seq=0 Ack=1 Win=60984 Len=0 MSS=1452 WS=0
           3 00.884 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=1 Ack=1 Win=5888 Len=0
           4 00.892 ..130 ..141 TCP ..4 > ..1 [PSH, ACK] Seq=1 Ack=1 Win=5888 Len=87
           5 00.893 ..130 ..141 TCP ..4 > ..1 [FIN, PSH, ACK] Seq=88 Ack=1 Win=5888 Len=90
           6 00.893 ..141 ..130 TCP ..1 > ..4 [ACK] Seq=1 Ack=179 Win=60807 Len=0
           7 01.000 ..141 ..130 TCP ..1 > ..4 [FIN, ACK] Seq=1 Ack=179 Win=60807 Len=0
           8 01.000 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=179 Ack=2 Win=5888 Len=0
        
        where I've edited the time, IP addresses and port numbers in the interests of conciseness. This shows:

        1. ..130 opening a TCP conversation with ..141; note the "Window Scaling" WS=6.

        2. ..141 acknowledging the TCP open, returning a "Window Size" of Win=60984 and specifying no "Window Scaling" WS=0.

        3. ..130 completing the TCP open, and returning a "Window Size" of Win=5888 -- which, given ..130's declared scaling of WS=6, means its Window Size is 5888 * 2**6 = 376832 (!)

        4. etc.: the rest of the TCP conversation -- noting that all Window Sizes returned by ..130 must be multiplied up by its declared Window Scaling.

        The thing to look for is the Window Size being advertised by your machine. If this reduces to something small when using Perl, but not when using Lynx, then that may be the problem. Mind you, these days Window Sizes are pretty big!

        The other thing to look for would be your machine being slower to acknowledge stuff when using Perl, or acknowledging smaller amounts each time than when using Lynx.