#! perl -slw
use strict;
use threads;
use threads::shared;
use LWP::Simple;

my @urls = qw[
    http://news.bbc.co.uk/1/hi/default.stm
    http://q-lang.sourceforge.net/qdoc.html
    file://localhost/c:/perl/html/index.html
    http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
    http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
];

my @threads;
my %results :shared;

for my $url ( @urls ) {
    push @threads, async {
        printf "%d : fetching $url\n", threads->tid;
        $results{ $url } = get $url
            and printf "%d : got %d bytes\n",
                threads->tid, length $results{ $url }
            or warn "failed to get $url";
    };
}

print "Waiting for threads";
$_->join for @threads;
print "threads completed";

### process the content here.
print "$_ : ", length $results{ $_ } for keys %results;
Output: >t-get.pl
1 : fetching http://news.bbc.co.uk/1/hi/default.stm
2 : fetching http://q-lang.sourceforge.net/qdoc.html
3 : fetching file://localhost/c:/perl/html/index.html
Waiting for threads
4 : fetching http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
5 : fetching http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
3 : got 378 bytes
5 : got 56912 bytes
4 : got 65372 bytes
1 : got 76380 bytes
2 : got 1244518 bytes
threads completed
http://news.bbc.co.uk/1/hi/default.stm : 76380
http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html : 56912
http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm : 65372
file://localhost/c:/perl/html/index.html : 378
http://q-lang.sourceforge.net/qdoc.html : 1244518
Though if the list of URLs to fetch grows much beyond a dozen or so, you'd want a slightly more sophisticated version that meters the number of concurrent gets and re-uses the threads. That's also quite simple to write, but requires a little more thought.
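One way to sketch that metered version: a fixed pool of workers pulling URLs from a Thread::Queue, so thread count stays constant no matter how long the list gets. The worker count of 5 and the example URLs below are arbitrary stand-ins, not anything from the post above.

```perl
#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use LWP::Simple;

# Stand-in URL list; in practice this could be hundreds of entries.
my @urls = map "http://example.com/page$_", 1 .. 20;

my %results :shared;
my $q = Thread::Queue->new( @urls );
$q->enqueue( (undef) x 5 );    # one terminator per worker

# 5 workers, re-used for the whole list: at most 5 concurrent gets.
my @workers = map {
    async {
        while ( defined( my $url = $q->dequeue ) ) {
            my $content = get $url;
            $results{ $url } = $content if defined $content;
        }
    };
} 1 .. 5;

$_->join for @workers;
printf "%s : %d bytes\n", $_, length $results{ $_ } for keys %results;
```

The terminator trick (one `undef` per worker) is what lets each thread exit its dequeue loop cleanly once the real work is drained.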
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Well, you have multiple options. First, there is the possibility of forking. Second, you can launch a bunch of system commands in the background with &, then proceed once wait says they are done. Third, you can use asynchronous I/O, which is basically having multiple handles open, then using select to read from whichever is ready.
If you want to follow the last approach, there is a discussion of how to do it in raw code in perlipc. Life may become easier (at least once you've learned the library) if you use a library that is meant to support asynchronous I/O. CPAN has many such libraries, including POE, Event::Lib, and the new kid on the block, IO::Lambda. If you want to look at the last, you probably want to read the discussion at IO::Lambda: call for participation, and particularly my response at Re^2: regarding 1.02 (was Re: IO::Lambda: call for participation).
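The raw-select approach from perlipc boils down to something like the sketch below: register several handles with IO::Select and service whichever becomes readable, instead of blocking on each in turn. Child processes printing after a delay stand in for HTTP sockets here; this is an illustration of the multiplexing pattern, not a full fetcher.

```perl
#! perl -slw
use strict;
use warnings;
use IO::Select;

my $sel = IO::Select->new;

# Three slow "servers": each child sleeps, prints a reply, and exits.
for my $n ( 1 .. 3 ) {
    open my $fh, '-|', $^X, '-e', "sleep $n; print 'reply $n'"
        or die "open: $!";
    $sel->add( $fh );
}

# Service handles as they become ready, regardless of start order.
while ( $sel->count ) {
    for my $fh ( $sel->can_read ) {    # blocks until at least one is ready
        my $line = <$fh>;
        if ( defined $line ) {
            print "got: $line";
        }
        else {                         # EOF: this child is done
            $sel->remove( $fh );
            close $fh;
        }
    }
}
```

Total wall time is bounded by the slowest handle (about 3 seconds here), not the sum of all of them, which is the whole point of multiplexed I/O.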
Interesting... Fetching a page using the get method from LWP::Simple is a lot faster than what I was doing... I guess I should have taken the advice and gone with that in the first place.
Looks like the solution here is to re-write the whole thing to use LWP::Simple, use threads to do it all at once, and call it a day!
Thanks for the great help everyone.
By the way (Martin), is there a page/discussion somewhere about /why/ "The general advice is not to use a regex to parse/manipulate HTML/XML..."?
I'd be curious to know more about that. Basically all I'm doing is taking a few numbers and things like that out of the pages, not trying to do something to all the html tags. I'll check out those modules/pages anyway though.
Thanks!
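As a footnote on the regex question above: the usual objection is that perfectly legal HTML routinely defeats naive patterns, so patterns that work today quietly break on tomorrow's page. A tiny made-up illustration (both strings below contain one valid img tag):

```perl
#! perl -slw
use strict;
use warnings;

my $easy = '<img src="a.png">';
my $ugly = '<img alt="x > y" src="a.png">';   # legal: '>' inside an attribute

my ($src)  = $easy =~ /<img src="([^"]+)">/;  # works on the easy case
my ($src2) = $ugly =~ /<img src="([^"]+)">/;  # fails: extra attribute first
my ($src3) = $ugly =~ /src="([^"]+)"/;        # "fixed"... until src= shows up
                                              # inside a comment or script

print defined $_ ? $_ : 'no match' for $src, $src2, $src3;
```

For pulling a few numbers out of known, stable pages a regex is often fine in practice; the advice is really about robustness, and a parser module such as HTML::TreeBuilder handles attribute order, nesting, comments, and entities for you.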
It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.
The biggest win will be from fetching things in parallel, so you may be leaving this as an unsolved puzzle...
...in case it's still of interest, apart from the timing of packets the key thing to look at is the Window Size. If the receiver window size is small, then the sender may delay sending packets.
This is what Wireshark shows for a trivial example TCP session:
No Time Src Dest P Info
1 00.884 ..130 ..141 TCP ..4 > ..1 [SYN] Seq=0 Len=0 MSS=1460 WS=6
2 00.884 ..141 ..130 TCP ..1 > ..4 [SYN, ACK] Seq=0 Ack=1 Win=60984 Len=0 MSS=1452 WS=0
3 00.884 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=1 Ack=1 Win=5888 Len=0
4 00.892 ..130 ..141 TCP ..4 > ..1 [PSH, ACK] Seq=1 Ack=1 Win=5888 Len=87
5 00.893 ..130 ..141 TCP ..4 > ..1 [FIN, PSH, ACK] Seq=88 Ack=1 Win=5888 Len=90
6 00.893 ..141 ..130 TCP ..1 > ..4 [ACK] Seq=1 Ack=179 Win=60807 Len=0
7 01.000 ..141 ..130 TCP ..1 > ..4 [FIN, ACK] Seq=1 Ack=179 Win=60807 Len=0
8 01.000 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=179 Ack=2 Win=5888 Len=0
where I've edited the time, IP addresses and port numbers in the interests of conciseness. This shows:
..130 opening a TCP conversation with ..141, note the "Window Scaling" WS=6.
..141 acknowledging the TCP open, returning a "Window Size" of Win=60984 and specifying no "Window Scaling" WS=0.
..130 completing the TCP open, and returning a "Window Size" of Win=5888 -- which, given ..130's declared scaling of WS=6, means its Window Size is 5888 * 2**6 = 376832 (!)
etc. the rest of the TCP conversation -- noting that all Window Sizes returned by ..130 must be multiplied up by its declared Window Scaling.
The thing to look for is the Window Size being advertised by your machine. If this reduces to something small when using Perl, but not when using Lynx, then that may be the problem. Mind you, these days Window Sizes are pretty big!
The other thing to look for would be your machine being slower to acknowledge stuff when using Perl, or acknowledging smaller amounts each time than when using Lynx.
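If a small advertised window did turn out to be the culprit, one lever is the socket's receive buffer, from which the kernel derives the window. LWP doesn't expose this directly, so the sketch below shows it on a raw socket; the 256 KB figure is an arbitrary example, and the kernel is free to round or clamp the value.

```perl
#! perl -slw
use strict;
use warnings;
use Socket qw( SOL_SOCKET SO_RCVBUF );
use IO::Socket::INET;

# Create an unconnected TCP socket; the buffer must be set before connecting
# for it to influence the window advertised during the handshake.
my $sock = IO::Socket::INET->new( Proto => 'tcp' ) or die "socket: $!";

setsockopt( $sock, SOL_SOCKET, SO_RCVBUF, pack( 'l', 256 * 1024 ) )
    or die "setsockopt: $!";

# Read back what the kernel actually granted (often doubled or clamped).
my $got = unpack 'l', getsockopt( $sock, SOL_SOCKET, SO_RCVBUF );
print "receive buffer now $got bytes";
```

Comparing this read-back value between the Perl process and Lynx would confirm or rule out the small-window theory from the capture above.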