#! perl -slw
use strict;
use threads;
use threads::shared;
use LWP::Simple;

my @urls = qw[
    http://news.bbc.co.uk/1/hi/default.stm
    http://q-lang.sourceforge.net/qdoc.html
    file://localhost/c:/perl/html/index.html
    http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
    http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
];

my @threads;
my %results :shared;

for my $url ( @urls ) {
    push @threads, async {
        printf "%d : fetching $url\n", threads->tid;
        $results{ $url } = get $url
            and printf "%d : got %d bytes\n",
                threads->tid, length $results{ $url }
            or warn "failed to get $url";
    };
}

print "Waiting for threads";
$_->join for @threads;
print "threads completed";

### process the content here.
print "$_ : ", length $results{ $_ } for keys %results;
Output: >t-get.pl
1 : fetching http://news.bbc.co.uk/1/hi/default.stm
2 : fetching http://q-lang.sourceforge.net/qdoc.html
3 : fetching file://localhost/c:/perl/html/index.html
Waiting for threads
4 : fetching http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
5 : fetching http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
3 : got 378 bytes
5 : got 56912 bytes
4 : got 65372 bytes
1 : got 76380 bytes
2 : got 1244518 bytes
threads completed
http://news.bbc.co.uk/1/hi/default.stm : 76380
http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html : 56912
http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm : 65372
file://localhost/c:/perl/html/index.html : 378
http://q-lang.sourceforge.net/qdoc.html : 1244518
Though if the list of URLs to fetch grows much beyond a dozen or so, you'd want a slightly more sophisticated version that meters the number of concurrent gets and re-uses the threads. That's also quite simple to write, but requires a little more thought.
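One way to sketch that metered version: a fixed pool of workers pulling URLs from a Thread::Queue, so thread count stays constant no matter how long the list gets. The worker count of 5 and the example URLs below are arbitrary stand-ins, not anything from the post above.

```perl
#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use LWP::Simple;

# Stand-in URL list; in practice this could be hundreds of entries.
my @urls = map "http://example.com/page$_", 1 .. 20;

my %results :shared;
my $q = Thread::Queue->new( @urls );
$q->enqueue( (undef) x 5 );    # one terminator per worker

# 5 workers, re-used for the whole list: at most 5 concurrent gets.
my @workers = map {
    async {
        while ( defined( my $url = $q->dequeue ) ) {
            my $content = get $url;
            $results{ $url } = $content if defined $content;
        }
    };
} 1 .. 5;

$_->join for @workers;
printf "%s : %d bytes\n", $_, length $results{ $_ } for keys %results;
```

The terminator trick (one `undef` per worker) is what lets each thread exit its dequeue loop cleanly once the real work is drained.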
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Well, you have multiple options. First, there is the possibility of forking. Second, you can launch a bunch of system commands in the background with &, then proceed once wait says they are done. Third, you can use asynchronous I/O, which is basically having multiple handles open, then using select to read from whichever is ready.
If you want to follow the last approach, there is a discussion of how to do it in raw code in perlipc. Life may become easier (at least once you've learned the library) if you use a library that is meant to support asynchronous I/O. CPAN has many such libraries, including POE, Event::Lib, and the new kid on the block, IO::Lambda. If you want to look at the last, you probably want to read the discussion at IO::Lambda: call for participation, and particularly my response at Re^2: regarding 1.02 (was Re: IO::Lambda: call for participation).
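The raw-select approach from perlipc boils down to something like the sketch below: register several handles with IO::Select and service whichever becomes readable, instead of blocking on each in turn. Child processes printing after a delay stand in for HTTP sockets here; this is an illustration of the multiplexing pattern, not a full fetcher.

```perl
#! perl -slw
use strict;
use warnings;
use IO::Select;

my $sel = IO::Select->new;

# Three slow "servers": each child sleeps, prints a reply, and exits.
for my $n ( 1 .. 3 ) {
    open my $fh, '-|', $^X, '-e', "sleep $n; print 'reply $n'"
        or die "open: $!";
    $sel->add( $fh );
}

# Service handles as they become ready, regardless of start order.
while ( $sel->count ) {
    for my $fh ( $sel->can_read ) {    # blocks until at least one is ready
        my $line = <$fh>;
        if ( defined $line ) {
            print "got: $line";
        }
        else {                         # EOF: this child is done
            $sel->remove( $fh );
            close $fh;
        }
    }
}
```

Total wall time is bounded by the slowest handle (about 3 seconds here), not the sum of all of them, which is the whole point of multiplexed I/O.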
Interesting... Fetching a page using the get method from LWP::Simple is a lot faster than what I was doing... I guess I should have taken the advice and gone with that in the first place.
Looks like the solution here is to re-write the whole thing to use LWP::Simple, use threads to do it all at once, and call it a day!
Thanks for the great help everyone.
By the way (Martin), is there a page/discussion somewhere about /why/ "The general advice is not to use a regex to parse/manipulate HTML/XML..."?
I'd be curious to know more about that. Basically all I'm doing is taking a few numbers and things like that out of the pages, not trying to do something to all the html tags. I'll check out those modules/pages anyway though.
Thanks!
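As a footnote on the regex question above: the usual objection is that perfectly legal HTML routinely defeats naive patterns, so patterns that work today quietly break on tomorrow's page. A tiny made-up illustration (both strings below contain one valid img tag):

```perl
#! perl -slw
use strict;
use warnings;

my $easy = '<img src="a.png">';
my $ugly = '<img alt="x > y" src="a.png">';   # legal: '>' inside an attribute

my ($src)  = $easy =~ /<img src="([^"]+)">/;  # works on the easy case
my ($src2) = $ugly =~ /<img src="([^"]+)">/;  # fails: extra attribute first
my ($src3) = $ugly =~ /src="([^"]+)"/;        # "fixed"... until src= shows up
                                              # inside a comment or script

print defined $_ ? $_ : 'no match' for $src, $src2, $src3;
```

For pulling a few numbers out of known, stable pages a regex is often fine in practice; the advice is really about robustness, and a parser module such as HTML::TreeBuilder handles attribute order, nesting, comments, and entities for you.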
It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.
The biggest win will be from fetching things in parallel, so you may be leaving this as an unsolved puzzle...
...in case it's still of interest, apart from the timing of packets the key thing to look at is the Window Size. If the receiver window size is small, then the sender may delay sending packets.
This is what Wireshark shows for a trivial example TCP session:
No Time Src Dest P Info
1 00.884 ..130 ..141 TCP ..4 > ..1 [SYN] Seq=0 Len=0 MSS=1460 WS=6
2 00.884 ..141 ..130 TCP ..1 > ..4 [SYN, ACK] Seq=0 Ack=1 Win=60984 Len=0 MSS=1452 WS=0
3 00.884 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=1 Ack=1 Win=5888 Len=0
4 00.892 ..130 ..141 TCP ..4 > ..1 [PSH, ACK] Seq=1 Ack=1 Win=5888 Len=87
5 00.893 ..130 ..141 TCP ..4 > ..1 [FIN, PSH, ACK] Seq=88 Ack=1 Win=5888 Len=90
6 00.893 ..141 ..130 TCP ..1 > ..4 [ACK] Seq=1 Ack=179 Win=60807 Len=0
7 01.000 ..141 ..130 TCP ..1 > ..4 [FIN, ACK] Seq=1 Ack=179 Win=60807 Len=0
8 01.000 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=179 Ack=2 Win=5888 Len=0
where I've edited the time, IP addresses and port numbers in the interests of conciseness. This shows:
..130 opening a TCP conversation with ..141, note the "Window Scaling" WS=6.
..141 acknowledging the TCP open, returning a "Window Size" of Win=60984 and specifying no "Window Scaling" WS=0.
..130 completing the TCP open, and returning a "Window Size" of Win=5888 -- which, given ..130's declared scaling of WS=6, means its Window Size is 5888 * 2**6 = 376832 (!)
etc. the rest of the TCP conversation -- noting that all Window Sizes returned by ..130 must be multiplied up by its declared Window Scaling.
The thing to look for is the Window Size being advertised by your machine. If this reduces to something small when using Perl, but not when using Lynx, then that may be the problem. Mind you, these days Window Sizes are pretty big!
The other thing to look for would be your machine being slower to acknowledge stuff when using Perl, or acknowledging smaller amounts each time than when using Lynx.
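If a small advertised window did turn out to be the culprit, one lever is the socket's receive buffer, from which the kernel derives the window. LWP doesn't expose this directly, so the sketch below shows it on a raw socket; the 256 KB figure is an arbitrary example, and the kernel is free to round or clamp the value.

```perl
#! perl -slw
use strict;
use warnings;
use Socket qw( SOL_SOCKET SO_RCVBUF );
use IO::Socket::INET;

# Create an unconnected TCP socket; the buffer must be set before connecting
# for it to influence the window advertised during the handshake.
my $sock = IO::Socket::INET->new( Proto => 'tcp' ) or die "socket: $!";

setsockopt( $sock, SOL_SOCKET, SO_RCVBUF, pack( 'l', 256 * 1024 ) )
    or die "setsockopt: $!";

# Read back what the kernel actually granted (often doubled or clamped).
my $got = unpack 'l', getsockopt( $sock, SOL_SOCKET, SO_RCVBUF );
print "receive buffer now $got bytes";
```

Comparing this read-back value between the Perl process and Lynx would confirm or rule out the small-window theory from the capture above.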