in reply to Re: My first socket program is SLOW?
in thread My first socket program is SLOW?

Thanks for the replies everyone... (I wish there was a button on here to reply to everyone at once ;)

It's not unbearably slow, but it seems to be slower than lynx, which is weird to me. Anyway, I've been watching the communications with Wireshark. I can't find anything that stands out, but then again I'm not an expert.

I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regexes) would be to perform the requests simultaneously rather than one after another. If that can be done, it would be plenty fast enough for me.

I've been looking all over for a Perl equivalent to putting "&" after a command in bash, but I can't find anything other than the fork function.

Basically I've got a function in a module I wrote called getPage($url) that fetches the page. Is there any way to background each getPage() call so they all run at once?

I would just do "lynx -dump $url &" but I'm trying to make this a portable script that doesn't require a bunch of Linux programs to function properly. I'm still pretty new to Perl, so I'm sure there must be some way I just don't know of.

Re^3: My first socket program is SLOW? (threads)
by BrowserUk (Patriarch) on Jan 15, 2009 at 09:46 UTC
    I'm starting to think the best solution (since I'm actually trying to fetch about 6 pages and then snatch stuff out of them with regex's) would be to perform the requests simultaneously rather than one after another.

    It's simple with threads:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use LWP::Simple;

    my @urls = qw[
        http://news.bbc.co.uk/1/hi/default.stm
        http://q-lang.sourceforge.net/qdoc.html
        file://localhost/c:/perl/html/index.html
        http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
        http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
    ];

    my @threads;
    my %results :shared;

    for my $url ( @urls ) {
        push @threads, async {
            printf "%d : fetching $url\n", threads->tid;
            $results{ $url } = get $url
                and printf "%d : got %d bytes\n", threads->tid, length $results{ $url }
                or warn "failed to get $url";
        };
    }

    print "Waiting for threads";
    $_->join for @threads;
    print "threads completed";

    ### process the content here.
    print "$_ : ", length $results{ $_ } for keys %results;

    Output:

    >t-get.pl
    1 : fetching http://news.bbc.co.uk/1/hi/default.stm
    2 : fetching http://q-lang.sourceforge.net/qdoc.html
    3 : fetching file://localhost/c:/perl/html/index.html
    Waiting for threads
    4 : fetching http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm
    5 : fetching http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html
    3 : got 378 bytes
    5 : got 56912 bytes
    4 : got 65372 bytes
    1 : got 76380 bytes
    2 : got 1244518 bytes
    threads completed
    http://news.bbc.co.uk/1/hi/default.stm : 76380
    http://www.ic.unicamp.br/~meidanis/courses/mc336/2006s2/funcional/L-99_Ninety-Nine_Lisp_Problems.html : 56912
    http://search.cpan.org/~karasik/IO-Lambda-1.02/lib/IO/Lambda.pm : 65372
    file://localhost/c:/perl/html/index.html : 378
    http://q-lang.sourceforge.net/qdoc.html : 1244518

    Though if the list of URLs to fetch grows much beyond a dozen or so, you'd want a slightly more sophisticated version that metered the number of concurrent gets and re-used the threads. That's also quite simple to write, but requires a little more thought.
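    For illustration, a minimal sketch of that metered, thread-re-using version using Thread::Queue. This is a sketch only: fetch_url is a stand-in for LWP::Simple's get (so the pool logic can be seen without network access), and the URL list and pool size are placeholders.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# Stand-in for LWP::Simple's get(), so the pool logic runs without a network.
sub fetch_url { my ( $url ) = @_; return "body of $url" }

my $POOL_SIZE = 4;                                          # cap on concurrent fetches
my @urls      = map "http://example.com/page$_", 1 .. 20;   # placeholder URLs

my $Q = Thread::Queue->new( @urls );
$Q->enqueue( ( undef ) x $POOL_SIZE );      # one terminator per worker

my %results :shared;

# A fixed number of workers each pull URLs from the shared queue until
# they dequeue the undef terminator.
my @workers = map {
    threads->create( sub {
        while ( defined( my $url = $Q->dequeue ) ) {
            $results{ $url } = fetch_url( $url );
        }
    } );
} 1 .. $POOL_SIZE;

$_->join for @workers;
print scalar( keys %results ), " pages fetched\n";
```

    The same four threads service all twenty URLs, so thread-creation cost is paid only $POOL_SIZE times rather than once per URL.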


Re^3: My first socket program is SLOW?
by tilly (Archbishop) on Jan 15, 2009 at 03:29 UTC
    Well, you have multiple options. First, there is the possibility of forking. Second, you can launch a bunch of system commands with &s and then proceed once wait says they are done. Third, you can use asynchronous IO, which basically means having multiple handles open and using select to read from whichever is ready.
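    The forking option can be sketched roughly like this. This is an assumption-laden sketch: getPage here is a trivial stand-in for the original poster's fetch routine, the URLs are placeholders, and writing each result to a temp file is just one way for children to hand data back to the parent (a pipe would also work).

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Stand-in for the poster's getPage(); a real one would use sockets or LWP.
sub getPage { my ( $url ) = @_; return "content of $url\n" }

my @urls = ( 'http://example.com/a', 'http://example.com/b' );  # placeholders
my $dir  = tempdir( CLEANUP => 1 );
my %kids;

for my $i ( 0 .. $#urls ) {
    defined( my $pid = fork ) or die "fork failed: $!";
    if ( $pid == 0 ) {                      # child: fetch, save, exit
        open my $fh, '>', "$dir/$i" or die "open: $!";
        print $fh getPage( $urls[ $i ] );
        close $fh;
        exit 0;
    }
    $kids{ $pid } = $i;                     # parent: remember the child
}

waitpid( $_, 0 ) for keys %kids;            # block until all fetches finish

# Slurp each child's result back in.
my @pages = map { local $/; open my $fh, '<', "$dir/$_" or die "open: $!"; <$fh> }
            0 .. $#urls;
print length( $pages[ $_ ] ), " bytes for $urls[ $_ ]\n" for 0 .. $#urls;
```

    All the fetches run concurrently in the children; the parent only blocks once, in the waitpid loop.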

    If you want to follow the last approach, there is a discussion of how to do it in raw code in perlipc. Life may become easier (at least once you've learned the library) if you use a library that is meant to support asynchronous IO. CPAN has many such libraries, including POE, Event::Lib, and the new kid on the block, IO::Lambda. If you want to look at that last one, you probably want to read the discussion at IO::Lambda: call for participation, and particularly my response at Re^2: regarding 1.02 (was Re: IO::Lambda: call for participation).

Re^3: My first socket program is SLOW?
by marto (Cardinal) on Jan 15, 2009 at 10:00 UTC

      Interesting... Fetching a page using the get method from LWP::Simple is a lot faster than what I was doing... I guess I should have taken the advice and gone with that in the first place.

      Looks like the solution here is to re-write the whole thing to use LWP::Simple, use threads to do it all at once, and call it a day!

      Thanks for the great help everyone.

      By the way (Martin), is there a page/discussion somewhere about /why/ "The general advice is not to use a regex to parse/manipulate HTML/XML..."?

      I'd be curious to know more about that. Basically all I'm doing is taking a few numbers and things like that out of the pages, not trying to do something to all the html tags. I'll check out those modules/pages anyway though.

      Thanks!
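    As a small illustration of the usual objection to regexes on HTML (not from the thread itself, just a made-up example): the same link can be written several legal ways, and a naive pattern only catches one of them.

```perl
use strict;
use warnings;

# The same link, written three valid ways, and a naive pattern that
# only matches the first form.
my @html = (
    '<a href="http://example.com">x</a>',
    "<a href='http://example.com'>x</a>",              # single quotes
    '<a class="nav" href="http://example.com">x</a>',  # extra attribute first
);

my $naive = qr{<a href="([^"]+)"};
my @hits  = grep { $_ =~ $naive } @html;
print scalar( @hits ), " of 3 matched\n";   # prints "1 of 3 matched"
```

    For pulling a few numbers out of pages you control, a regex may be fine in practice; the trouble starts when the markup varies in ways the pattern never anticipated, which is where a real parser earns its keep.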

Re^3: My first socket program is SLOW?
by gone2015 (Deacon) on Jan 15, 2009 at 23:53 UTC
    It's not unbearably slow but it seems to be slower than lynx which is weird to me. Anyway I've been watching the communications with wireshark. I can't find anything that stands out but then again I'm not an expert.

    The biggest win will be from fetching things in parallel, so you may be leaving this as an unsolved puzzle...

    ...in case it's still of interest, apart from the timing of packets the key thing to look at is the Window Size. If the receiver window size is small, then the sender may delay sending packets.

    This is what Wireshark shows for a trivial example TCP session:

      No Time     Src  Dest   P Info
       1 00.884 ..130 ..141 TCP ..4 > ..1 [SYN] Seq=0 Len=0 MSS=1460 WS=6
       2 00.884 ..141 ..130 TCP ..1 > ..4 [SYN, ACK] Seq=0 Ack=1 Win=60984 Len=0 MSS=1452 WS=0
       3 00.884 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=1 Ack=1 Win=5888 Len=0
       4 00.892 ..130 ..141 TCP ..4 > ..1 [PSH, ACK] Seq=1 Ack=1 Win=5888 Len=87
       5 00.893 ..130 ..141 TCP ..4 > ..1 [FIN, PSH, ACK] Seq=88 Ack=1 Win=5888 Len=90
       6 00.893 ..141 ..130 TCP ..1 > ..4 [ACK] Seq=1 Ack=179 Win=60807 Len=0
       7 01.000 ..141 ..130 TCP ..1 > ..4 [FIN, ACK] Seq=1 Ack=179 Win=60807 Len=0
       8 01.000 ..130 ..141 TCP ..4 > ..1 [ACK] Seq=179 Ack=2 Win=5888 Len=0
    
    where I've edited the time, IP addresses and port numbers in the interests of conciseness. This shows:

    1. ..130 opening a TCP conversation with ..141, note the "Window Scaling" WS=6.

    2. ..141 acknowledging the TCP open, returning a "Window Size" of Win=60984 and specifying no "Window Scaling" WS=0.

    3. ..130 completing the TCP open, and returning a "Window Size" of Win=5888 -- which, given ..130's declared scaling of WS=6, means its Window Size is 5888 * 2**6 = 376832 (!)

    4. etc. the rest of the TCP conversation -- noting that all Window Sizes returned by ..130 must be multiplied up by its declared Window Scaling.

    The thing to look for is the Window Size being advertised by your machine. If this drops to something small when using Perl, but not when using Lynx, then that may be the problem. Mind you, these days Window Sizes are pretty big!

    The other thing to look for would be your machine being slower to acknowledge stuff when using Perl, or acknowledging smaller amounts each time than when using Lynx.
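    If it helps to cross-check what Wireshark shows, the kernel derives the advertised window from the socket's receive buffer, which Perl can inspect via getsockopt. A minimal sketch (the buffer sizes printed are entirely system-dependent, and on Linux the kernel may adjust the value you set):

```perl
use strict;
use warnings;
use Socket qw( SOL_SOCKET SO_RCVBUF );
use IO::Socket::INET;

# Create an unconnected TCP socket and read back its receive buffer size,
# which is what the advertised TCP window is derived from.
my $sock = IO::Socket::INET->new( Proto => 'tcp' ) or die "socket: $!";

my $rcvbuf = unpack 'i', getsockopt( $sock, SOL_SOCKET, SO_RCVBUF );
print "receive buffer: $rcvbuf bytes\n";

# Optionally ask for a larger buffer (the OS may round or clamp this).
setsockopt( $sock, SOL_SOCKET, SO_RCVBUF, 1 << 16 )
    or warn "couldn't set SO_RCVBUF: $!";
```

    If Perl and Lynx end up with comparably sized buffers, the window size is probably not the culprit, and the acknowledgement timing is the next place to look.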