in reply to Re: Re: Memory Use and/or efficiency passing scalars to subs
in thread Memory Use and/or efficiency passing scalars to subs

So, I am looking to speed things up.

You are almost certainly barking up the wrong tree.

You seem to be assuming that passing/copying large scalars makes much difference to runtime. It matters for memory use, sure, but not really for runtime, unless the extra copies push you into swap.

I will almost guarantee you that 99% of your runtime is spent in LWP, getting (i.e. waiting for) the data.

I would suggest benchmarking before you try to optimise an area that probably has nothing at all to do with your speed issue.
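For what it is worth, here is a minimal Benchmark sketch (illustrative only, not your code) comparing a big scalar passed by copy and by reference. On most boxes the difference will be dwarfed by network wait time.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $big = 'x' x 1_000_000;    # a 1MB scalar

    sub by_copy { my ($data) = @_; return length $data }    # copies the scalar
    sub by_ref  { my ($ref)  = @_; return length $$ref }    # no copy

    # run each sub for ~2 CPU seconds and compare rates
    cmpthese( -2, {
        copy => sub { by_copy($big) },
        ref  => sub { by_ref(\$big) },
    });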

Assuming that I am right, the easiest practical solution is to split your code into GET and MUNGE units; this also makes the Benchmark a breeze. You will typically want to run 10-100 parallel LWP agents to pull down data as fast as your bandwidth and the target servers will deliver it. LWP::Parallel::UserAgent will probably be a lot more useful than LWP::Simple, as sketched below. Note: don't accidentally implement a DoS attack on your target servers. First, it is not nice. Second, some firewall implementations will lock your agent/IP out if you hit a server too hard.
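Something along these lines should do as a starting point. This is only a sketch based on the LWP::Parallel::UserAgent docs; the URL list and the limits are placeholders.

    use strict;
    use warnings;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    my @urls = qw( http://www.example.com/ http://www.example.org/ );

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->timeout(10);      # per-connection timeout, in seconds
    $pua->redirect(1);      # follow redirects
    $pua->max_hosts(10);    # max number of hosts to query in parallel
    $pua->max_req(5);       # max parallel requests per host

    # register() returns undef on success, or an error response
    for my $url (@urls) {
        if ( my $err = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
            warn $err->error_as_HTML;
        }
    }

    # wait() blocks until all registered requests are done (or timed out)
    my $entries = $pua->wait();
    for my $key ( keys %$entries ) {
        my $res = $entries->{$key}->response;
        print $res->request->url, ' : ', $res->code, "\n";
        # hand $res->content to your MUNGE unit here
    }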

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


Re: Re: Re: Re: Memory Use and/or efficiency passing scalars to subs
by Anonymous Monk on Aug 31, 2003 at 15:04 UTC
    Yes, that thought occurred to me not too long after my original post.

    I noticed I was not always receiving what I expected from get() on any given address. After taking a closer look I saw that I was getting DNS errors, timeouts from the site itself, etc. So, I started logging these, and it soon became apparent that at least part (if not most) of the slowness was related to the get() operations and/or just accessing the site itself.

    I certainly appreciate the tip and advice on using LWP::Parallel::UserAgent. I will look into it. I have so much to learn about Perl; what started as some simple scripting is growing rapidly, as is my sense of confusion from time to time ;-)
    I am sure I will run down the wrong rat-hole many more times!

    Also, regarding the DoS attack issue: that is certainly something I DO NOT want to do; the script MUST play nice. Currently, I really don't want to hit the servers more than a couple of times a second.

    Any tips on how I can monitor how hard I'm hitting a server?

    BTW: I am "reasonably" certain that the timeout messages I received from the servers were not from me hitting them too hard. My sanity check was to stop the script, bring up IE, and enter the URL. I got the same message, "Timeout on your request" or something like that. After waiting a few minutes it worked. So, I am just assuming the server itself was too busy.

    Does that seem reasonable?

    Thanks

      Sorry, I posted this Anonymously by mistake.

        Does this sound reasonable

        No. Assuming you are trying to mine data from big-name sites like Google/Yahoo/Currency Conversion/Web Filter... they have lots of fast servers and never time out. In fact, servers issue a 503 Service Unavailable error to indicate they are overloaded, not a 408 Request Timeout. The behaviour you describe is indicative of being locked out, i.e. works / stops working / wait / works again. The fact that you get the same behaviour in your browser indicates you are being locked out based on IP address (easier to fake) or MAC address (which needs lots of network cards in your box).

        This is not a tutorial on how to 'hack' such sites. However, the following info may be useful. Your requesting agent sends a string (the Agent string) that in theory says what browser is making the request. You want to pretend to be IE, and LWP lets you fake this: get some plausible strings and fake it. This will only help with really dumb firewalls (but there are some :-)
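        A minimal sketch, for example (the agent string shown is just one plausible IE string; collect real ones from your server logs):

            use LWP::UserAgent;
            use HTTP::Request;

            my $ua = LWP::UserAgent->new;
            # claim to be IE6 on Windows XP
            $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
            my $res = $ua->request( HTTP::Request->new( GET => 'http://www.example.com/' ) );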

        If you have multiple target sites then do a round robin: hit site A, site B, site C... with a given Agent string, change your fake agent, then go through sites A, B, C... again. This keeps your hit rate per site low but you still get throughput.

        You probably want to add cookie support to your agent, as not supporting cookies is an indication that you are not really using IE. Once again, YMMV.
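        Adding a cookie jar is a one-liner. A sketch, assuming the $ua from the snippet above; HTTP::Cookies ships with libwww-perl:

            use HTTP::Cookies;

            # keep cookies across requests and save them to disk between runs
            $ua->cookie_jar( HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ) );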

        If you have an IP lockout then you need to come from multiple IPs. One box can have lotsa IPs it listens to with a single network card. Think promiscuous mode.

        Any firewall worth its salt will be working from the MAC address of the network card (which is unique), so you can only really circumvent these using multiple network cards. That is not entirely true, but you would need to know a hell of a lot about TCP/IP to prove it wrong and take advantage of it. A multicard solution may not be an option for you, but it works well: you get multiple IPs and multiple MACs, which you can combine with multiple Agent strings to make it look like your queries are coming from all over the Net.

        At its simplest, just add the Agent string, establish the maximum hit number/rate before lockout, and code accordingly. You can shorten the timeout right down and push failures onto an @later list. If, for example, you need to hit 10 servers with 100 requests each, and you are happy to wait say 1 hour for your data, then you need to make 1 request every 3.6 seconds. If you round robin you will hit each server once every 36 seconds; this is a reasonable rate for a real user, so it is unlikely to trigger the lockout behaviour. To do this just sleep a few seconds between requests. It is better to sleep 3 seconds than to hit a 180 second (the default) LWP timeout and still get no data. If you have a server with decent connectivity and are hitting big sites you will get your data in 500-1000 msec on average.
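        Pulling that together, here is a minimal sketch. The server list, URL layout, and the process() munge hook are all made up for illustration; plug in your own.

            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new;
            $ua->timeout(10);    # way down from the 180 second default
            $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');

            # hypothetical target list - 10 servers, 100 pages each
            my @servers = map { "http://www$_.example.com" } 1 .. 10;
            my @later;           # failed requests queue up here for a retry pass

            for my $page ( 1 .. 100 ) {
                for my $server (@servers) {    # round robin across the servers
                    my $url = "$server/data?page=$page";
                    my $res = $ua->get($url);
                    if ( $res->is_success ) {
                        process( $res->content );    # your MUNGE unit
                    }
                    else {
                        push @later, $url;    # don't block - come back to it later
                    }
                    sleep 3;    # ~3.6s per request => each server hit every ~36s
                }
            }

            # stand-in for your real munging code
            sub process { my ($content) = @_; print length($content), " bytes\n" }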

        Good luck.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print