
Sorry, I posted this Anonymously by mistake.

Re: Re: Re: Re: Re: Re: Memory Use and/or efficiency passing scalars to subs
by tachyon (Chancellor) on Sep 01, 2003 at 00:42 UTC

    Does this sound reasonable

    No. Assuming you are trying to mine data from big name sites like Google/Yahoo/Currency Conversion/Web Filter... they have lots of fast servers and never time out. In fact servers issue a 503 Service Unavailable error to indicate they are overloaded, not a 408 Request Timeout. The behaviour you describe (works, stops working, wait, works again) is indicative of being locked out. The fact that you get the same behaviour in your browser indicates you are being locked out based on IP address (easier to fake) or MAC address (need lots of network cards in your box).

    This is not a tutorial on how to 'hack' such sites. However the following info may be useful. Your requesting agent sends a string (the User-Agent header, aka the Agent string) that in theory says what browser is making the request. You want to pretend to be IE. LWP lets you fake this. Get some real browser strings and send those instead of the LWP default. This will only help with really dumb firewalls (but there are some :-)
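
    A minimal sketch of setting the Agent string with LWP::UserAgent. The string below is only an illustrative IE-style value (an assumption, not something from your logs); grab a real one from your own browser.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Illustrative IE-style string (assumed value); replace it with one
# captured from a real browser.
my $agent_string = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';

my $ua = LWP::UserAgent->new;
$ua->agent($agent_string);    # sent as the User-Agent header on every request

print "Requests will identify as: ", $ua->agent, "\n";
```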

    If you have multiple target sites then do a round robin, so you do site A, site B, site C... with a given Agent string, change your fake agent, then go through sites A, B, C... again. This keeps your hit rate per site low but you still get throughput.
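
    The round robin could be planned along these lines. This is just a sketch: the site URLs, the agent strings and the name round_robin_plan are all made up for illustration. It only builds the visit order, it does not fetch anything.

```perl
use strict;
use warnings;

# Placeholder targets and agent strings (assumptions for illustration).
my @sites  = ('http://site-a.example/', 'http://site-b.example/', 'http://site-c.example/');
my @agents = ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)');

# Build the visit order: one full pass over every site per agent string,
# cycling through the agent strings from pass to pass.
sub round_robin_plan {
    my ($sites, $agents, $passes) = @_;
    my @plan;
    for my $pass (0 .. $passes - 1) {
        my $agent = $agents->[ $pass % @$agents ];
        push @plan, [ $_, $agent ] for @$sites;
    }
    return @plan;
}

my @plan = round_robin_plan(\@sites, \@agents, 3);
printf "%-25s %s\n", @$_ for @plan;
```

    Each entry is a [url, agent] pair, so the fetching loop just sets the agent and requests the URL in that order.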

    You probably want to add cookie support to your agent, as not supporting cookies is an indication you are not really using IE. Once again YMMV.
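
    Cookie support is a one liner in LWP. A minimal sketch, assuming an in-memory jar is enough (cookies are forgotten when the script exits):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Passing an empty hash ref as cookie_jar makes LWP create an in-memory
# HTTP::Cookies jar, so cookies set by a site are sent back on later requests.
my $ua = LWP::UserAgent->new( cookie_jar => {} );

print "Cookie jar class: ", ref( $ua->cookie_jar ), "\n";
```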

    If you have an IP lockout then you need to come from multiple IPs. One box can have lotsa IPs it listens to with a single network card. Think promiscuous mode.

    Any firewall worth its salt will be running based on the MAC address of the network card (unique) so you can only really circumvent these using multiple network cards. Not entirely true, but you would need to know a hell of a lot about TCP/IP to prove that wrong and take advantage of it. A multicard solution may not be an option for you but works well, as you get multiple IPs and multiple MACs which you can combine with multiple Agent strings to make it look like your queries are coming from all over the Net.

    At its simplest just add the Agent string, establish the max hit number/rate before lockout and code accordingly. You can shorten the timeout right down and push failures onto an @later list. If for example you need to hit 10 servers with 100 requests each, and you are happy to wait for say 1 hour to get your data, then you need to do 1 request every 3.6 seconds (3600 seconds / 1000 requests). If you round robin you will hit each server once every 36 seconds. This is a reasonable rate for a real user so it is unlikely to trigger the lockout behaviour. To do this just sleep a few seconds between requests. It is better to sleep 3 seconds than hit a 180 second (the default) LWP timeout and still get no data. If you have a server with decent connectivity and are hitting big sites you will get your data in 500-1000 msec on average.
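
    Roughly, the sleep-and-retry-later idea looks like this. Treat it as a sketch under assumptions: fetch_throttled is a made-up helper name, the 10 second timeout is an arbitrary pick, and Time::HiRes is used only so that a fractional sleep like 3.6 seconds is honoured.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(sleep);    # the built-in sleep would truncate 3.6 to 3

# Numbers from the example above: 10 servers, 100 requests each, 1 hour.
my ( $servers, $requests_each, $window ) = ( 10, 100, 3600 );
my $interval       = $window / ( $servers * $requests_each );  # gap between requests
my $per_server_gap = $window / $requests_each;                 # gap per server when round robining

# Fetch each URL in turn, pausing between requests. Failures are not
# retried on the spot; they are collected so they can be retried later.
sub fetch_throttled {
    my ( $ua, $urls, $pause ) = @_;
    my ( %content, @later );
    for my $url (@$urls) {
        my $res = $ua->get($url);
        if   ( $res->is_success ) { $content{$url} = $res->decoded_content }
        else                      { push @later, $url }
        sleep $pause;
    }
    return ( \%content, \@later );
}

# Short timeout (assumed value) instead of the 180 second default.
my $ua = LWP::UserAgent->new( timeout => 10 );

printf "Sleep %.1f s between requests, about %d s between hits on any one server\n",
    $interval, $per_server_gap;

# Usage, with @urls being your own round robin ordered list:
# my ( $got, $later ) = fetch_throttled( $ua, \@urls, $interval );
```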

    Good luck.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print