in reply to Re: Memory Use and/or efficiency passing scalars to subs
in thread Memory Use and/or efficiency passing scalars to subs

I am processing HTML and XML retrieved via LWP::Simple's get(), which returns the page contents to me in a scalar. I then process the $html/$xml, checking for changes and/or making changes as required.
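
The fetch-and-process loop, stripped down, looks roughly like this (the URL list and the munging step are just placeholders, not my real code):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @urls = ('http://www.example.com/page1.html');   # placeholder list

    for my $url (@urls) {
        my $html = get($url);       # the whole document comes back in one scalar
        unless (defined $html) {    # get() returns undef on any failure
            warn "fetch failed: $url\n";
            next;
        }
        # ... check for changes / make changes in $html here ...
    }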

So the scalars are not HUGE by today's standards. They are 40-64K on average, with some up to several hundred KB. The problem is more that I am working through hundreds to thousands of them.

Anyhow, one process I ran took over 13 hours to complete, which is hard to live with. So, I am looking to speed things up.

One of the things I went looking for (among others) was whether I was making unnecessary copies of data. Being new to Perl, I was not sure how arguments are passed to subroutines, i.e. by value or by reference (like pointers).

I found a statement in a book on Perl that says "When you pass scalars to subroutines they are passed by reference,... which acts like the address of the scalar." The book also says that arrays etc. are copied into @_.

Hmmm, I thought, I need to look into what's going on here, which is in part what prompted my questions. Thanks in advance for any insights.
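
For what it's worth, here is the little test I have been poking at to see for myself when a copy actually happens (the sub names are made up for the test):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Elements of @_ are aliases to the caller's variables, so reading
    # $_[0] does not copy the scalar; the copy only happens when you
    # assign it to a lexical.
    sub length_via_alias {
        return length $_[0];          # works on the alias, no copy
    }

    sub length_via_copy {
        my ($page) = @_;              # this assignment copies the scalar
        return length $page;
    }

    sub length_via_ref {
        my $ref = shift;              # explicit reference, no copy either
        return length $$ref;
    }

    my $html = 'x' x 65_536;          # a 64K scalar, like my average page
    print length_via_alias($html), "\n";
    print length_via_copy($html),  "\n";
    print length_via_ref(\$html),  "\n";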

Re: Re: Re: Memory Use and/or efficiency passing scalars to subs
by tachyon (Chancellor) on Aug 31, 2003 at 07:08 UTC

    So, I am looking to speed things up.

    You are almost certainly barking up the wrong tree.

    You seem to be assuming that passing/copying large scalars makes much difference to runtime. Memory use, sure. Runtime, not really; it only matters if you end up in swap.
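
    A quick (and entirely artificial) comparison makes the point. Copying a 64K scalar does cost more than passing a reference, but the rates show it is microseconds per call - noise next to a single network fetch. A rough sketch:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $html = 'x' x 65_536;    # a 64K scalar like the pages described

        sub by_copy { my ($page) = @_;  return length $page }   # copies the scalar
        sub by_ref  { my $ref = shift;  return length $$ref }   # no copy

        cmpthese( -2, {
            copy => sub { by_copy($html) },
            ref  => sub { by_ref(\$html) },
        } );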

    I will almost guarantee you that 99% of your runtime is spent in LWP - getting (i.e. waiting for) the data.

    I would suggest Benchmarking before you try to optimise an area that probably has nothing at all to do with your speed issue.
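
    Even something as crude as wrapping the fetch and the munge in separate timers will tell you where the hours are really going. A sketch (the URL list and the munge step are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use Time::HiRes qw(time);

        my @urls = ('http://www.example.com/page1.html');   # placeholder list
        my ( $get_time, $munge_time ) = ( 0, 0 );

        for my $url (@urls) {
            my $t0   = time;
            my $html = get($url);
            $get_time += time - $t0;
            next unless defined $html;

            $t0 = time;
            # ... the existing munging code goes here ...
            $munge_time += time - $t0;
        }

        printf "GET: %.2fs  MUNGE: %.2fs\n", $get_time, $munge_time;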

    Assuming that I am right, the easiest practical solution is to split your code into GET and MUNGE units - this also makes the Benchmark a breeze. Anyway, you will typically want to run 10-100 parallel LWP agents to pull down data as fast as your bandwidth/the target servers will deliver it. LWP::Parallel::UserAgent will probably be a lot more useful than LWP::Simple. Note: don't accidentally implement a DOS attack on your target servers. First, it is not nice. Second, some firewall implementations will lock your agent/IP out if you hit it too hard.
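
    The usual shape with LWP::Parallel::UserAgent is: register all the requests, then collect the responses. The method names below are from its docs as I remember them, so treat this as a sketch, not gospel:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Parallel::UserAgent;
        use HTTP::Request;

        my @urls = ('http://www.example.com/page1.html');   # placeholder list

        my $pua = LWP::Parallel::UserAgent->new;
        $pua->max_hosts(10);    # how many servers to talk to at once
        $pua->max_req(2);       # requests in flight per server - keep this low
        $pua->timeout(30);

        $pua->register( HTTP::Request->new( GET => $_ ) ) for @urls;

        my $entries = $pua->wait;    # block until everything is back

        for my $entry ( values %$entries ) {
            my $res = $entry->response;
            if ( $res->is_success ) {
                my $html = $res->content;
                # ... munge $html ...
            }
            else {
                warn $res->request->uri, ': ', $res->status_line, "\n";
            }
        }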

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Yes, that thought occurred to me not too long after my original post.

      I noticed I was not always receiving what I expected from get() on any given address. After taking a closer look I saw that I was getting DNS errors, timeouts from the site itself, etc. So I started logging these, and it soon became apparent that at least part (if not most) of the problem with taking too long was related to the get() operations and/or just accessing the site itself.
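
      One thing I am trying for that logging, since get() only returns undef and swallows the reason, is the full LWP::UserAgent so the status line can be recorded. Roughly like this (the URL is a placeholder):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use LWP::UserAgent;

          my $ua = LWP::UserAgent->new;
          $ua->timeout(30);

          my $res = $ua->get('http://www.example.com/page1.html');   # placeholder
          if ( $res->is_success ) {
              my $html = $res->content;
              # ... munge $html as before ...
          }
          else {
              # status_line gives e.g. "500 Can't connect ..." or "408 Request Timeout"
              warn 'failed: ', $res->status_line, "\n";
          }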

      I certainly appreciate the tip and advice on using LWP::Parallel::UserAgent. I will look into it. I have so much to learn about Perl; what started as some simple scripting is growing rapidly, as does my sense of confusion from time to time ;-)
      I am sure I will run down the wrong rat-hole many more times!

      Also, regarding the DOS attack issue: that is certainly something I DO NOT want to do; the script MUST play nice. Currently, I really don't want to hit the servers more than a couple of times a second.
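
      What I have in mind is nothing fancier than a per-host pause along these lines (the half-second gap is just my guess at "a couple of times a second"):

          use strict;
          use warnings;
          use Time::HiRes qw(time sleep);
          use URI;

          my $min_gap = 0.5;    # seconds between hits to any one host
          my %last_hit;         # host => time of the previous request

          sub polite_pause {
              my ($url) = @_;
              my $host  = URI->new($url)->host;
              my $since = time - ( $last_hit{$host} || 0 );
              sleep( $min_gap - $since ) if $since < $min_gap;
              $last_hit{$host} = time;
          }

          # call polite_pause($url) immediately before each fetch of $url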

      Any tips on how I can monitor how hard I'm hitting a server?

      BTW: I am "reasonably" certain that the timeout messages I received from the servers were not from me hitting them too hard. My sanity check was to stop the script, bring up IE and enter the URL. I got the same message "Timeout on your request" or something like that. After waiting a few minutes it worked. So, I am just assuming the server itself was too busy.

      Does that seem reasonable?

      Thanks

        Sorry, I posted this Anonymously by mistake.