in reply to most efficient way to scrape data and put it into a tsv file

10 variables per iteration is nothing. In addition, Perl buffers I/O by default, so there'd be little difference even for large data sets (in which case the array method would consume more memory).
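For illustration, a minimal sketch of printing one TSV row per pass; the filename, @rows, and extract_fields are made-up stand-ins, not code from the thread:

    open my $out, '>', 'data.tsv' or die "open: $!";
    for my $row (@rows) {
        my @fields = extract_fields($row);        # hypothetical: yields the 10 values
        print {$out} join("\t", @fields), "\n";   # Perl buffers this write for you
    }
    close $out or die "close: $!";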

Additionally, the network I/O will probably dominate your runtime by a few orders of magnitude, so even if you managed to make a significant local difference with these micro-optimizations, the script still wouldn't run appreciably faster.

That said, I prefer to separate my code into phases where I fetch input, munge data or do calculations, and produce output, simply because it helps separation of concerns and therefore makes the code more maintainable.
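Something along these lines, where @urls, get_page, and parse_page are placeholders for whatever your script actually does:

    my @pages   = map { get_page($_)   } @urls;   # phase 1: fetch input
    my @records = map { parse_page($_) } @pages;  # phase 2: munge data (one arrayref per record)
    print join("\t", @$_), "\n" for @records;     # phase 3: produce output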

Makeshifts last the longest.


Re^2: most efficient way to scrape data and put it into a tsv file
by dchandler (Sexton) on Aug 30, 2004 at 05:03 UTC
    What do you mean that Perl buffers I/O automatically? Also, if I have an option that puts a dot on the screen each time I successfully grab a page, and I use $| = 1, what am I doing? I've used $| = 1 so that the dots come on the screen as I like. Am I harming performance by doing this?

      Yes, setting $| to a true value turns on autoflush for the currently selected filehandle (normally STDOUT), so output is no longer buffered. If you do that to produce dots as you go, you're paying a penalty of maybe a few microseconds per dot. Considering that establishing a connection and pulling a page over HTTP will take anywhere from a few milliseconds to several seconds, I don't see why you should even care about your prints.
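      In other words, something like this is perfectly fine; fetch() here stands in for whatever does the HTTP work:

          $| = 1;                      # autoflush STDOUT so each dot appears immediately
          for my $url (@urls) {
              my $page = fetch($url);  # the network round trip dwarfs the print below
              print '.';               # progress dot
          }
          print "\n";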

      Performance should be your very last concern, and one to address only if you have found that your code is actually too slow in practice. Even then, you don't start guessing at what could be made faster: you profile the code and see where it actually spends its time. If a script takes 3 minutes to run and you speed up a random part of it by a factor of ten, that doesn't help much if that part only accounted for 1 second of the runtime anyway. You're now down from 3:00 minutes to 2:59.1 minutes.
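      If you do want numbers once a profiler has pointed somewhere, the core Benchmark module can compare candidate rewrites. A sketch; the /dev/null handle and dummy fields are assumptions for the sake of the example:

          use Benchmark qw(cmpthese);

          open my $null, '>', '/dev/null' or die "open: $!";
          my @fields = ('x') x 10;               # stand-in for the ten scraped values

          cmpthese(-2, {                         # run each variant for at least 2 CPU seconds
              per_iteration => sub {
                  print {$null} join("\t", @fields), "\n" for 1 .. 1_000;
              },
              accumulate => sub {
                  my @buffer;
                  push @buffer, join("\t", @fields) for 1 .. 1_000;
                  print {$null} "$_\n" for @buffer;
              },
          });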

      The network I/O is going to take so much more time in your script than anything else it does that no optimization is going to matter. If you want it to go faster, make your downloads go faster. If you can't do that, don't bother changing anything.
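      If the downloads really are the bottleneck, fetching several pages at once is the only lever that moves anything. A hedged sketch using the CPAN module Parallel::ForkManager, which is not something this thread itself uses:

          use LWP::Simple qw(get);
          use Parallel::ForkManager;

          my $pm = Parallel::ForkManager->new(5);   # at most 5 concurrent fetches
          for my $url (@urls) {
              $pm->start and next;                  # parent: fork a child, move on
              my $page = get($url);                 # child: do one download
              # ... parse $page and write its rows to a per-child file here
              $pm->finish;                          # child exits
          }
          $pm->wait_all_children;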

      Makeshifts last the longest.

      Perl buffers I/O by default. Files are block-buffered; STDOUT is line-buffered when connected to a terminal and block-buffered when redirected; STDERR is unbuffered. This means Perl waits until it sees the end of a line, or until the buffer fills, before doing an operating system write.

      Are you writing your output to STDOUT or to a file? It sounded like the latter. Setting $| = 1 only unbuffers STDOUT (or whichever filehandle is currently selected). It is possible to unbuffer file handles too, but this is mainly used for logs where each line should be written out independently.
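      If you ever do need an unbuffered file handle, something like this works; IO::Handle is a core module, and the filename is made up:

          use IO::Handle;

          open my $log, '>', 'scrape.log' or die "open: $!";
          $log->autoflush(1);   # unbuffer just this handle

          # the classic idiom without IO::Handle does the same thing:
          # select((select($log), $| = 1)[0]);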