dchandler has asked for the wisdom of the Perl Monks concerning the following question:

Hello. Recently I've been downloading lots of data from various websites. I've been trying to be kind to the sites by inserting very generous pauses between requests so I don't eat up their bandwidth. My question is about how to most efficiently use the time when I'm not sending requests to the website.

Each webpage I access has a single observation with approximately 10 variables. My current script goes like so:

1) Get the source of a webpage.
2) Use regular expressions to extract the relevant values for the 10 variables.
3) After extracting EACH variable, write it immediately to a text file.
4) Repeat. (A rough sketch of this loop is below.)
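
A rough sketch of the kind of loop I mean, with made-up URLs and patterns standing in for the real ones (I use LWP::Simple here, but any fetching method would do):

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Placeholder URLs and patterns -- the real script has its own.
    my @urls     = ('http://www.example.com/page1', 'http://www.example.com/page2');
    my @patterns = (qr/Name:\s*([^<]+)/, qr/Price:\s*([\d.]+)/);    # ...and eight more

    open my $out, '>>', 'data.tsv' or die "Cannot open data.tsv: $!";

    for my $url (@urls) {
        my $html = get($url);
        next unless defined $html;

        for my $re (@patterns) {
            my ($value) = $html =~ $re;
            my $field = defined $value ? $value : '';
            print {$out} $field, "\t";    # write each value as soon as it is extracted
        }
        print {$out} "\n";                # end of one observation

        sleep 10;                         # the generous pause between requests
    }

    close $out;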

Would it be faster to save the ten values to an array and then write the contents of the array to the file in one go? If that is faster for one webpage, would it be faster still to build some kind of array of arrays and collect 20 observations * 10 variable values before writing? Supposing the array approach is more efficient to begin with, how can I discover the most efficient array size to accumulate before writing to the text file?
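
For comparison, the array-of-arrays version I have in mind would look roughly like the sketch below, reusing @urls, @patterns and $out from the sketch above; the batch size of 20 is just a number I picked:

    my @rows;
    my $batch_size = 20;    # the size I am asking about

    for my $url (@urls) {
        my $html = get($url);
        next unless defined $html;

        my @values;
        for my $re (@patterns) {
            my ($value) = $html =~ $re;
            push @values, defined $value ? $value : '';
        }
        push @rows, \@values;    # one observation = one array reference

        if (@rows >= $batch_size) {
            print {$out} map { join("\t", @$_) . "\n" } @rows;    # write the batch
            @rows = ();
        }

        sleep 10;
    }

    # write out whatever is left in the buffer
    print {$out} map { join("\t", @$_) . "\n" } @rows if @rows;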


Replies are listed 'Best First'.
Re: most efficient way to scrape data and put it into a tsv file
by Aristotle (Chancellor) on Aug 29, 2004 at 21:41 UTC

    10 variables per iteration is nothing. In addition, Perl buffers I/O by default, so there'd be little difference even for large data sets (in which case the array method would consume more memory).

    Additionally, the network I/O will probably dominate your runtime by a few orders of magnitude in comparison to anything else, so even if you managed to make any significant local difference with these micro-optimizations, the script still wouldn't run appreciably faster.

    That said, I prefer to separate my code into phases where I fetch input, munge data or do calculations, and produce output, simply because it helps separation of concerns and therefore makes the code more maintainable.
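
    For instance, a skeleton of that three-phase structure might look like the sketch below; extract_fields and the patterns inside it are placeholders, not your actual code:

        use strict;
        use warnings;
        use LWP::Simple qw(get);

        my @urls = @ARGV;    # or however you build the list of pages

        # Phase 1: fetch everything (insert your polite pauses here;
        # this is where nearly all the time goes anyway)
        my @pages = grep { defined } map { get($_) } @urls;

        # Phase 2: munge -- turn each page into one row of fields
        my @rows = map { [ extract_fields($_) ] } @pages;

        # Phase 3: output -- write the whole table at the end
        open my $out, '>', 'data.tsv' or die "Cannot open data.tsv: $!";
        print {$out} join("\t", @$_), "\n" for @rows;
        close $out;

        # Placeholder extraction routine with made-up patterns.
        sub extract_fields {
            my ($html) = @_;
            my ($name)  = $html =~ /Name:\s*([^<]+)/;
            my ($price) = $html =~ /Price:\s*([\d.]+)/;
            return map { defined $_ ? $_ : '' } ($name, $price);
        }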

    Makeshifts last the longest.

      What do you mean that Perl buffers I/O by default? Also, I have an option that puts a dot on the screen each time I successfully grab a page, and I've set $|=1 so that the dots appear as they happen. What am I actually doing there? Am I harming performance by doing this?

        Yes, setting $| to a true value disables buffering on the currently selected filehandle (strictly, it forces a flush after every print). If you do that to produce dots as you go, you're paying a penalty of at most a few microseconds per dot. Considering that establishing a connection and pulling a page over HTTP takes anywhere from a few milliseconds to several seconds, I don't see why you should even care about your prints.
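
        In other words, the dot-printing pattern you describe amounts to something like this (just a sketch, assuming the same LWP::Simple-style fetching as in the sketch in your question, with the extraction elided):

            $| = 1;    # autoflush the currently selected handle, STDOUT

            for my $url (@urls) {
                my $html = get($url);
                print '.';    # shows up immediately; its cost is lost in the HTTP round trip
                # ... extract the variables and write them to the file as usual ...
                sleep 10;
            }
            print "\n";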

        Performance should be your very last concern, and you should only worry about it once you have found that your code is actually too slow in practice. Even then, you don't start guessing at what could be made faster: you profile the code and look at where it is actually spending its time. If a script takes 3 minutes to run and you speed up a random part of the code by a factor of 10, that doesn't help much if that part only accounted for 1 second of the runtime anyway: you're now down from 3:00 minutes to 2:59.1 minutes.
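
        If you do want numbers rather than guesses for the write-strategy question, the core Benchmark module can measure it directly. A minimal sketch, assuming a Unix-ish system so that /dev/null is available to keep the disk out of the picture:

            use Benchmark qw(cmpthese);

            my @values = map { "value$_" } 1 .. 10;    # stand-ins for the ten variables
            open my $null, '>', '/dev/null' or die "Cannot open /dev/null: $!";

            cmpthese(-2, {
                # ten separate prints, one per value
                per_value => sub { print {$null} "$_\t" for @values; print {$null} "\n" },
                # a single print of one joined line
                one_line  => sub { print {$null} join("\t", @values), "\n" },
            });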

        The network I/O is going to take so much more time in your script than anything else it does that no optimization is going to matter. If you want it to go faster, make your downloads go faster. If you can't do that, don't bother changing anything.

        Makeshifts last the longest.

        Perl buffers I/O by default. Files are block-buffered, STDOUT is line-buffered when connected to a terminal and block-buffered when redirected, and STDERR is unbuffered. This means Perl waits until it sees the end of a line (for a line-buffered handle) or until the buffer fills (for a block-buffered one) before doing an operating system write.

        Are you writing your output to STDOUT or to a file? It sounded like the latter. Setting $| = 1 only unbuffers the currently selected handle, which is STDOUT unless you have changed it with select(). It is possible to unbuffer file handles too, but that is mainly used for logs, where each line should hit the disk independently.
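
        If you ever do need an unbuffered file handle (for a log, say), one way is the autoflush method from IO::Handle; the file name here is just an example:

            use IO::Handle;    # provides autoflush() on lexical filehandles

            open my $log, '>>', 'scrape.log' or die "Cannot open scrape.log: $!";
            $log->autoflush(1);    # unbuffer this one handle; STDOUT is unaffected

            # The older idiom does the same thing via select():
            # my $old = select($log); $| = 1; select($old);

            print {$log} "fetched a page\n";    # one operating system write per line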