
Hello. Recently I've been downloading lots of data from various websites. I've started leaving generous pauses between requests to avoid eating up their bandwidth. My question is how to make the most efficient use of the time when I'm not sending requests to the website.

Each webpage I access holds a single observation with approximately 10 variables. My current script works in this order:

1) Get the source of a webpage.
2) Use regular expressions to extract the relevant values for the 10 variables.
3) After extracting EACH variable, write it immediately to a text file.
4) Repeat.
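
For concreteness, steps 2 and 3 for a single page might be sketched as below. This is only an illustration, not the script described: the post does not say which language is used (Python is assumed here), the names `PATTERNS`, `extract_observation` and `to_tsv_line` are hypothetical, only two of the ten variables are shown, and the HTML snippet is made up. It also differs from step 3 in one respect: it builds the whole row first and emits it as one tab-separated line.

```python
import re

# Hypothetical field patterns; a real page would need its own expressions.
PATTERNS = {
    "name": re.compile(r'<span class="name">(.*?)</span>'),
    "price": re.compile(r'<span class="price">(.*?)</span>'),
}


def extract_observation(html):
    """Return one value per pattern, in PATTERNS order; '' when not found."""
    values = []
    for pattern in PATTERNS.values():
        match = pattern.search(html)
        values.append(match.group(1) if match else "")
    return values


def to_tsv_line(values):
    """Join values into one tab-separated line ending in a newline."""
    return "\t".join(values) + "\n"


if __name__ == "__main__":
    # Made-up page source standing in for step 1 (fetching the page).
    page = '<span class="name">Widget</span><span class="price">9.99</span>'
    print(to_tsv_line(extract_observation(page)), end="")
```

Building the row before writing keeps each observation on one line of the output file even if the script is interrupted between variables.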

Would it be faster to save the ten values to an array and then write the contents of the array to the file? If that is faster for one webpage, would it be faster still to create some kind of array of arrays and capture, say, 20 observations * 10 variable values before writing? Supposing the array approach is more efficient, how can I discover the most efficient array size to accumulate before writing to the text file?
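
One way to answer the array-size question empirically is to time it. A minimal sketch, under stated assumptions: Python is used only for illustration, the helper names `write_rows` and `time_batch_sizes` are hypothetical, and the rows are fake data standing in for scraped observations. It writes the same 10-column rows to a temporary TSV file with several batch sizes and reports the elapsed time for each.

```python
import csv
import os
import tempfile
import time


def write_rows(path, rows, batch_size):
    """Write rows to a TSV file, handing them to the writer batch_size at a time."""
    buffer = []
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for row in rows:
            buffer.append(row)
            if len(buffer) >= batch_size:
                writer.writerows(buffer)
                buffer.clear()
        if buffer:
            # Write whatever is left over after the last full batch.
            writer.writerows(buffer)


def time_batch_sizes(rows, batch_sizes):
    """Return {batch_size: seconds} for writing rows with each batch size."""
    timings = {}
    for size in batch_sizes:
        fd, path = tempfile.mkstemp(suffix=".tsv")
        os.close(fd)
        try:
            start = time.perf_counter()
            write_rows(path, rows, size)
            timings[size] = time.perf_counter() - start
        finally:
            os.remove(path)
    return timings


if __name__ == "__main__":
    # Fake data: 2000 observations of 10 variables each (an assumption).
    fake_rows = [[f"value{col}" for col in range(10)] for _ in range(2000)]
    for size, seconds in time_batch_sizes(fake_rows, [1, 20, 200]).items():
        print(f"batch size {size:>4}: {seconds:.4f} s")
```

Two caveats when reading the numbers. File handles in most languages are buffered by default, so many small writes do not each reach the disk, and the differences between batch sizes may turn out small. And if the script already pauses generously between requests, the time spent writing is likely a tiny fraction of the total either way, so measuring before optimizing is probably worthwhile.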

In reply to most efficient way to scrape data and put it into a tsv file by dchandler
