in reply to Re: Huge data file and looping best practices
in thread Huge data file and looping best practices

Thanks @przemo. I'll check out Bit::Vector. We think there are only about 400,000 unique characteristic sets among the 8 million patents, so we'll probably work with that smaller set instead, given the absurd amount of data the full run would return.
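
For anyone following along, here is a minimal sketch of how Bit::Vector might encode a characteristic set as a compact, order-independent string; the characteristic names and their fixed indices are made up for illustration:

    use strict;
    use warnings;
    use Bit::Vector;

    # Hypothetical: assign each known characteristic a fixed bit index.
    my %char_index = ( colour => 0, shape => 1, weight => 2 );
    my $n_chars    = scalar keys %char_index;

    # One patent's characteristic set, in arbitrary order.
    my @set = qw(weight colour);

    my $vec = Bit::Vector->new($n_chars);
    $vec->Bit_On( $char_index{$_} ) for @set;

    # to_Hex() yields the same string for the same set regardless of
    # input order, so it can serve directly as a dedup key.
    my $key = $vec->to_Hex();
    print "$key\n";    # prints "5" (bits 0 and 2 set)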

Re^3: Huge data file and looping best practices
by davido (Cardinal) on Apr 27, 2009 at 02:35 UTC

    If you have 400,000 unique characteristic sets among the 8 million patents, now you're getting somewhere. If you can find a way to stringify a given set the same way every time it comes up, you can use that string as a hash key whose value is a data structure of the patent IDs sharing that set. That gives you a workable structure that can be split into manageable files based on the groupings by unique characteristics.
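
    A rough sketch of that grouping idea, assuming a whitespace-separated input of patent ID followed by its characteristics (the field layout and sample data are my invention):

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my %groups;
        while ( my $line = <DATA> ) {
            my ( $patent_id, @characteristics ) = split ' ', $line;
            next unless defined $patent_id;

            # Sorting makes the key identical no matter what order the
            # characteristics arrive in.
            my $key = join ',', sort @characteristics;
            push @{ $groups{$key} }, $patent_id;
        }

        # Each unique characteristic set now maps to the patents that
        # share it; from here the groups could be written out to
        # separate, manageable files.
        for my $key ( sort keys %groups ) {
            print "$key => @{ $groups{$key} }\n";
        }

        __DATA__
        P0001 A C B
        P0002 B A C
        P0003 D

    Running this prints "A,B,C => P0001 P0002" and "D => P0003", i.e. the 400,000 keys would each carry the list of patents exhibiting that set.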

    ...just a thought, though I'm not sure how helpful it is.


    Dave