in reply to Working with large amount of data

You've received a bunch of good suggestions. Myself, I'd go for the database approach, but perhaps that's because I'm a database guy. Don't take my word for it, though.

Grab a small chunk of that terabyte file and push it through each of the suggested approaches, keeping track of speed and memory usage. Slowly increase the chunk size -- ideally you'll see what kind of resource curve each approach follows as the data grows.
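
To make that concrete, here's a rough Perl driver for that kind of test. It's only a sketch: the file name, the chunk sizes, and the process_* subs are all placeholders for whatever approaches you're actually comparing.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Candidate approaches, keyed by a label for the report. The
    # process_* subs below are stubs -- wire in the real ones.
    my %approach = (
        'flat file scan' => \&process_flat_file,
        'database load'  => \&process_database,
    );

    # Chunk sizes (in lines) to test; grow them to sketch the curve.
    my @chunk_sizes = ( 10_000, 100_000, 1_000_000 );

    my $source = 'huge_data.txt';    # stand-in for the terabyte file

    for my $size (@chunk_sizes) {
        my $chunk = extract_chunk( $source, $size );
        for my $name ( sort keys %approach ) {
            my $t0 = [gettimeofday];
            $approach{$name}->($chunk);
            printf "%-16s %10d lines  %8.2f s\n",
                $name, $size, tv_interval($t0);
        }
        unlink $chunk;
    }

    # Copy the first $n lines of $file into a temporary chunk file.
    sub extract_chunk {
        my ( $file, $n ) = @_;
        my $chunk = "chunk_$n.txt";
        open my $in,  '<', $file  or die "Can't read $file: $!";
        open my $out, '>', $chunk or die "Can't write $chunk: $!";
        while (<$in>) {
            print $out $_;
            last if $. >= $n;
        }
        close $in;
        close $out;
        return $chunk;
    }

    # Stub: replace with the real flat-file processing.
    sub process_flat_file {
        my ($chunk) = @_;
        open my $fh, '<', $chunk or die "Can't read $chunk: $!";
        1 while <$fh>;    # just touch every line for now
        close $fh;
    }

    # Stub: replace with the real database load/query pass.
    sub process_database {
        my ($chunk) = @_;
        process_flat_file($chunk);    # stand-in until the DBI code goes here
    }

That only gives you wall-clock time; for the memory side, run each trial under /usr/bin/time -v, or watch VmHWM in /proc/self/status if you're on Linux.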

Go with the approach that seems likely to work 'best' for your definition of 'best'. (If you have time, try each approach on the whole file and see what performance you get -- we'd love to hear the results.)

Final word: it's OK to be in love with a particular approach, but you have to be scientific about this kind of thing. Abandon emotion, and apply logic and measurement to the problem. The numbers don't lie. Much. :)

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds