in reply to Huge data file and looping best practices

Without even looking at your code, my first reaction upon hearing about 480 columns, 8,000,000 lines and a 6G file is "put it into a database and let the database worry about that".

Once the data's in a database, you can group similar records together, look at just a subset, and so forth.
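
If you do go that route, here's a minimal Perl sketch of bulk-loading the file into SQLite via DBI (assuming DBD::SQLite is installed). The file name ('data.txt'), the comma-delimited layout, and the generic c1..c480 column names are all assumptions -- adjust them to your actual format:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # On-disk SQLite database; AutoCommit off so we can commit in batches.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=bigdata.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    # Hypothetical schema: 480 numeric columns named c1 .. c480.
    my @cols = map { "c$_" } 1 .. 480;
    $dbh->do( 'CREATE TABLE IF NOT EXISTS data ('
            . join( ', ', map { "$_ REAL" } @cols ) . ')' );

    my $ins = $dbh->prepare(
        'INSERT INTO data VALUES (' . join( ', ', ('?') x @cols ) . ')' );

    open my $fh, '<', 'data.txt' or die "Can't open data.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
        $ins->execute(@fields);
        $dbh->commit if $. % 100_000 == 0;    # batch commits for speed
    }
    close $fh;
    $dbh->commit;
    $dbh->disconnect;

Once it's loaded, pulling one chunk of rows at a time is just a SELECT with a WHERE clause (or LIMIT/OFFSET), which also makes it easy to hand different chunks to different processes or machines.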

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds


Re^2: Huge data file and looping best practices
by carillonator (Novice) on Apr 26, 2009 at 16:51 UTC
    @talex, will MySQL, for example, store the data more compactly, or allow for faster analysis? We're not concerned with subsets of the data, other than to break the computation into several chunks across processors or computers. We really need all the data. Plus, we don't have the SQL skills to do the analysis that way.

      A database may not be the best solution here -- from reading the other posts, it sounds like you're more interested in 'clumping' the data points together, creating 'neighborhoods' of 'nearest neighbors'. My Systems Design professor Ed Jernigan did research along those lines.

      Perhaps a first cut would be some sort of encoding of each data point, then a 'clumping' based on that, with further analysis on the smaller 'clumps'.
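
      Purely as a sketch of that idea (assuming comma-delimited numeric data in a hypothetical 'data.txt', and using a crude quantization of the first three columns as the 'encoding'):

          use strict;
          use warnings;

          my $bin_size = 10;    # assumed quantization step; tune to your data
          my %clump;            # bucket key => list of row numbers

          open my $fh, '<', 'data.txt' or die "Can't open data.txt: $!";
          while ( my $line = <$fh> ) {
              chomp $line;
              my @fields = split /,/, $line;    # assumes comma-delimited fields
              # Encode the row as a coarse bucket key from its first three columns.
              my $key = join '|', map { int( $_ / $bin_size ) } @fields[ 0 .. 2 ];
              push @{ $clump{$key} }, $.;       # store the row number, not the row
          }
          close $fh;

          # Each clump is a much smaller neighborhood to analyse on its own.
          for my $key ( sort keys %clump ) {
              printf "clump %s holds %d rows\n", $key, scalar @{ $clump{$key} };
          }

      Each clump can then be re-read and analysed separately, or handed off to a different process, which also fits your plan of splitting the work across processors or machines.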

      Alex / talexb / Toronto

      "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds