in reply to Research App Optimization

How much data is "HUGE"? Are you thinking of, say, asking 100,000 people 100 questions each? That's 10 million rows. For current computers, that is not a problem.

Just stick the data in a database. When you need to process it, the time taken to process the data will be minor compared to the effort of figuring out what questions to ask. If you know that particular items are going to be relevant (for instance the survey you're looking at, the age of the respondent, etc.), then add appropriate indexes. Trying to optimize further at this point is seriously premature.
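
A minimal sketch of what that can look like, assuming DBI with an SQLite file and made-up table and column names (substitute whatever database and schema you actually settle on):

    use strict;
    use warnings;
    use DBI;

    # Connect to an SQLite file; swap in whichever database you actually use.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=survey.db", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    # One row per answer, keyed by survey, respondent and question.
    $dbh->do(
        "CREATE TABLE answers (
             survey_id     INTEGER NOT NULL,
             respondent_id INTEGER NOT NULL,
             question_id   INTEGER NOT NULL,
             answer        TEXT
         )"
    );

    # Index the columns you expect to filter on.
    $dbh->do("CREATE INDEX idx_survey   ON answers (survey_id)");
    $dbh->do("CREATE INDEX idx_question ON answers (question_id)");

    # Load answers with a prepared statement.
    my $sth = $dbh->prepare(
        "INSERT INTO answers (survey_id, respondent_id, question_id, answer)
         VALUES (?, ?, ?, ?)"
    );
    $sth->execute( 1, 1001, 42, "20-25" );    # survey 1, respondent 1001, question 42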

Several additional comments. First of all, if the researchers have any statistical analysis packages they are used to, be prepared to export data in a format those packages can read. Seriously, if you can write yourself out of the "thought-question-answer" loop, do so. Along the same lines, if they don't have favorite tools, then seriously consider giving them one that doesn't involve you. For instance, export the data in an Access-friendly format, load it into Access, then show them how to use Access. And finally, be prepared to learn some statistics yourself. For instance, I've used tools like Statistics::Regression to good effect, but unless you know what such tools do and whether they fit your data, you won't be able to use them. (Or worse yet - and very commonly - you'll use them as a magic oracle and misinterpret the results.)
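
To give a feel for it, a Statistics::Regression run looks roughly like this; the variable names and numbers below are invented, and you should check the module's documentation for the exact interface of the version you install:

    use strict;
    use warnings;
    use Statistics::Regression;

    # Regress spending on age and income; the names and numbers are made up.
    my $reg = Statistics::Regression->new(
        "spending vs. age and income",
        [ "const", "age", "income" ],
    );

    # include() takes the dependent value, then an arrayref of predictors,
    # with a leading 1.0 for the constant term.
    $reg->include( 120.5, [ 1.0, 23, 32_000 ] );
    $reg->include(  98.2, [ 1.0, 41, 28_500 ] );
    $reg->include( 143.0, [ 1.0, 35, 41_000 ] );
    $reg->include( 110.7, [ 1.0, 52, 36_000 ] );

    $reg->print();                   # coefficients, R^2, and so on
    my @theta = $reg->theta();       # the fitted coefficients
    my $rsq   = $reg->rsq();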

Re^2: Research App Optimization
by sskohli (Initiate) on Nov 01, 2006 at 05:02 UTC
    Hi Tilly, thanks for your response. Yes, we are looking at at least 20 to 50 thousand respondents, each answering around 50-80 questions. The same respondents could also be asked more questions in the future, so the question database would keep growing over time. I looked at the Data::Mining module, but it doesn't cater to what I need. Right now I am looking at a Perl hash of hashes, storing it in a file with Storable or something similar, and loading it into memory when required to reduce the time. There could be so many associations and rules that I am thinking of writing a rule engine too; I researched yagg yesterday. A rule could be something like: in year 1980, in city 'New York', maxSold fabricType for age group 20-25? Can I store this historical data in the db, or am I better off using hashes? I preferred XML, as my hash would easily form an XML DOM tree.
    <results>
      <result id='1'>
        <year>1980</year>
        <sales>
          <sale>
            <location>New York</location>
            <cotton>5.6</cotton>
            <nylon>8.4</nylon>
          </sale>
        </sales>
      </result>
    </results>
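    For comparison, a rough sketch of the same data as a hash of hashes, persisted with Storable (the file name is arbitrary):

        use strict;
        use warnings;
        use Storable qw(store retrieve);

        # The same data as the XML above, as a hash of hashes.
        my %results = (
            1980 => {
                'New York' => { cotton => 5.6, nylon => 8.4 },
            },
        );

        store( \%results, 'results.stor' );        # write to disk ...
        my $results = retrieve('results.stor');    # ... load it back later
        print $results->{1980}{'New York'}{nylon}, "\n";   # prints 8.4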
    I will look into PDL and the statistics packages as well. Thanks all. Sandeep
      Databases are for data. With that little volume, there will be no problem storing it all in the database and worrying about what you're going to do with it later. Yes, including the historical data. And for intermediate manipulations, the database can already do a lot of what you want - you just have to write the query.
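
      For instance, the "maxSold fabricType in 1980 in New York for age group 20-25" question above is a single aggregate query. A rough sketch with DBI, assuming a hypothetical sales table with year, city, age_group, fabric and quantity columns:

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( "dbi:SQLite:dbname=survey.db", "", "", { RaiseError => 1 } );

          # Which fabric sold the most for this year, city and age group?
          my $rows = $dbh->selectall_arrayref(
              "SELECT fabric, SUM(quantity) AS total_sold
               FROM sales
               WHERE year = ? AND city = ? AND age_group = ?
               GROUP BY fabric
               ORDER BY total_sold DESC
               LIMIT 1",
              { Slice => {} },              # return each row as a hashref
              1980, 'New York', '20-25',
          );

          print "$rows->[0]{fabric} sold most: $rows->[0]{total_sold}\n" if @$rows;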

      Should you change your mind, it is easy enough to export it all into any format you want. And if you're worried about performance, don't. Databases are a lot faster than searching through and processing XML files. (In fact processing XML is a very silly thing to do if you care about performance.)

      However, one note: you'll want to think through what information you're capturing and how to organize it. Databases really start to shine when you structure your data appropriately for them. In fact, if you can, I'd suggest finding someone local who knows databases, discussing the problem with them, and having them suggest a table structure for you.