sskohli has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am creating an app to help students and researchers of marketing. The app would take the form of a survey/questionnaire: it would collect responses from different individuals and try to find patterns in them - for example, "age group 10-15 likes candies", or something like that. Since this would require input of a HUGE amount of data, I am worried about which data structure to use, and also about the response time and the time needed to find the patterns. Should I use an XML-style DOM structure,
<responses>
  <response1>
    <ques1/>
    <ques2/>
    ...
    <quesn/>
  </response1>
  ...
  <responsen>
    <ques1/>
    <ques2/>
    ...
    <quesn/>
  </responsen>
</responses>
and then try to find patterns between responses/respondents? Or should I use a database with something like
select count(*) from tbl where ques1 = ans1...some complex query...

Summarising, what I need is:
1. An efficient data structure with fast lookup.
2. Reduced search time within sub-categories.

Hope the Perl Monks will bestow their wisdom on this mere mortal :)
sandeep

Replies are listed 'Best First'.
Re: Research App Optimization
by tilly (Archbishop) on Oct 31, 2006 at 13:42 UTC
    How much data is "HUGE"? Are you thinking of, say, asking 100,000 people 100 questions each? That's 10 million rows. For current computers, that is not a problem.

    Just stick the data in a database. When you need to process it, the time taken to process the data will be minor compared to the effort of figuring out what questions to ask. If you know that particular items are going to be relevant (for instance the survey you're looking at, the age of the respondent, etc.) then add appropriate indexes. Trying to optimize further at this point is seriously premature.
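    For instance, with a single table of answers, adding an index is one statement per lookup pattern. A rough sketch (the table and column names are invented for illustration):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical layout: one row per (respondent, question) pair.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=survey.db', '', '',
                                { RaiseError => 1 } );

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS answers (
                respondent_id INTEGER,
                question_id   INTEGER,
                answer        TEXT
            )
        });

        # Index the columns you expect to filter on most often.
        $dbh->do(q{CREATE INDEX IF NOT EXISTS idx_q_a
                   ON answers (question_id, answer)});
        $dbh->do(q{CREATE INDEX IF NOT EXISTS idx_resp
                   ON answers (respondent_id)});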

    Several additional comments. First of all, if the researchers have any statistical analysis packages that they are used to, be prepared to export data in a format those packages can read. Seriously, if you can write yourself out of the "thought-question-answer" loop, do so. Along the same lines, if they don't have favorite tools, then seriously consider giving them one that doesn't involve you. For instance, export the data in some Access-friendly format, load the data into Access, then show them how to use Access. And finally, be prepared to learn some statistics yourself. For instance, I've used tools like Statistics::Regression to good effect, but unless you know what they do and whether they fit, you won't be able to use them. (Or worse yet - and very commonly - you'll use them as a magic oracle and misinterpret the results.)
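    To give a flavour of Statistics::Regression (a sketch from memory - check the module's docs; the data and variable names here are invented):

        use strict;
        use warnings;
        use Statistics::Regression;

        # Regress spending on age and income; "const" is the intercept.
        my $reg = Statistics::Regression->new(
            'spending model', [ 'const', 'age', 'income' ],
        );

        # One call per observation: dependent value, then the regressors
        # (a literal 1.0 for the intercept term).
        $reg->include( 25.0, [ 1.0, 22, 31_000 ] );
        $reg->include( 40.5, [ 1.0, 35, 48_000 ] );
        $reg->include( 51.2, [ 1.0, 44, 60_000 ] );
        $reg->include( 38.7, [ 1.0, 30, 52_000 ] );

        my @theta = $reg->theta();   # fitted coefficients
        $reg->print();               # summary report to STDOUT

    The point stands, though: unless you know what a regression assumes, the numbers this prints are easy to misread.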

      Hi Tilly, thanks for your response. Yes, we are looking at at least 20 to 50 thousand responses, answering around 50-80 questions. Also, the same respondents could be asked more questions in the future, so the question database would keep growing over time. I looked at the Data::Mining module, but it doesn't cater to what I need. Right now I am looking at a Perl hash of hashes, storing it in a file with Storable or something and loading it into memory when required, to reduce the time. There could be so many associations and rules that I am thinking of writing a rule engine too; I researched yagg yesterday. A rule could be something like: in the year 1980, in the city 'New York', which fabric type sold the most for age group 20-25? Can I store this historical data in the db, or am I better off using hashes? I preferred XML, as my hash would easily form an XML DOM tree:
      <results>
        <result id='1' year='1980'>
          <sales>
            <sale>
              <location>New York</location>
              <cotton>5.6</cotton>
              <nylon>8.4</nylon>
            </sale>
          </sales>
        </result>
      </results>
      I will look into PDL and the statistics packages as well. Thanks, all. Sandeep
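      Something like this untested sketch is what I have in mind for the hash-of-hashes idea (all names invented):

          use strict;
          use warnings;
          use Storable qw(store retrieve);

          # Hash of hashes: respondent first, then question.
          my %responses = (
              respondent1 => { ques1 => 'candy', ques2 => '10-15' },
              respondent2 => { ques1 => 'gum',   ques2 => '20-25' },
          );

          # Serialise the whole structure to disk...
          store \%responses, 'responses.stor';

          # ...and load it back into memory when needed.
          my $loaded = retrieve('responses.stor');
          print $loaded->{respondent1}{ques1}, "\n";   # "candy"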
        Databases are for data. With that little volume, there will be no problem storing it all in the database and worrying later about what you're going to do with it. Yes, including historical data. And for intermediate manipulations, the database can already do a lot of what you want - you just have to write the query.

        Should you change your mind, it is easy enough to export it all into any format you want. And if you're worried about performance, don't. Databases are a lot faster than searching through and processing XML files. (In fact processing XML is a very silly thing to do if you care about performance.)

        One note, however. You'll want to think through what information you're capturing and how to organize it. Databases really start to shine when you structure your data appropriately for them. In fact, if you can, I'd suggest finding someone local who knows databases, discussing the problem with them, and having them suggest a table structure for you.
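        As a starting point, survey data is often normalized along these lines (only a sketch - the names are invented, and your real tables depend on what you capture):

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=survey.db', '', '',
                                    { RaiseError => 1 } );

            # One row per person, one per question, one per answer given.
            $dbh->do(q{
                CREATE TABLE respondents (
                    id   INTEGER PRIMARY KEY,
                    age  INTEGER,
                    city TEXT
                )
            });
            $dbh->do(q{
                CREATE TABLE questions (
                    id   INTEGER PRIMARY KEY,
                    text TEXT
                )
            });
            $dbh->do(q{
                CREATE TABLE answers (
                    respondent_id INTEGER REFERENCES respondents(id),
                    question_id   INTEGER REFERENCES questions(id),
                    answer        TEXT
                )
            });

        New questions then become new rows in the questions table, so the schema doesn't change as the survey grows.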

Re: Research App Optimization
by rhesa (Vicar) on Oct 31, 2006 at 12:35 UTC
    Ideally, you would collect the input into a database or a csv file, and do the analysis offline in a statistics package like SPSS.

    Most databases can do rudimentary statistical analysis, so if that's enough for your purpose you might as well use it. Some relevant SQL keywords are AVG, STDDEV, GROUP BY, ROLLUP, and CUBE.
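    For example, the "age group likes candies" question from the original post reduces to a single aggregate query. A sketch (table and column names invented):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=survey.db', '', '',
                                { RaiseError => 1 } );

        # Count each answer to one question, broken down by age.
        my $rows = $dbh->selectall_arrayref(q{
            SELECT r.age, a.answer, COUNT(*) AS n
            FROM   answers a
            JOIN   respondents r ON r.id = a.respondent_id
            WHERE  a.question_id = ?
            GROUP  BY r.age, a.answer
            ORDER  BY n DESC
        }, undef, 1);

        printf "%3d  %-20s %d\n", @$_ for @$rows;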

    Alternatively, you might be interested in PDL, the Perl Data Language.

    XML is definitely not the best storage mechanism for this.

Re: Research App Optimization
by jbert (Priest) on Oct 31, 2006 at 13:58 UTC
    Looking for such patterns might come under the heading of data mining. Why am I not surprised that CPAN has a module in this area? Data::Mining::AssociationRules (no idea on quality, I'm afraid). It might pay to look at this module and, if it suits your needs, store your data in whatever format it wants natively.

    By the way, having xxx1, xxx2 in code or data structures is often a warning sign that you might want to consider some kind of container or list structure. Also, it is as useful to put effort into good naming in data models as it is for variables. If something is the answer to a question, then it might be better named answer (apologies if I've misunderstood your data structure).

    In your XML (and I agree with others that XML is primarily a data transfer language rather than a data storage/querying language) it might be better to have:

    <responses>
      <response id="1">
        <answer question="1">I like cheese</answer>
        <answer question="2">Halloween is scary</answer>
        ...
      </response>
      ...
    </responses>
    Depending on anonymity, you might add other attributes to identify a particular <response>, such as name, date etc.

    If you're writing code (or XPath queries) to examine your original data structure, you'll end up duplicating things to look for all the different tag types.
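    To illustrate, a sketch with XML::LibXML against the revised layout above:

        use strict;
        use warnings;
        use XML::LibXML;

        my $xml = q{
            <responses>
              <response id="1">
                <answer question="1">I like cheese</answer>
                <answer question="2">Halloween is scary</answer>
              </response>
            </responses>
        };

        my $doc = XML::LibXML->new->parse_string($xml);

        # With <ques1/>, <ques2/>, ... you would need one XPath per tag
        # (//ques1, //ques2, ...). With a uniform <answer> element, a
        # single parameterized query covers every question:
        for my $node ( $doc->findnodes('//answer[@question="1"]') ) {
            print $node->textContent, "\n";   # "I like cheese"
        }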

    I bring this up because most of the same issues apply to data modelling in SQL as well, so it's worth thinking about at this stage.

      >> By the way, having xxx1, xxx2 in code or data structures is often a warning sign that you might want to consider some kind of container or list structure.
      >> Also, it is as useful to put effort into good naming in data models as it is for variables.

      You are so right; my boss and my teachers used to eat my head about proper naming conventions. When I started this project, I rewrote the code two times because I forgot what the variables meant and what each code section did. Now I am putting in adequate comments and using long, descriptive variable names.
      Learning this the hard way.
      Thanks,
      sandeep
Re: Research App Optimization
by themage (Friar) on Oct 31, 2006 at 11:17 UTC
    Hi sskohli,

    My opinion is as bad as any other, and probably worse than most, but I think that XML is a good thing when you need to transfer data between automated systems, or if you need a structured configuration format.

    When your needs are storing huge amounts of data, and especially if you need to perform complex filtering and lookups on that data, XML is not the ideal way.

    In truth, I don't see any advantage in using XML in this case.

    It'll be unmanageable to create the XML directly, so you need an interface to insert your data into the XML. Reading data from the XML will always force you to read the XML file completely. Processing XML is a complex task, and a database, with a correctly defined structure and indexes, will do the job much more efficiently.

    Just my 2 cents,