in reply to Re: Code efficiency / algorithm
in thread Code efficiency / algorithm

mmm... you definitely make a good point there. I would like to experiment with this proposed solution :) I guess my question is, how easy will it be for me to get the input data files loaded into a db? (which db should I use... simple .db files or mySQL?)

Cheers,

David

Re: Re: Re: Code efficiency / algorithm
by DaveH (Monk) on Jan 14, 2003 at 08:53 UTC

    I would recommend looking into SQLite (http://www.sqlite.org/), or more specifically, DBD::SQLite by Matt Sergeant (available on CPAN). This is a self-contained relational database stored in a single db file (not to be confused with DB_File - they are in no way related).

    A link to it was posted here by merlyn a few days ago, and I'm now seriously considering using it in my next project. It is much more manageable and, with the correct indexes, probably faster than rolling your own databases with something like DB_File.

    At the very least, the module may be useful for prototyping the concept of storing the info in a relational database, before moving on to a fully fledged RDBMS if you discover that it is necessary.
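
    Just to make that concrete, here is a rough sketch of what the DBI/DBD::SQLite version might look like. The file name, column names and tab-separated layout are guesses on my part, not anything from your actual data, so adjust to taste:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Create (or open) a single-file SQLite database.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=companies.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        # Guessed layout for the datafile1 records.
        $dbh->do(q{
            CREATE TABLE datafile1 (
                company    TEXT,
                start_date TEXT,
                end_date   TEXT,
                content    TEXT
            )
        });

        my $insert = $dbh->prepare(q{
            INSERT INTO datafile1 (company, start_date, end_date, content)
            VALUES (?, ?, ?, ?)
        });

        # Bulk load, committing once at the end rather than per row.
        open my $fh, '<', 'datafile1.txt' or die "datafile1.txt: $!";
        while (<$fh>) {
            chomp;
            my @fields = split /\t/, $_, 4;   # assumes tab-separated fields
            $insert->execute(@fields);
        }
        close $fh;

        $dbh->commit;
        $dbh->disconnect;

    Committing once at the end keeps the bulk load fast, since SQLite flushes to disk on every commit.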

    Cheers,

    -- Dave :-)


    $q=[split+qr,,,q,~swmi,.$,],+s.$.Em~w^,,.,s,.,$&&$$q[pos],eg,print
Re: Re: Re: Code efficiency / algorithm
by graff (Chancellor) on Jan 14, 2003 at 07:51 UTC
    Go for mySQL. Once you have your data files normalized into flat tables, loading them into mySQL will be really simple.

    But take your time in working out the table design, in order to make it fully normalized -- e.g. although you have just two data files, it looks like you may want at least three tables: one for companies, one for the "datafile1" records that relate to those companies, and the third for the "datafile2" records. The nature of the "other text content" in your data files may suggest other relations that could benefit from being indexed/related via one or more additional tables, apart from these three.
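
    For example (the column names and types below are only a guess at your data, and the connection details are placeholders), the three tables might look something like this, created through DBI:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:database=mydata', 'user', 'password',
                                { RaiseError => 1 } );

        # One table for the companies themselves ...
        $dbh->do(q{
            CREATE TABLE company (
                company_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                name       VARCHAR(255) NOT NULL,
                UNIQUE KEY (name)
            )
        });

        # ... one for the datafile1 records that refer to them ...
        $dbh->do(q{
            CREATE TABLE datafile1 (
                company_id INT UNSIGNED NOT NULL,
                start_date DATE NOT NULL,
                end_date   DATE NOT NULL,
                content    TEXT,
                KEY (company_id),
                KEY (start_date, end_date)
            )
        });

        # ... and one for the datafile2 records (whether these also carry a
        # company reference depends on your data).
        $dbh->do(q{
            CREATE TABLE datafile2 (
                record_date DATE NOT NULL,
                content     TEXT,
                KEY (record_date)
            )
        });

        $dbh->disconnect;

    Once the flat files are split into columns matching those tables, LOAD DATA INFILE (or the mysqlimport utility) will pull them in very quickly.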

Re: Re: Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 14, 2003 at 09:51 UTC

    Personally, unless datafile2 is static, or you can arrange to have whatever is writing that file write directly to the database, I think that using an RDBMS is likely to involve more trouble than it's worth. The code needed to normalise and insert 80MB+ into the RDBMS is likely to take longer to write, and longer to run, than processing the data with perl. If the datafile is static that may be mitigated, but if you have to keep reprocessing or munging large volumes of flat file in order to put it into the database, then the overhead doesn't seem worth it to me.

    Further, performing a join on tables where one of them is that size, and where every record has to be joined with one or more records from the smaller table, is going to need quite a substantial box and is not going to be quick.

    I would be of a different opinion if you could arrange to write the bigfile data directly to the RDBMS and needed to perform one or more additional relational processings of the big file data on a regular basis; then the process of conversion would start to make sense.
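
    To give an idea of what the pure perl approach might look like (the file layouts here are only a guess, based on the company / start date / end date description given elsewhere in this thread):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Load the smaller file into memory once.
        # Layout is a guess: company, start date, end date, free text,
        # tab-separated, with dates in YYYYMMDD form so they compare as strings.
        my @ranges;
        open my $small, '<', 'datafile1.txt' or die "datafile1.txt: $!";
        while (<$small>) {
            chomp;
            my ($company, $start, $end, $text) = split /\t/, $_, 4;
            push @ranges, [ $start, $end, $company, $text ];
        }
        close $small;

        # Stream the big file a line at a time, so its size never
        # matters to memory use.
        open my $big, '<', 'datafile2.txt' or die "datafile2.txt: $!";
        while (<$big>) {
            chomp;
            my ($date, $rest) = split /\t/, $_, 2;
            $rest = '' unless defined $rest;

            # Report every datafile1 record whose date range covers this date.
            for my $r (@ranges) {
                next if $date lt $r->[0] || $date gt $r->[1];
                print join( "\t", $date, $r->[2], $rest ), "\n";
            }
        }
        close $big;

    For 80MB+ of datafile2 this runs in a single pass, and the only thing held in memory is the smaller file.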

    As the popular disclaimer goes, just my 2 cents.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Only the first file would need to go into the database, and it would just be a simple table with company, start date, end date, and the text (with maybe an index on start and end date). The second file you would just read sequentially, using the date as an argument to query the datafile1 table.
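
      Something along these lines, perhaps (SQLite shown, but the same DBI code would work against mySQL; the tab-separated layout of the second file is a guess):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=companies.db', '', '',
                                  { RaiseError => 1 } );

          # One prepared query, reused for every line of the second file.
          my $sth = $dbh->prepare(q{
              SELECT company, content
              FROM   datafile1
              WHERE  start_date <= ? AND end_date >= ?
          });

          open my $fh, '<', 'datafile2.txt' or die "datafile2.txt: $!";
          while (<$fh>) {
              chomp;
              my ($date, $rest) = split /\t/, $_, 2;   # guessed layout
              $rest = '' unless defined $rest;

              $sth->execute($date, $date);
              while ( my ($company, $text) = $sth->fetchrow_array ) {
                  print join( "\t", $date, $company, $text, $rest ), "\n";
              }
          }
          close $fh;

          $dbh->disconnect;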