dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a long list of items containing the following fields: article title, authors, journal, abstract. Currently I'm storing them in flat files where each line contains all fields, tab delimited. A separate file holds the location information (file name and byte offset) for each article.

I'm using the article title as an identifier and then analyzing (basically just running regexps) the abstract. The other fields are not used.

The question is: have I chosen the best way for storage? I suspect the answer is no. Any other suggestions?

Re: Basic data storage question
by cleverett (Friar) on Jul 30, 2003 at 03:17 UTC
    Depends on what you want to do.

    If you're using the title to select one abstract, and then running regexps on just one abstract, then depending on how many abstracts you're dealing with, there's probably a better way.

    If you're running regexps on each one of the abstracts in sequence and that's all you do, flat files are about as fast as you get, IMHO.
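
    A minimal sketch of that sequential case, assuming the field order given in the question (title, authors, journal, abstract) with tabs between fields. The sample records, the pattern, and the sub name titles_matching are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one of the flat files:
# title, authors, journal, abstract, separated by tabs.
my $data = join '',
    "Protein folding\tSmith J\tJ Mol Biol\tWe study kinase activity in yeast.\n",
    "Graph search\tLee K\tJ Algorithms\tA faster shortest-path method.\n";

# Return the titles of all articles whose abstract matches $pattern.
sub titles_matching {
    my ($fh, $pattern) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        # A limit of 4 keeps any stray tabs inside the abstract intact.
        my ($title, $authors, $journal, $abstract) = split /\t/, $line, 4;
        next unless defined $abstract;
        push @hits, $title if $abstract =~ $pattern;
    }
    return @hits;
}

# An in-memory filehandle keeps the sketch self-contained;
# with real data you would open the flat file instead.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @hits = titles_matching($fh, qr/kinase/i);
close $fh;
print "$_\n" for @hits;
```

    Every line is read exactly once and nothing is kept in memory except the hits, which is why a plain file is hard to beat for this access pattern.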

    Reading between the lines, I suspect the former rather than the latter, in which case for a sufficiently large number of articles (I dunno, a thousand or more?) you can speed things up by an order of magnitude or more.

    If that's so, then you likely want to separate things into a list of authors, a list of journals, and a list of articles with pointers to authors and journals. A setup like this would let you query articles by author or journal, for instance.
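
    To make the idea of "pointers" concrete, here is a rough sketch in plain Perl data structures, with no database involved. The sample data, the numeric ids, and the sub name titles_by_author are assumptions for illustration; in a relational database each structure would become a table:

```perl
use strict;
use warnings;

# Hypothetical sample data; the numeric ids play the role of the "pointers".
my %authors  = (1 => 'Smith J', 2 => 'Lee K');
my %journals = (1 => 'J Mol Biol');
my @articles = (
    { title => 'Protein folding', journal => 1, authors => [1, 2],
      abstract => 'We study kinase activity in yeast.' },
    { title => 'Yeast genetics',  journal => 1, authors => [2],
      abstract => 'A survey of recent results.' },
);

# Return the titles of all articles that list the given author id.
sub titles_by_author {
    my ($author_id) = @_;
    return map  { $_->{title} }
           grep { my $art = $_; grep { $_ == $author_id } @{ $art->{authors} } }
           @articles;
}

my @by_lee   = titles_by_author(2);
my @by_smith = titles_by_author(1);
print "$authors{2}: $_\n" for @by_lee;
print "$authors{1}: $_\n" for @by_smith;
```

    Each author and journal name is stored once, and an article refers to it by id, so a query by author or journal never has to parse the abstracts at all.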

    If that appeals to you, then a database may be in order. Popular open source database packages include:

    Perl modules for database access start with the ubiquitous DBI and range all the way up to relational-DB-backed object frameworks like SPOPS and Alzabo. My personal favorite for intuitive ease of use is Class::DBI, which maps object classes to database tables on a one-to-one basis. It's not perfect (yet), but I find myself more productive using it.

    OTOH, you may want to keep things all in a single file as you do now, but speed up your searches, in which case you may want to use something like Berkeley DB.

      MySQL: the speed king

      Since when? According to who?

        Sorry ... don't want to start a religious war.
Re: Basic data storage question
by daeve (Deacon) on Jul 30, 2003 at 03:21 UTC
Re: Basic data storage question
by sauoq (Abbot) on Jul 30, 2003 at 02:27 UTC
    Any other suggestions?

    A relational database.

    -sauoq
    "My two cents aren't worth a dime.";
    

      Could you elaborate on that? Should I be exploring dbm files?

        No! You must use PostgreSQL. It is the only way! ;-)

        Repeat after me: PostgreSQL is good... PostgreSQL is good...

        Should I be exploring dbm files?

        Yes, you should explore them. I don't think that they will prove to be the best choice in your situation though. They aren't relational and your data seems to imply that flexibility in accessing it should be a requirement. Just the same, a DBM would probably be better than your current scheme. It would also be relatively easy to implement.
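
        As a rough sketch of how little code a DBM takes: the example below keeps the flat file from the question and replaces the separate location file with a DBM keyed by title. It uses SDBM_File because it ships with Perl; the file names, sample records, and the sub name abstract_for are assumptions. SDBM limits each key/value pair to roughly 1 KB, which is why the sketch stores locations rather than the abstracts themselves.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT SEEK_SET);
use File::Temp qw(tempdir);
use SDBM_File;

# Hypothetical file names; a temporary directory keeps the sketch tidy.
my $dir  = tempdir(CLEANUP => 1);
my $flat = "$dir/articles.txt";

# Write a tiny tab-delimited flat file (title, authors, journal, abstract).
open my $out, '>', $flat or die "Cannot write $flat: $!";
print {$out} "Protein folding\tSmith J\tJ Mol Biol\tWe study kinase activity.\n";
print {$out} "Graph search\tLee K\tJ Algorithms\tA faster shortest-path method.\n";
close $out or die "Cannot close $flat: $!";

# Build the index: title => "file<TAB>byte offset of the line".
tie my %index, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666
    or die "Cannot tie index: $!";
open my $in, '<', $flat or die "Cannot read $flat: $!";
while (1) {
    my $offset = tell $in;          # position before reading the line
    my $line   = <$in>;
    last unless defined $line;
    my ($title) = split /\t/, $line, 2;
    $index{$title} = "$flat\t$offset";
}
close $in;

# Look up one abstract by title: one DBM fetch, one seek, one line read.
sub abstract_for {
    my ($title) = @_;
    my $where = $index{$title};
    return unless defined $where;
    my ($file, $offset) = split /\t/, $where;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    seek $fh, $offset, SEEK_SET or die "Cannot seek in $file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return (split /\t/, $line, 4)[3];
}

my $abstract = abstract_for('Graph search');
my $missing  = abstract_for('No such title');
print "$abstract\n";
untie %index;
```

        The same tie interface works with DB_File if Berkeley DB is installed, in which case the record size limit no longer applies.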

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Basic data storage question
by sulfericacid (Deacon) on Jul 30, 2003 at 03:52 UTC
    I may get downvoted for this (rather unfairly I'd say), but if you haven't used any databases before, I'd suggest you look at SDBM or DB_File. Others will argue about how powerful or efficient these are, but they're really nice for someone who doesn't have much database background and doesn't need something as complex as MySQL.

    Using these, I join all my data pieces with :: and split when I need them. IMHO, these are the easiest to work with if you're just starting with databases.
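
    The join-and-split approach might look something like the sketch below. A plain hash stands in for the tied DBM hash, and the sub names pack_record and unpack_record are invented for illustration. The check for the separator is deliberately minimal; it is there because a field that already contains :: would otherwise split into the wrong number of pieces.

```perl
use strict;
use warnings;

my $SEP = '::';

# Pack fields into one string; refuse data that already contains the separator.
sub pack_record {
    my @fields = @_;
    for my $f (@fields) {
        die "Field contains '$SEP': $f\n" if index($f, $SEP) >= 0;
    }
    return join $SEP, @fields;
}

# Unpack a record; the -1 limit keeps trailing empty fields.
sub unpack_record {
    my ($record) = @_;
    return split /\Q$SEP\E/, $record, -1;
}

# A plain hash stands in for the tied DBM hash here.
my %db;
$db{'Graph search'} = pack_record('Lee K', 'J Algorithms', '');
my ($authors, $journal, $abstract) = unpack_record($db{'Graph search'});
print "authors=$authors journal=$journal abstract='$abstract'\n";
```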

    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid

Re: Basic data storage question
by Anonymous Monk on Jul 30, 2003 at 02:39 UTC
    any suggestions?

    As a matter of fact, yes :)

    1. Visit PostgreSQL and take a look around.
    2. If possible, acquire a copy of Practical PostgreSQL. Read it.
    3. Check out some online tutorials, read some public discussions on the subject.
    4. Start your project. After you've done that, finish it ;-).
    5. Stand in awe of your newfound enlightenment.

    Oh, and remember to use unique identifiers for the articles. It's more efficient, and you don't risk having duplicates the way you do with titles.