artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I have developed a web application that displays items on a webpage, drawing the data from a number of text files (around 200) that I don't control. I have to sort the data obtained from the files (20,000 records) and do other processing. The data needs to be current. I display only a few results at a time (20 per page). This currently takes a lot of time, and when I display the 2nd page the whole process starts over again. I also provide search results from these files using a similar mechanism. Obviously this is not a very effective mechanism. We are not using a database at this point.

I am looking for some good ideas.

Update: (after 25 minutes) Is there any solution if the data doesn't need to be absolutely current when the user goes to the second page of a listing or of search results? This is highly personalized data, which I have found out gets updated every hour.

Thanks.

Re: CGI Display and Processes.
by kappa (Chaplain) on Nov 19, 2004 at 16:06 UTC

    If you are not the one who decides whether the business switches to a database, you'd probably be better off using a database anyway, but only for yourself. Import the data from those files into a database before even thinking about efficient algorithms. You'll save yourself a good amount of hair and blood.

    You'll need some sort of timely update procedure, e.g. a cron'd incremental import script. That's the way.

    Chasing your update: whoa, you have a whole hour during which your data is stable! Import it into a database right after each update, and happily serve user search and browse requests from the database, fully powered by SQL or whatever.
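
    A minimal sketch of such a cron'd import, assuming SQLite via DBI and tab-delimited "id, name, price" records; the file names, schema, and column names here are made up for illustration:

        #!/usr/bin/perl
        # import_items.pl -- run hourly from cron to rebuild an SQLite table
        # from the text files.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=items.db", "", "",
            { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do("CREATE TABLE IF NOT EXISTS items (id INTEGER, name TEXT, price REAL)");
        $dbh->do("DELETE FROM items");    # full reload each hour; could be made incremental

        my $sth = $dbh->prepare("INSERT INTO items (id, name, price) VALUES (?, ?, ?)");

        for my $file ( glob "data/*.txt" ) {
            open my $fh, '<', $file or die "Can't open $file: $!";
            while ( my $line = <$fh> ) {
                chomp $line;
                $sth->execute( split /\t/, $line, 3 );    # assumes tab-delimited records
            }
            close $fh;
        }

        $dbh->commit;
        $dbh->disconnect;

    With something like "5 * * * * perl /path/to/import_items.pl" in the crontab, the CGI can then page with a plain SELECT ... ORDER BY name LIMIT 20 OFFSET n instead of re-reading and re-sorting 200 files on every request.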

Re: CGI Display and Processes.
by ikegami (Patriarch) on Nov 19, 2004 at 16:05 UTC

    This problem screams to be rescued by a database.

Re: CGI Display and Processes.
by BrowserUk (Patriarch) on Nov 19, 2004 at 16:48 UTC

    I also think that this screams for the use of a DB (SQL or Berkeley type). But if you don't control the process(es) generating the 200 files and you cannot arrange for them to update the DB directly, then you would still require some kind of separate process to monitor the files and import the data as it changes.

    If you're running on Win32, then Win32::ChangeNotify could form the basis of a monitoring process that runs permanently and discovers when the files change.
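
    A rough sketch of what that monitor loop might look like; the directory path, the filter, and the rebuild_cache() routine below are placeholders:

        use strict;
        use warnings;
        use Win32::ChangeNotify;

        # Watch the data directory (non-recursively) for writes to any file in it.
        my $notify = Win32::ChangeNotify->new( 'C:\data', 0, 'LAST_WRITE' )
            or die "Can't watch directory: $^E";

        while (1) {
            # Block until Windows signals a change, or time out after 60s and loop.
            if ( $notify->wait( 60_000 ) ) {
                rebuild_cache();
                $notify->reset;    # re-arm the notification before waiting again
            }
        }

        sub rebuild_cache {
            # Stub: the real routine would re-read the changed files, re-sort,
            # and write out a fresh snapshot for the CGI.
            warn "data changed at " . localtime() . "\n";
        }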

    Were I forced to do it without a DB, I would have the monitoring process maintain the 20,000 items in memory using one of the "ordered hash" modules on CPAN.

    Each time a file changes, it would re-read that file, updating its internal cache. It would then write the ordered data to a timestamped file for the CGI to read.

    Each time the CGI needed data, it would read it from the latest timestamped file. As the CGI only ever reads the file, the scope for conflicts is minimal.
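
    One way that hand-off might look, using Storable purely as an example serialization; the directory layout, file names, and helper names below are illustrative only:

        use strict;
        use warnings;
        use Storable qw(nstore retrieve);

        # Monitor side: dump the already-sorted records to a new timestamped snapshot.
        sub write_snapshot {
            my ($sorted_records) = @_;            # arrayref kept ordered in memory
            nstore( $sorted_records, "snapshots/items." . time() . ".sto" );
        }

        # CGI side: read-only, so just pick the newest snapshot and slice out one page.
        sub read_page {
            my ( $page, $per_page ) = @_;         # e.g. page 2, 20 records per page
            my ($latest) = sort { $b cmp $a } glob "snapshots/items.*.sto";
            my $records  = retrieve($latest);
            my $start    = ( $page - 1 ) * $per_page;
            my $end      = $start + $per_page - 1;
            $end = $#{$records} if $end > $#{$records};
            return @{$records}[ $start .. $end ];
        }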


    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      If you're running on Win32, then Win32::ChangeNotify could form the basis of a monitoring process that runs permanently and discovers when the files change.

      On Linux you can use SGI::FAM (File Alteration Monitor) - it is not only for SGI :)

        Is it? Looking at the source, one could get the impression that it is strictly tied to SGI's FAM...
Re: CGI Display and Processes.
by Happy-the-monk (Canon) on Nov 19, 2004 at 16:09 UTC

    The data needs to be current... when I display 2nd page, it is all over again.

    Well, your dilemma really isn't a Perl problem:
    When your data needs to be so absolutely current, you mustn't cache it.

    Caching, or finding some way other than reading and processing lots of text files on every request, might reduce the cost of generating the data.

    Improve the hardware and the algorithm if you can, and push towards migrating to a more viable solution.
    Some kind of cache or a fast database might do the trick.
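
    If a cache layer alone is enough, something like Cache::FileCache (from the Cache::Cache distribution) could hold the computed, sorted list for up to an hour; the namespace, key, and build_sorted_list() below are just examples:

        use strict;
        use warnings;
        use Cache::FileCache;

        my $cache = Cache::FileCache->new( {
            namespace          => 'item_listing',
            default_expires_in => 3600,    # the source data only changes hourly
        } );

        my $sorted = $cache->get('sorted_items');
        unless ( defined $sorted ) {
            $sorted = build_sorted_list();
            $cache->set( 'sorted_items', $sorted );
        }
        # $sorted is now an arrayref; showing a page is just a cheap array slice.

        sub build_sorted_list {
            # Stub: the real routine would do the expensive 200-file read and sort.
            return [];
        }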

    Cheers, Sören