kayj has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have file 1 with 300,000 patients; each row holds one subject's information, and the first column is the subject ID. I need to sort this file based on a certain criterion using file 2, which contains the list of subject IDs. For example, if file 2 looks like

Sub35 Sub71 Sub89 ...

Then I have to go to file 1, search for Sub35, and print the record for Sub35 and the following 2 subjects in file 1; then search for Sub71 in file 1 and print out the record for Sub71 and the 2 subjects following Sub71 in file 1.

The way I wrote the script is as follows:

Loop through file 2 using a while loop and get the subject ID. For each ID, loop over file 1 from the beginning of the file to the end using a for loop, split each row to find the subject, then print out that line and the following two lines.
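
In outline, a sketch of that approach (file names are placeholders; the ID is assumed to be the first whitespace-separated column):

use strict;
use warnings;

# Read file 1 into memory once.
open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
my @rows = <$f1>;
close $f1;

# For every subject ID in file 2, scan file 1 from the top again.
open my $ids, '<', 'file2.txt' or die "file2.txt: $!";
while ( my $line = <$ids> ) {
    for my $id ( split ' ', $line ) {
        for my $i ( 0 .. $#rows ) {
            my ($subject) = split ' ', $rows[$i];
            # case-insensitive match, since the IDs appear as both Sub35 and sub35
            next unless defined $subject && lc $subject eq lc $id;
            my $end = $i + 2 > $#rows ? $#rows : $i + 2;
            print @rows[ $i .. $end ];    # the record plus the following two
            last;
        }
    }
}
close $ids;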

The problem with this approach is that the program took over 24 hours to run. Is there a way to cut the processing time?

Thanks for your help,

Replies are listed 'Best First'.
Re: How to cut down the running time of my program?
by ikegami (Patriarch) on Oct 27, 2009 at 18:38 UTC

    Trivial using a database

    SELECT * FROM File1 WHERE id IN ( SELECT id FROM File2 )

    But really, it's also trivial using Perl.

    1. Load ids from second file into a hash
    2. For each record of File1,
      1. Extract the id,
      2. If the id is in the hash from the second file,
        1. Print the record.

    Either way, you read and process each line of the first file once, not once per line in the second file.
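
    A minimal sketch of that (file names are placeholders; the countdown handles the "plus the next two records" wrinkle from the question, which isn't part of the steps above):

    use strict;
    use warnings;

    # 1. Load the IDs from the second file into a hash.
    my %wanted;
    open my $ids, '<', 'file2.txt' or die "file2.txt: $!";
    while ( my $line = <$ids> ) {
        $wanted{ lc $_ } = 1 for split ' ', $line;
    }
    close $ids;

    # 2. Single pass over the first file: print a matching record
    #    and the two records after it.
    my $remaining = 0;
    open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
    while ( my $row = <$f1> ) {
        my ($id) = split ' ', $row;
        $remaining = 3 if defined $id && $wanted{ lc $id };
        if ($remaining) {
            print $row;
            $remaining--;
        }
    }
    close $f1;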

Re: How to cut down the running time of my program?
by davido (Cardinal) on Oct 27, 2009 at 18:48 UTC

    I think the easiest solution would be to pull the master file into a database. This may sound daunting, but DBD::SQLite is very easy to install and use, and it creates a database file that is pretty much standalone. You don't have to install a heavyweight database to use SQLite; it's self-contained.

    Your initial conversion to the database might take a while, but it won't take 24 hours! Depending on your hardware and the size of your total file, I doubt it would exceed a few hours. And your queries will be MUCH quicker.

    Your current solution runs in O(n*m) time (n records in file 1 times m IDs in file 2), if I'm not mistaken. That's fairly inefficient. To speed things up, you need to know the start point of each record in the file so that you can jump straight to it. The easiest way to do that is to let a database handle the mechanics for you. Another option is to build an index file that can be pulled into a hash mapping subject IDs to byte offsets, so that you can seek directly to the proper location in the master file. Or, if your master file uses a uniform record length, you can forgo the offsets and store just the record number for each ID.
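
    For the roll-your-own route, a minimal sketch of the offset index (file name is a placeholder; the ID is assumed to be the first whitespace-separated column, and record_for is just an illustrative helper):

    use strict;
    use warnings;

    # One pass over the master file to record each ID's byte offset.
    my %offset;
    open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
    while (1) {
        my $pos = tell $f1;
        my $row = <$f1>;
        last unless defined $row;
        my ($id) = split ' ', $row;
        $offset{ lc $id } = $pos if defined $id;
    }

    # Later lookups seek straight to the record instead of rescanning.
    sub record_for {
        my ($id) = @_;
        my $pos = $offset{ lc $id };
        return undef unless defined $pos;
        seek $f1, $pos, 0 or die "seek: $!";
        return scalar <$f1>;
    }

    my $rec = record_for('Sub35');
    print defined $rec ? $rec : "Sub35 not found\n";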

    But those solutions are really just you implementing your own version of a database. Since that's already been done, you may as well take advantage of what's already available. SQLite is an ideal solution for lightweight database work.
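
    A minimal sketch of the SQLite route (the table and column names are my own; the master file is assumed to be whitespace-separated with the ID first):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=subjects.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    # One-time load of the master file into an indexed table.
    $dbh->do('CREATE TABLE IF NOT EXISTS subjects (id TEXT PRIMARY KEY, record TEXT)');
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO subjects (id, record) VALUES (?, ?)');
    open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
    while ( my $row = <$f1> ) {
        my ($id) = split ' ', $row;
        $ins->execute( lc $id, $row ) if defined $id;
    }
    close $f1;
    $dbh->commit;

    # From then on, each lookup hits the primary-key index, not a linear scan.
    my $get = $dbh->prepare('SELECT record FROM subjects WHERE id = ?');
    $get->execute('sub35');
    while ( my ($record) = $get->fetchrow_array ) {
        print $record;
    }
    $dbh->disconnect;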


    Dave

Re: How to cut down the running time of my program?
by Argel (Prior) on Oct 28, 2009 at 00:44 UTC
    A database is probably the best route to take, but if that's not an option then you need to index file1 yourself. You might want to look into B-Trees. Mastering Algorithms with Perl covers them briefly on pages 169-171. This assumes that file1 is sorted by the subject ID (i.e. in ascending or descending order).
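
    If you'd rather not write the B-tree yourself, DB_File (bundled with Perl) can tie a hash to an on-disk B-tree; a rough sketch (file names are placeholders) that maps each subject ID to its byte offset in file1:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Build a persistent B-tree index of subject ID => byte offset in file1.
    tie my %index, 'DB_File', 'file1.idx', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "file1.idx: $!";

    open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
    while (1) {
        my $pos = tell $f1;
        my $row = <$f1>;
        last unless defined $row;
        my ($id) = split ' ', $row;
        $index{ lc $id } = $pos if defined $id;
    }

    # Any later run can reopen file1.idx and seek straight to a record.
    if ( defined( my $pos = $index{'sub35'} ) ) {
        seek $f1, $pos, 0 or die "seek: $!";
        print scalar <$f1>;
    }

    untie %index;
    close $f1;

    Unlike an in-memory hash, the index file persists between runs, so the full pass over file1 only has to happen once.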

    Elda Taluta; Sarks Sark; Ark Arks

Re: How to cut down the running time of my program?
by educated_foo (Vicar) on Oct 28, 2009 at 02:35 UTC
    The other comments mostly boil down to "use a database," but you only have 300k records. You should ask yourself whether they will all fit in memory; if so, you can just read file 1 into a hash. If you have a machine with 2GB of memory available to your script, then you have about 6k available per record.

      You're suggesting doing it the wrong way around. If the second file contains just the IDs, say 8 chars each, then load the second file into the hash instead: even if every single one of the 300,000 IDs in the first (larger) file were also in the second file, the hash would only require 13,899,476 bytes of RAM. (That's on a 64-bit OS; probably much less on a 32-bit one.)
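
      (If you want to verify that figure on your own box, Devel::Size from CPAN will report it; the generated IDs below are just made-up 8-character stand-ins.)

      use strict;
      use warnings;
      use Devel::Size qw(total_size);

      # 300,000 made-up 8-character IDs, to mimic the second file's keys.
      my %ids = map { sprintf( 'Sub%05d', $_ ) => 1 } 1 .. 300_000;
      printf "hash of 300_000 IDs: %d bytes\n", total_size( \%ids );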

      The OP can then process the first file line by line and print each record whose ID exists in the hash constructed from the second file.

      Total processing time required: ~23 seconds.

      And that's a damn sight faster than you could even load the larger file into a DB, and approximately 1% of the time required for a full RDBMS (pgsql) to perform the query for just 2,000 IDs.

      (R)DBs are a sledge-hammer for this nut...as with so many others for which they are routinely prescribed. If I had a hammer...I wouldn't bother to think!


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.