in reply to Re: Best way to match a hash with large CSV file
in thread Best way to match a hash with large CSV file

The DB, if it is built with SQLite, will be so performant that it is great!

That is a very bold, some would say rash, claim, given that you appear not to understand the problem.

At some point in the run of a program the OP has a hash containing 5000 key/value pairs. At that same point in time on disk, there is a CSV file containing 120,000 lines of 10 fields.

The task is, given the hash and an integer identifying a field within each record, to produce an output file on disk containing the subset of the input file's records for which field 1 (0-based) exists as a key in the hash and the field identified by that integer matches that key's value.
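
For concreteness, here is a minimal sketch of that task done directly against the hash, using Text::CSV_XS. The file names, the field number and the contents of %lookup below are placeholders for illustration, not the OP's actual code.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my %lookup   = ( 'key1' => 'value1' );   # stand-in for the OP's 5000-pair hash
    my $field_no = 4;                        # 0-based field to compare (assumed)

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" });

    open my $in,  '<', 'input.csv'    or die "input.csv: $!";
    open my $out, '>', 'filtered.csv' or die "filtered.csv: $!";

    while ( my $row = $csv->getline($in) ) {
        my $key = $row->[1];                 # field 1 (0-based) is the hash key
        next unless exists $lookup{$key}
                and $row->[$field_no] eq $lookup{$key};
        $csv->print( $out, $row );           # keep the record, written back out as CSV
    }
    close $out or die $!;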

What you are suggesting is that in less than 6 seconds you can do all of the following (a rough sketch of such a pipeline appears after the list):

  1. Build an SQLite DB from the file of 120,000 records.
  2. Perform 5000 SQL queries -- against that 120,000 record DB -- of the form:
    select * from table where field1 = ? and field<N> = ?;
  3. Convert all the result records, from each of the 5000 queries, back to CSV records.
  4. Write them out to the new file.

    (For strict compliance, re-order the retrieved records into input file order before output.)

  5. (Delete the SQLite DB you created!)
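
For illustration only, that pipeline might look roughly like the following with DBI/DBD::SQLite and Text::CSV_XS. Every name in it (the table, the f0..f9 columns, the file names, the hash, the field number) is assumed rather than taken from the thread, and the parenthetical re-ordering step is left out.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    my %lookup   = ( 'key1' => 'value1' );   # stand-in for the 5000-pair hash
    my $field_no = 4;                        # 0-based field matched against the hash value (assumed)

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" });
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=scratch.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    # 1. Build a DB from the 120,000-record CSV file: ten TEXT columns, f0 .. f9.
    $dbh->do( 'CREATE TABLE rec (' . join( ',', map( "f$_ TEXT", 0 .. 9 ) ) . ')' );
    my $ins = $dbh->prepare( 'INSERT INTO rec VALUES (' . join( ',', ('?') x 10 ) . ')' );
    open my $in, '<', 'input.csv' or die "input.csv: $!";
    $dbh->begin_work;                        # one transaction, or the inserts crawl
    while ( my $row = $csv->getline($in) ) {
        $ins->execute(@$row);
    }
    $dbh->commit;

    # 2-4. Run the 5000 queries (field 1 = key, field N = value) and write every
    #      match back out as a CSV record. (Re-ordering the output into input-file
    #      order, as the parenthetical step requires, is not done here.)
    open my $out, '>', 'filtered.csv' or die "filtered.csv: $!";
    my $sel = $dbh->prepare( "SELECT * FROM rec WHERE f1 = ? AND f$field_no = ?" );
    while ( my ( $key, $val ) = each %lookup ) {
        $sel->execute( $key, $val );
        while ( my $row = $sel->fetchrow_arrayref ) {
            $csv->print( $out, $row );
        }
    }
    close $out or die $!;

    # 5. Throw the scratch DB away again.
    $dbh->disconnect;
    unlink 'scratch.db' or warn "unlink: $!";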

Good luck!


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: Best way to match a hash with large CSV file
by Marshall (Canon) on Nov 06, 2011 at 12:34 UTC
    1. Build an SQLite DB from the file of 120,000 records.
    No problem at all. This will take <1 second for a file with 120K lines.

    1.5) Index the DB. This step is not to be underestimated and was not mentioned.

    2. Perform 5000 SQL queries -- ... Ooooh, I did not say what kind of queries, nor how many. But essentially, yes, you can do a great many of these per second, provided that you have indexed the DB correctly in step 1.5 (see the sketch after this list). Some seconds will be required for this step.

    -- Steps 3, 4 and 5 are super trivial reformatting steps...

    3. Convert all the result records, from each of the 5000 queries, back to CSV records.

    4. Write them out to the new file.

    5. (Delete the SQLite DB you created!)
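
    To make step 1.5 concrete, here is a short sketch of the kind of index, and the single reused prepared statement, being described. It reuses the assumed scratch DB, table and column names from the sketch earlier in the thread; Marshall's actual schema and index are not shown.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;
        use Time::HiRes qw(time);

        # Assumes the scratch DB and its 'rec' (f0 .. f9) table already exist.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=scratch.db', '', '',
                                { RaiseError => 1 } );

        # Step 1.5: index the two columns the SELECT filters on.
        my $t0 = time;
        $dbh->do('CREATE INDEX IF NOT EXISTS idx_f1_f4 ON rec (f1, f4)');
        printf "index built in %.2f seconds\n", time() - $t0;

        # One statement, prepared once and executed 5000 times; with the index in
        # place each execute is an index probe rather than a scan of 120,000 rows.
        my $sel = $dbh->prepare('SELECT * FROM rec WHERE f1 = ? AND f4 = ?');
        $sel->execute( 'some-key', 'some-value' );   # repeated per hash pair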

    Update:

    Creation of a DB with 266,551 lines from more than 600 source files took 78 seconds.
    Fancy indexing took another 20 seconds.
    Basically 100 seconds and I'm "ready to roll" with 1/4 million lines.

    I have a wimpy "Prescott" machine. It is more than a decade old.

      Seeing's believin'.

      Also, indexing a DB is a part of building it.


        OK, let's go! Give me an FTP drop site and I will give you a .zip file with data and code!

        I just ran an easy test with 818 files and 266,551 DB "rows": under 100 seconds to read all the files, create the DB, and index it six ways from Sunday.