in reply to Very Large CSV Filter

I've run into a problem: each of those modules asks me to load the file with "open", and each of these files is 2 GB, so opening them would overload my RAM. This is why I used Tie::File. Does anyone have a solution that doesn't require reading the entire file into memory, but still lets me match every line of one CSV against every line of the other (that is 1,000,000,000 records) and delete the records from that file wherever they match any of the records of a second file with about 1,000,000 records? Efficiency is key too (memory = 4 GB).

Re^2: Very Large CSV Filter
by MidLifeXis (Monsignor) on Jul 15, 2011 at 17:31 UTC

    If the files are in the proper order (sorted by email address, if I remember correctly), you can use a file-merge style solution.

    • Open your output file (O)
    • Open your input file (I), and your deletion file (D)
    • Read first record of I and D
    • while I is not at end of file
      • read next record from D while D < I and D is not at the end of file
      • send I -> O unless I == D
      • read next record from I
    • close all files

    Conversion into Perl is left as an exercise for the reader.
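
    That said, a rough sketch of the merge loop is below. It is only an illustration, not a drop-in script: the file names are placeholders, it assumes the key is the first comma-separated field (use Text::CSV or similar if your fields can contain embedded commas), and it assumes both files are already sorted by that key in plain string order, so Perl's lt/eq comparisons agree with the sort.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Placeholder file names -- substitute your own.
        my $input_file    = 'big.csv';       # ~1,000,000,000 records, sorted by key
        my $deletion_file = 'delete.csv';    # ~1,000,000 records, sorted by key
        my $output_file   = 'filtered.csv';

        open my $in,  '<', $input_file    or die "Can't open $input_file: $!";
        open my $del, '<', $deletion_file or die "Can't open $deletion_file: $!";
        open my $out, '>', $output_file   or die "Can't open $output_file: $!";

        # Extract the key (assumed to be the first comma-separated field).
        sub key_of {
            my ($line) = @_;
            chomp( my $copy = $line );
            return ( split /,/, $copy, 2 )[0];
        }

        my $del_line = <$del>;
        my $del_key  = defined $del_line ? key_of($del_line) : undef;

        while ( my $line = <$in> ) {
            my $key = key_of($line);

            # Advance the deletion file while its key is behind the input key.
            while ( defined $del_key && $del_key lt $key ) {
                $del_line = <$del>;
                $del_key  = defined $del_line ? key_of($del_line) : undef;
            }

            # Keep the record unless it matches the current deletion key.
            print {$out} $line unless defined $del_key && $del_key eq $key;
        }

        close $_ for ( $in, $del, $out );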

    sort (OS level), sort (Perl level), open, close, eof, and perlop are all potentially helpful in this task.
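
    For example, assuming the key is the first comma-separated field and a GNU-style sort(1) is available, the pre-sort for each file might look like "LC_ALL=C sort -t, -k1,1 big.csv > big.sorted.csv" (placeholder file names; LC_ALL=C keeps the sort in plain byte order so it matches Perl's string comparisons). Whatever ordering you choose, both files must be sorted the same way, and the comparison in the merge must use that same ordering.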

    Be aware that you are dealing with 1 billion records, so depending on the complexity of the records and of the comparison, the sort or filter step could take a while.

    Benefits: only one record from each of the input and deletion files is in memory at a time.

    --MidLifeXis