Locke-Eros has asked for the wisdom of the Perl Monks concerning the following question:

So I wrote a script that takes a large data set (CSV) and filters out records based on another large data set (also CSV). They both match on the email field, and my script accounts for where that field sits in each file. The final data set is written out to a third file. Caveat: I'm a newbie.

#!/usr/bin/perl -w
use strict;
use Text::CSV;
use Tie::File;

# load all args into vars
my ($userCSV, $supCSV, $ufield, $sfield, $output) = @ARGV;

# tie each file to an array for line-by-line access
tie my @userList, 'Tie::File', $userCSV or die "Cannot tie $userCSV: $!";
tie my @supList,  'Tie::File', $supCSV  or die "Cannot tie $supCSV: $!";
tie my @output,   'Tie::File', $output  or die "Cannot tie $output: $!";

# CSV parser objects
my $uCSV = Text::CSV->new();
my $sCSV = Text::CSV->new();

# iterate over each line of the user list
foreach my $line (@userList) {
    # parse the line as CSV
    if ($uCSV->parse($line)) {
        # load the field values into an array
        my @userCols = $uCSV->fields();
        # run a check against the suppression list
        if (check($userCols[$ufield])) {
            # write to output
            push @output, $line;
        }
    }
}

# check whether an email exists in the suppression list
sub check {
    my ($eCheck) = $_[0];
    # loop through the suppression list and get each line
    foreach my $line (@supList) {
        # parse the line as CSV
        if ($sCSV->parse($line)) {
            # load the field values into an array
            my @supCols = $sCSV->fields();
            if ($eCheck eq $supCols[$sfield]) {
                print "Busted\n";
                return 0;
            }
        }
    }
    print "Looped\n";
    return 1;
}

So here's my problem. When I run it against very small files for both inputs, the output comes out correctly; when I run it in production against the very large files, it does whatever it does, but the output file shows nothing. I'm not good at error checking; -w and strict are all I know. Is there a more efficient way to achieve this? Is there an error in my code that made my test case a fluke success? Any and all guidance is welcome. Thanks.

Re: Very Large CSV Filter
by Tux (Canon) on Jul 14, 2011 at 17:30 UTC
Re: Very Large CSV Filter
by ww (Archbishop) on Jul 14, 2011 at 16:47 UTC
    "-w and strict are all i know."
    Time to add another weapon to your armory:

    use diagnostics, on which you'll do well to read perldoc diagnostics (with careful attention to the discussion of splain).
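
    For instance, the pragma just needs to be switched on near the top of the script:

    #!/usr/bin/perl
    use strict;
    use warnings;     # roughly what the -w switch gives you, but lexically scoped
    use diagnostics;  # expands terse warnings into the longer explanations from perldiag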

    Re "test case" and "fluke"-- relying on a single test case is better than non-testing, but if you have any reason to suspect something odd about your "small files" then you'll do well to expand your testing with another data set.

    OTOH, given your decision to open all three files involved and your description of the (production?) files as "(v)ery large," you might want to check memory use on the production machines (that last is a guess, supported by nothing more than a reading of the code).

Re: Very Large CSV Filter
by MidLifeXis (Monsignor) on Jul 14, 2011 at 17:56 UTC

    In addition to the recommendation to use a DBD module (which maps the files as tables and so allows a JOIN or MINUS across them), have you considered doing this at the Unix shell level? sort and join could accomplish what you are attempting, as long as the CSV is simple enough (no embedded / quoted newlines or delimiters).
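
    A minimal sketch of the DBD::CSV route, assuming both files have a header row containing an email column (the file names, table names, and column name below are placeholders), might look roughly like this:

    use strict;
    use warnings;
    use DBI;

    # Treat CSV files in the current directory as database tables.
    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir      => ".",
        RaiseError => 1,
    }) or die $DBI::errstr;

    # Map table names onto the actual files (placeholder file names).
    $dbh->{csv_tables}{users}    = { f_file => "users.csv" };
    $dbh->{csv_tables}{suppress} = { f_file => "suppress.csv" };

    # Keep only the users whose email never appears in the suppression list.
    my $sth = $dbh->prepare(
        "SELECT users.email FROM users
           LEFT JOIN suppress ON users.email = suppress.email
          WHERE suppress.email IS NULL"
    );
    $sth->execute;
    while (my $row = $sth->fetchrow_arrayref) {
        print "$row->[0]\n";    # or write the surviving record to the output file
    }

    Bear in mind that the join still has to read both files (and SQL::Statement may hold a good deal of them in memory), so for the multi-gigabyte case mentioned further down, the sort/join or merge approaches are likely to be more practical.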

    --MidLifeXis

Re: Very Large CSV Filter
by Locke-Eros (Initiate) on Jul 14, 2011 at 20:22 UTC
    You guys rock! I'll look into DBD::CSV and Text::CSV_XS tonight. Can someone elaborate on the sort and join idea? I'll post what happens later tonight.

      join -t , -j1 N1 -j2 N2 F1 F2 would take two comma-separated files, F1 and F2, each sorted on its email field (field N1 in F1, field N2 in F2), and join them. See the man pages mentioned before for other flags that can be used.

      It has been said that you can build a pretty complete DB with just the common unix commands (grep, sed, join, sort, uniq). Not that it would be overly fast, but it could be done. Since perl incorporates parts of each of these utilities, I would probably reach for a perl solution for anything beyond a trivial need.

      --MidLifeXis

Re: Very Large CSV Filter
by Locke-Eros (Initiate) on Jul 15, 2011 at 15:08 UTC
    I've run into a problem. Each of those modules asks me to load the file with "open", and each of these files is about 2GB; opening them would overload my RAM, which is why I used Tie::File in the first place. Does anyone have a solution that doesn't require reading the entire file into memory, but still lets me match every line of one CSV against every line of the other? That is 1,000,000,000 records, and the script has to delete the records from that file that match any of the records in a second file of about 1,000,000 records. Efficiency is key too (memory = 4GB).

      If the files are in the proper order (sorted by email address, if I remember correctly), you can use a file merge type solution.

      • Open your output file (O)
      • Open your input file (I), and your deletion file (D)
      • Read first record of I and D
      • while I is not at end of file
        • read next record from D while D < I and D is not at the end of file
        • send I -> O unless I == D
      • close all files

      Conversion into perl is left as an exercise to the reader.

      sort (OS level), sort (Perl level), open, close, eof, and perlop are all potentially helpful in this task.

      Be aware that you are dealing with 1 billion records, so it is likely, depending on the complexity of the records and comparison, that the sort or filter step could take a while.

      Benefits: only one record from each of the input and deletion files is in memory at a time.
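
      A minimal Perl sketch of the merge loop described above, assuming both files have already been sorted on their email columns (so plain string comparison lines up) and that the email can be pulled out with Text::CSV as in the original script; the file names and column positions below are placeholders:

      use strict;
      use warnings;
      use Text::CSV;

      # Placeholder file names and 0-based email column positions.
      my ($in_file, $del_file, $out_file) = ('users_sorted.csv', 'suppress_sorted.csv', 'filtered.csv');
      my ($in_col, $del_col) = (2, 0);

      my $csv = Text::CSV->new({ binary => 1 });

      open my $in,  '<', $in_file  or die "Cannot open $in_file: $!";
      open my $del, '<', $del_file or die "Cannot open $del_file: $!";
      open my $out, '>', $out_file or die "Cannot open $out_file: $!";

      # Return the email plus the raw line from the next parsable record, or nothing at EOF.
      sub next_email {
          my ($fh, $col) = @_;
          while (defined(my $line = <$fh>)) {
              next unless $csv->parse($line);        # skip lines Text::CSV cannot parse
              return (($csv->fields)[$col], $line);
          }
          return;
      }

      my ($del_email) = next_email($del, $del_col);

      while (my ($email, $line) = next_email($in, $in_col)) {
          # Advance the deletion file while its email sorts before the current one.
          while (defined $del_email && $del_email lt $email) {
              ($del_email) = next_email($del, $del_col);
          }
          # Keep the record unless its email matches the current deletion email.
          print {$out} $line unless defined $del_email && $del_email eq $email;
      }

      close $_ for ($in, $del, $out);

      If the two files were sorted under different rules (say, one case-sensitively and one not), the comparisons would need to be normalised to match, for example by lowercasing the emails on both sides.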

      --MidLifeXis