Locke-Eros has asked for the wisdom of the Perl Monks concerning the following question:

So I wrote a script that takes a large data set (CSV) and filters out records based on another large data set (also CSV). They both match on the email field, and my script accounts for where that field sits in each file. The final data set is written out to a third file. Caveat: I'm a newbie.

#!/usr/bin/perl -w
use strict;
use Text::CSV;
use Tie::File;

# load all args into vars
my ($userCSV, $supCSV, $ufield, $sfield, $output) = @ARGV;

# tie each file to an array for line-by-line access
tie my @userList, 'Tie::File', $userCSV or die "Cannot tie $userCSV: $!";
tie my @supList,  'Tie::File', $supCSV  or die "Cannot tie $supCSV: $!";
tie my @output,   'Tie::File', $output  or die "Cannot tie $output: $!";

# CSV parser objects
my $uCSV = Text::CSV->new();
my $sCSV = Text::CSV->new();

# iterate over each line of the user list
foreach my $line (@userList) {
    # parse the line as CSV
    if ($uCSV->parse($line)) {
        # load the field values into an array
        my @userCols = $uCSV->fields();
        # run a check against the suppression list
        if (check($userCols[$ufield])) {
            # write to output
            push @output, $line;
        }
    }
}

# check whether an email exists in the suppression list
sub check {
    my ($eCheck) = $_[0];
    # loop through the suppression list and get each line
    foreach my $line (@supList) {
        # parse the line as CSV
        if ($sCSV->parse($line)) {
            # load the field values into an array
            my @supCols = $sCSV->fields();
            if ($eCheck eq $supCols[$sfield]) {
                print "Busted\n";
                return 0;
            }
        }
    }
    print "Looped\n";
    return 1;
}

So here's my problem. When I run it against very small files for both inputs, the output comes out correctly; when I run it in production against the very large files, it does whatever it does, but the output file shows nothing. I'm not good at error checking; -w and strict are all I know. Is there a more efficient way to achieve this? Is there an error in my code that made my test case a fluke success? Any and all guidance is welcome. Thanks.

Re: Very Large CSV Filter
by Tux (Canon) on Jul 14, 2011 at 17:30 UTC
Re: Very Large CSV Filter
by ww (Archbishop) on Jul 14, 2011 at 16:47 UTC
    "-w and strict are all i know."
    Time to add another weapon to your armory:

    use diagnostics, on which you'll do well to read perldoc diagnostics (with careful attention to the discussion of splain).
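
    For instance, the pragma just needs to be switched on near the top of the script:

    #!/usr/bin/perl
    use strict;
    use warnings;     # roughly what the -w switch gives you, but lexically scoped
    use diagnostics;  # expands terse warnings into the longer explanations from perldiag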

    Re "test case" and "fluke"-- relying on a single test case is better than non-testing, but if you have any reason to suspect something odd about your "small files" then you'll do well to expand your testing with another data set.

    OTOH, given your decision to open all three files involved and your description of the (production?) files as "(v)ery large," you might want to check memory use on the production machines (that last is a guess, supported by nothing more than a reading of the code).

Re: Very Large CSV Filter
by MidLifeXis (Monsignor) on Jul 14, 2011 at 17:56 UTC

    In addition to the recommendation to use a DBD module (which maps the files as tables and so allows a JOIN or MINUS across them), have you considered doing this at the Unix shell level? sort and join could accomplish what you are attempting, as long as the CSV is simple enough (no embedded / quoted newlines or delimiters).
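
    A minimal sketch of the DBD::CSV route, assuming both files have a header row containing an email column (the file names, table names, and column name below are placeholders), might look roughly like this:

    use strict;
    use warnings;
    use DBI;

    # Treat CSV files in the current directory as database tables.
    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir      => ".",
        RaiseError => 1,
    }) or die $DBI::errstr;

    # Map table names onto the actual files (placeholder file names).
    $dbh->{csv_tables}{users}    = { f_file => "users.csv" };
    $dbh->{csv_tables}{suppress} = { f_file => "suppress.csv" };

    # Keep only the users whose email never appears in the suppression list.
    my $sth = $dbh->prepare(
        "SELECT users.email FROM users
           LEFT JOIN suppress ON users.email = suppress.email
          WHERE suppress.email IS NULL"
    );
    $sth->execute;
    while (my $row = $sth->fetchrow_arrayref) {
        print "$row->[0]\n";    # or write the surviving record to the output file
    }

    Bear in mind that the join still has to read both files (and SQL::Statement may hold a good deal of them in memory), so for the multi-gigabyte case mentioned further down, the sort/join or merge approaches are likely to be more practical.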

    --MidLifeXis

Re: Very Large CSV Filter
by Locke-Eros (Initiate) on Jul 14, 2011 at 20:22 UTC
    You guys rock! I'll look into DBD::CSV and Text::CSV_XS tonight. Can someone elaborate on the sort and join idea? I'll post what happens later tonight.

      join -t , -j1 N1 -j2 N2 F1 F2 would take two comma-separated files, F1 and F2, each sorted on its email field (field N1 in F1, field N2 in F2), and join them. See the man pages mentioned before for other flags that can be used.

      It has been said that you can build a pretty complete DB with just the common unix commands (grep, sed, join, sort, uniq). Not that it would be overly fast, but it could be done. Since perl incorporates parts of each of these utilities, I would probably reach for a perl solution for anything beyond a trivial need.

      --MidLifeXis

Re: Very Large CSV Filter
by Locke-Eros (Initiate) on Jul 15, 2011 at 15:08 UTC
    I've run into a problem. Each of those modules asks me to load the file with "open", and each of these files is about 2GB; opening them would overload my RAM, which is why I used Tie::File in the first place. Does anyone have a solution that doesn't require reading the entire file into memory, but still lets me match every line of one CSV against every line of the other? That is 1,000,000,000 records, and the script has to delete the records from that file that match any of the records in a second file of about 1,000,000 records. Efficiency is key too (memory = 4GB).

      If the files are in the proper order (sorted by email address, if I remember correctly), you can use a file merge type solution.

      • Open your output file (O)
      • Open your input file (I), and your deletion file (D)
      • Read first record of I and D
      • while I is not at end of file
        • read next record from D while D < I and D is not at the end of file
        • send I -> O unless I == D
      • close all files

      Conversion into perl is left as an exercise to the reader.

      sort (OS level), sort (Perl level), open, close, eof, and perlop are all potentially helpful in this task.

      Be aware that you are dealing with 1 billion records, so it is likely, depending on the complexity of the records and comparison, that the sort or filter step could take a while.

      Benefits: only one record from each of the input and deletion files is in memory at a time.
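
      A minimal Perl sketch of the merge loop described above, assuming both files have already been sorted on their email columns (so plain string comparison lines up) and that the email can be pulled out with Text::CSV as in the original script; the file names and column positions below are placeholders:

      use strict;
      use warnings;
      use Text::CSV;

      # Placeholder file names and 0-based email column positions.
      my ($in_file, $del_file, $out_file) = ('users_sorted.csv', 'suppress_sorted.csv', 'filtered.csv');
      my ($in_col, $del_col) = (2, 0);

      my $csv = Text::CSV->new({ binary => 1 });

      open my $in,  '<', $in_file  or die "Cannot open $in_file: $!";
      open my $del, '<', $del_file or die "Cannot open $del_file: $!";
      open my $out, '>', $out_file or die "Cannot open $out_file: $!";

      # Return the email plus the raw line from the next parsable record, or nothing at EOF.
      sub next_email {
          my ($fh, $col) = @_;
          while (defined(my $line = <$fh>)) {
              next unless $csv->parse($line);        # skip lines Text::CSV cannot parse
              return (($csv->fields)[$col], $line);
          }
          return;
      }

      my ($del_email) = next_email($del, $del_col);

      while (my ($email, $line) = next_email($in, $in_col)) {
          # Advance the deletion file while its email sorts before the current one.
          while (defined $del_email && $del_email lt $email) {
              ($del_email) = next_email($del, $del_col);
          }
          # Keep the record unless its email matches the current deletion email.
          print {$out} $line unless defined $del_email && $del_email eq $email;
      }

      close $_ for ($in, $del, $out);

      If the two files were sorted under different rules (say, one case-sensitively and one not), the comparisons would need to be normalised to match, for example by lowercasing the emails on both sides.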

      --MidLifeXis