Re: Best way to find patterns in csv file?

A database like DBD::SQLite is pretty easy to use and setup and won't require sweeping through all the rows every time if you setup indexes. It seems it would do most of what you wanted to do. If you know some of the columns will always be numeric you can use the appropriate datatypes. Here I've just made a 4 data column table. Note that you can also do some tweaking to make SQLite a little faster by delaying table writes, etc - There is info on the module page. Here

#!/usr/bin/perl -w
use strict;
use DBI;

# this is just silly code to show how one might encode the
# the columns with values we want to match on, of course you
# might want fancier things like ranges, etc so consider it
# just an example
# so the 1st pattern should match rows where 10 is the 
# 1st data column and 20 in the second

my @patterns = ([10,20,undef,undef],
        [undef,undef,'green',undef]);

my $dbname = 'testdb';
unlink($dbname); # this is for testing so make sure we start 
                 # with clean file every time
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname","","",
               );

my $table = $dbh->do(<<EOF
CREATE TABLE record (
record_id int(10) PRIMARY KEY,
datum1   int,
datum2   int,
datum3   varchar(64),
datum4   double(10,3)
                    );  
KEY i_d1 (datum1);
KEY i_d2 (datum2);
KEY i_d3 (datum3);
KEY i_d4 (datum4);
EOF
          );
my $stmt = $dbh->prepare("INSERT INTO record VALUES (?,?,?,?,?)");
# data is from the stuff at the bottom, just for
# example so the script can be selfcontained
while(<DATA>){
    chomp;
    my ($id,@datum) = split(/,/,$_);
    $stmt->execute($id,@datum);
}


for my $pat ( @patterns ) {
    my ($i,$str,@vals) = (1);
    for my $p ( @$pat ) {
    if( defined $p ) {
        $str .= ' AND ' if defined $str;
        $str .= 'datum'.$i.'=?';
        push @vals, $p;
    }
    $i++;
    }
    if( $str && @vals ) { # skip empty pattern rows
    my $query = $dbh->prepare("SELECT * FROM record WHERE $str");
    $query->execute(@vals);
    while( my $r = $query->fetchrow_arrayref ) {
        print join(",",@$r),"\n"; 
    }
    }
    print "--\n"; # some sort of pattern search delimiter
}

# this is just embedding example data so this can be a 
# self contained example.
__DATA__
1,20,40,blue,800.0
2,10,20,yellow,612.23
3,1,17,green,9.89
4,77,50,red,5.102
5,8,33,orange,34.21
6,10,20,silver,998.8
[download]

Comment on Re: Best way to find patterns in csv file? Download Code

Replies are listed 'Best First'.
Re^2: Best way to find patterns in csv file? by Thilosophy (Curate) on Dec 01, 2004 at 10:55 UTC
I second the idea of using a database (and DBD::SQLite unless you already have a "real" RDBMS readily available). 1.5 million rows is a lot of stuff after all. A database will pay off especially if you plan to do this query on a regular basis. Now, if you really want to work on the CSV, I personally would run it through grep first. Suppose you need datum1 = 99 and datum2 = 130 and datum4=5. You can cut down the size of the file by throwing out all lines that do not contain "99" and "130" and "5", which is probably a lot. `grep 99 data.csv \| grep 130 \| grep 5 > filtered.csv` [download] After this preprocessing, a CSV module from CPAN might be able to handle the rest. Of course, this simple grep only works if you have AND queries (not OR). But you do, right? Update: I just saw that you have 35000 patterns to check. Forget the CSV file, use a database (and index the columns that appear in most patterns)	[reply] [d/l]

Replies are listed 'Best First'.

Re^2: Best way to find patterns in csv file?
by Thilosophy (Curate) on Dec 01, 2004 at 10:55 UTC

Now, if you really want to work on the CSV, I personally would run it through grep first. Suppose you need datum1 = 99 and datum2 = 130 and datum4=5. You can cut down the size of the file by throwing out all lines that do not contain "99" and "130" and "5", which is probably a lot.

grep 99 data.csv | grep 130 | grep 5 > filtered.csv
[download]

Of course, this simple grep only works if you have AND queries (not OR). But you do, right?

Update: I just saw that you have 35000 patterns to check. Forget the CSV file, use a database (and index the columns that appear in most patterns)

[reply]
[d/l]