diyaz has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to add a long DNA sequence into my database. Reading some other threads here, it was suggested to use CRC32 as a way to implement a unique constraint on fields where the length of the string is too long for unique key.

I am trying to check if a DNA sequence already exists in my database by converting it with String::CRC32 into CRC32 hash. However, I can't seem to use pack or unpack to properly store the data type for comparison to my database.
my $crc32=crc32("$sequence"); #check crc32 of sequence my $th=$dbh->prepare(qq(SELECT COUNT(1) FROM Sequence WHERE Seq_Checks +um=$crc32)); #check if it already exists $th->execute(); my $test=$th->fetch()->[0]; print "$test\n"; #output if it exists
This gives me a segmentation fault this is mySQL 5.7 and perl 5.22.1

UPDATE: Good lord. After reading a lot about the CRC32 function in SQL. I learned that it spits out a 32bit unsigned integer which is basically 0-4million something. So this whole time I assumed it was giving me binary so I was storing it binary. I switched it to unsigned int size 10 and that is the exact size of 32bit. So I don't have to mess with binary values.

Replies are listed 'Best First'.
Re: checking mySQL database for unique CRC32 as proxy for long string
by hippo (Archbishop) on Jan 22, 2016 at 22:54 UTC
    my $th=$dbh->prepare(qq(SELECT COUNT(1) FROM Sequence WHERE Seq_Checksum=$crc32));

    Don't do that - use placeholders instead.

    This gives me a segmentation fault

    That should not happen. Please provide an SSCCE.

       my database [...]
      This gives me a segmentation fault

      Wow. Did you report that back to wherever that database was manufactured?

      (For instance, if it were PostgreSQL they would want to know about such a awful bug.)

      (And, by the way, if it were postgres you could use md5 as a db-function, no need for external checksumming.)

      I commented out stuff I was just testing. I don't know what placeholders are in this context. I think the tutorial I am reading online has a section on it that I am not up to.
      #!\C:\Strawberry\perl\bin\perl use strict; use warnings; use DBI; use DBD::mysql; use String::CRC32; #DATA SOURCE NAME my $dsn = "dbi:mysql:database=$database;host=$host;port=$port"; #PERL DBI CONNECT my $dbh=DBI->connect($dsn,$user,$pw, { PrintError => 0, RaiseError => 1, AutoCommit => 1, }) or die $DBI::errstr; #Read input file to load into DB open(INFILE, "<$ARGV[0]") or die "cannot open file:$!\n"; <INFILE>; #skip header for(<INFILE>){ #split line my @line=split("\t", $_); #check correct number of columns if (scalar(@line)>=11) { #parse columns my $chr=$line[0]; my $pos=$line[1]; my $sequence=$line[2]; #insert sequence my $crc32=crc32("$sequence"); #check crc32 of sequence print unpack($crc32)."\n"; my $th=$dbh->prepare(qq(SELECT COUNT(1) FROM Sequence WHERE Se +q_Checksum=$crc32)); $th->execute(); my $test=$th->fetch()->[0]; print "$test\n"; #if ($th->fetch()->[0]) { # print "Sequence already in database!\n"; #} #else { # print "New sequence detected, inserting into database wit +h unique checksum.\n"; # $dbh->do("INSERT INTO Sequence(Sequence) VALUES($sequence +)"); # $dbh->do("INSERT INTO Sequence(Seq_Checksum) VALUES($crc3 +2)"); # #} # } }
        ah I see how to do placeholder. That's a neat trick.
        my $th=$dbh->prepare(qq(SELECT COUNT(1) FROM Sequence WHERE Seq_Checks +um=?)); $th->execute($crc32);
Re: checking mySQL database for unique CRC32 as proxy for long string
by GotToBTru (Prior) on Jan 22, 2016 at 22:29 UTC

    How large of a string are you using here? Perhaps you might start with smaller strings while you figure the algorithm out. Also, at what point does it fault?

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

      about 7000 characters. I guess technically if I am manually checking if the item already exists then I don't need the checksum because I could just pull it to perl and check, but I guess checking a few bytes of binary is quicker than 7000char for each row.