in reply to high speed checksum for video finger printing?

Use the filesize to seed a random number generator, and then read 100 random 4- or 8-byte chunks from the file, stick'em together and checksum them.

The odds of duplicates are billions to 1.
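
A rough sanity check on those odds, assuming the checksum is 64 bits wide (like the CRC-64 used further down this thread) and spreads distinct samples more or less uniformly over its range, which a CRC only approximates. The symbols n and p are just notation for this sketch: n is the number of same-sized files being compared, and p is the chance that at least two of them share a fingerprint by accident.

```latex
\[
  \Pr[\text{two given samples collide}] \approx 2^{-64} \approx 5.4 \times 10^{-20}
\]
\[
  p \approx \frac{n(n-1)}{2} \cdot 2^{-64},
  \qquad
  n = 10^{6} \;\Rightarrow\; p \approx 2.7 \times 10^{-8}
\]
```

This only covers accidental collisions between samples that actually differ. Two files of the same size whose differences all fall outside the sampled bytes will always get the same fingerprint.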


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: high speed checksum for video finger printing?
by faber (Acolyte) on Feb 05, 2012 at 00:01 UTC
    Ah yes, this is a great idea and could be very useful for the right kinds of data management cases. I think I'm going to do this, and will likely call it File::Fingerprint::Huge, as suggested, if no one has anything similar to this already.
      if no one has anything similar to this already.

      Nothing I've seen, so go for it.

      My suggestion would be to use Math::Random::MT as the PRNG. It is portable and reproducible cross-platform.

      Then something like:

      use Math::Random::MT qw[ rand srand ];
      use Digest::CRC qw[ crc64 ];

      sub fingerPrintFile {
          my $file     = shift;
          my $filesize = -s( $file );

          ## Seeding with the file size makes the sample positions reproducible.
          srand $filesize;

          open my $fh, '<', $file or die $!;
          binmode $fh;    ## read raw bytes, no newline translation

          ## assuming CRC-64
          my $chunks = int( $filesize / 8 ) - 1;

          ## Added sort per RichardK's suggestion below.
          my @posns = sort { $a <=> $b } map 8 * int( rand $chunks ), 1 .. 100;

          ## Read one 8-byte chunk at each position and concatenate them.
          my $rawSample = join '', map {
              seek $fh, $_, 0;
              read( $fh, my $chunk, 8 );
              $chunk;
          } @posns;

          close $fh;
          return crc64( $rawSample );
      }

        I think I would sort the chunk positions first, so that you read the file in only one direction. As you've only got a small number of blocks, the sort won't be costly, and two blocks that are close together may then fall in the same read-ahead window.

        It just might improve performance on some file systems/OSes.