rochlin has asked for the wisdom of the Perl Monks concerning the following question:

I have a mono WAV audio file and a second, much smaller WAV file that contains a short clip from it. All I need to know is the approximate start point of the clip within the larger audio file. The clip may be repeated, so I would like the byte position of each occurrence. I'm looking for particular sounds in a speech file.

My natural thought was to use a regexp in Perl. I'm running ActiveState Perl 5.8 on Windows, so I know I have to use binmode to deal with binary files. Some trial and error revealed that to search binary data with a regexp I have to escape all of the regexp metacharacters in the clip I'm searching for. Here's some code just to try to get the basic concept working:

#!/usr/bin/perl -w
use strict;

my $wordsound = 'balks.wav';
my $lettersound = 'a.wav';
my $letter;

open (WORDSOUND, $wordsound) or die "can't open $wordsound: $!";
binmode (WORDSOUND);
open (LETTERSOUND, $lettersound) or die "can't open $lettersound: $!";
binmode (LETTERSOUND);

#actual data starts on byte 4097
seek (LETTERSOUND, 4097,0);

#100 bytes ought to be OK for identifying the clip
my $error = read (LETTERSOUND, $letter, 100);
my $word = <WORDSOUND>;

#escape all potentially nasty characters with \'s
#I'm not sure if I'm escaping everything I need to (?)
$letter =~ s/(\\|\||\(|\)|\[|\{|\^|\$|\*|\+|\?|\.)/\\$&/gsm;
$word =~ /$letter/gms;
print pos($word);

All I get is
"Use of uninitialized value in print at parsesound.pl line 25, <WORDSOUND> line 1."

If anyone has ever looked for a binary string within a binary file and has any code, that'd be great. I would have thought it would be a pretty routine thing, but googling didn't yield much. I didn't see it in the Perl Monks search either. Hmmmm.

Replies are listed 'Best First'.
Re: searching for a binary string in a binary file
by Zaxo (Archbishop) on Dec 12, 2004 at 00:55 UTC

    Since you're looking for the offset of an exact match, the index function is ideal for you. There's no need to use an escaped regex (which could be done with /\Q$letter\E/).

    Also, with a binary file, reading with the diamond op is chancy. It returns only the data up to the first newline byte, and you don't know where the audio may contain one. You clearly want to slurp the whole file into $word, so undefine $/ to make diamond do that.

    my $word = do { local $/; <WORDSOUND> };
    my $offset = index $word, $letter;
    If you need more sophisticated analysis, take a look at PDL.
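
    Since the clip may be repeated, index can also be called in a loop, using its optional third argument (the position to start searching from). What follows is only a minimal sketch, assuming both files fit in memory; the helper name find_all_offsets and the sample bytes are made up for illustration and do not come from the original post:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the byte offset of every occurrence of
    # $needle in $haystack, overlapping occurrences included.
    sub find_all_offsets {
        my ($haystack, $needle) = @_;
        my @offsets;
        return @offsets if $needle eq '';    # an empty needle would never stop matching
        my $from = 0;
        while ((my $at = index($haystack, $needle, $from)) != -1) {
            push @offsets, $at;
            $from = $at + 1;                 # resume one byte past this hit
        }
        return @offsets;
    }

    # Made-up binary data standing in for the two wav files. The clip contains
    # a CRLF pair and regexp metacharacters, none of which need escaping.
    my $letter = "hi\r\n.*";
    my $word   = "\x00\x01" . $letter . "\x00" . $letter . "Z";
    print join(", ", find_all_offsets($word, $letter)), "\n";    # prints "2, 9"
    ```

    With real files, $word and $letter would be filled by slurping the binmode'd handles instead of the literals used here.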

    After Compline,
    Zaxo

      You clearly want to slurp the whole file
      What if the file is hundreds of megs? Here's a little something I whipped up:
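      use strict;
      use warnings;

      my $buf_size = 16_384;

      open(my $big, shift) or die;
      open(my $small, shift) or die;
      # both files are binary, which matters on Windows
      binmode($big);
      binmode($small);

      my $search_string;
      {
          local $/;
          $search_string = <$small>;
      }

      my $buffer = "";
      my $pos = 0;    # file offset of the first byte held in $buffer
      while (sysread($big, $buffer, $buf_size, length($buffer))) {
          if ((my $index = index($buffer, $search_string)) != -1) {
              print "FOUND! found the search string at position ", $pos + $index;
              exit;
          }
          # drop the first half of the buffer; the kept half catches a match
          # that straddles two reads, and $pos advances by the bytes dropped
          my $drop = int(length($buffer) / 2);
          $buffer = substr($buffer, $drop);
          $pos += $drop;
      }
      print "search string not found";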

      I'll grant that this also falls prey to the same problem that yours has if the smaller file is itself large, but I think that in general this scales much better.
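
      The original question asked for every occurrence rather than only the first. Here is a rough sketch of a chunked search that collects all of them, assuming the search string itself fits in memory. It keeps the last length($needle) - 1 bytes between reads, which is the longest tail that could still begin a match. The name find_all_in_stream, the chunk size, and the sample data are all hypothetical:

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: scan filehandle $fh in chunks of $chunk_size bytes
      # and return the file offset of every occurrence of $needle.
      sub find_all_in_stream {
          my ($fh, $needle, $chunk_size) = @_;
          die "empty search string" if $needle eq '';
          my $keep   = length($needle) - 1;    # longest tail that could begin a match
          my $buffer = '';
          my $base   = 0;                      # file offset of the first byte in $buffer
          my @offsets;
          # read() appends the new chunk to whatever tail is still in $buffer
          while (read($fh, $buffer, $chunk_size, length $buffer)) {
              my $from = 0;
              while ((my $at = index($buffer, $needle, $from)) != -1) {
                  push @offsets, $base + $at;
                  $from = $at + 1;
              }
              # Discard everything except the last $keep bytes.
              my $drop = length($buffer) - $keep;
              if ($drop > 0) {
                  $base += $drop;
                  $buffer = substr($buffer, $drop);
              }
          }
          return @offsets;
      }

      # Made-up data read through an in-memory filehandle instead of a real wav file.
      my $data = ("\x00" x 10) . "clip" . ("\xff" x 7) . "clip" . "\x01";
      open(my $fh, '<', \$data) or die "can't open in-memory file: $!";
      binmode($fh);
      print join(", ", find_all_in_stream($fh, "clip", 8)), "\n";    # prints "10, 21"
      ```

      Because the kept tail is one byte shorter than the search string, a match that was already reported can never fit inside it, so nothing is counted twice.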

      thor

      Feel the white light, the light within
      Be your own disciple, fan the sparks of will
      For all of us waiting, your kingdom will come

Re: searching for a binary string in a binary file
by Mr. Muskrat (Canon) on Dec 13, 2004 at 04:00 UTC