Re^2: Analysing a (binary) string.

Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.

That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :)

These big strings are really themselves substrings within even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string starting and stopping at some random position, hence the "bit at the beginning and bit at the end" description.

The consequence of your observation is that while the repeat does have a definite start, I can never determines that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of th data -- but my version of it may be rotated from the real thing,

I don't think that matters for my purpose, but it is good to know.

The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.

Yes, the possibility of errors is the reason for needing a new approach.

And indeed, tye's notion has allowed me to both find the repeats in samples ranging from 11MB to 31MB very quickly; and discover that 3MB through 8MB is often not enough.

This is the code I used based on his idea:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

open I, '<:raw', $ARGV[0] or die $!;
my $s = do{ local $/; <I> };
close I;
$|++; print length $s;

my @c; ++$c[ ord $1 ] while $s =~ m[(.)]g;
pp \@c; scalar <STDIN>;

for( my $i = $#c; $i; --$i ) {
    next unless $c[ $i ] > 2;
    my @p; $p[ @p ] = $-[0] while $s =~ m[${ \chr( $i )}]g;
    my @spacing = map{ $p[ $_ + 1 ] - $p[ $_ ]   } 0 .. $#p-1;
    print ">>@spacing"; scalar <STDIN>;
}
[download]

Which produces (severely cut down for posting):

5644800
[
  undef,
  1455300,
  1455300,
  1656200,
  386120,
  429240,
  184240,
  56840,
  11760,
  7840,
  undef,
  1960,
]


>>2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 ... for 
+1960 values.

Use of uninitialized value within @c in numeric gt (>) at C:\
>>685 644 813 638 813 644 685 838 685 644 813 638 813 644 685 ... for 
+7840 values

>>618 26 717 739 8 739 717 26 618 775 2 775 618 26 717 739 8 739 717 2
+6 618 775 2 775 ... for 11760 values.
Terminating on signal SIGINT(2)
[download]

The obvious repetition in the first set of positional differences (2134 + 3626) sums to 5760.

That allows me to see the repetition in the second set (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760;

And in the third set (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760.

And with 3 confirmations, I know the repetition size.

Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re^2: Analysing a (binary) string. Select or Download Code

Replies are listed 'Best First'.
Re^3: Analysing a (binary) string. (welcome) by tye (Sage) on Jun 28, 2013 at 15:12 UTC
I'm glad it turned out to be so helpful. You're welcome, BrowserUk. - tye	[reply]
Re^3: Analysing a (binary) string. by hdb (Monsignor) on Jul 03, 2013 at 08:57 UTC
Even though you declared this problem as solved, I have been thinking about the skip ahead technique. It could be used if you know a minimum length of the pattern, say at least 100, and you know that the number of errors is small compared to the length of the pattern, say at most 5. You could then try finding a repeat of the first 100 chars somewhere later in the string using some tolerant matching technique, say at position 5000, and then lengthen the pattern to the first 5000 characters. This will not be as efficient as the original skip ahead but might speed up things somewhat. Not sure how it compares to the frequency analysis.	[reply] [d/l]
Re^4: Analysing a (binary) string. by BrowserUk (Patriarch) on Jul 03, 2013 at 09:17 UTC
Not sure how it compares to the frequency analysis. In the generic, that might be worth pursuing though my (fairly extensive) experience of fuzzy matching techniques is that they are always slow. For the specific case of my data, the frequency analysis proved so simple and efficient that it wasn't even worth timing it. My first attempt at the code worked first time and found all the reps that were there to be found in just a few seconds. There was no reason (for me) to pursue this further. If my dataset had not proven to be so amenable to the frequency analysis method -- near perfect inverse log distribution -- I might still be looking for another method, but I have plenty of other nuts to crack with this particular dataset :) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]