in reply to Re: Analysing a (binary) string.
in thread Analysing a (binary) string. (Solved)
Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.
That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :)
These big strings are themselves substrings within an even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string, starting and stopping at random positions, hence the "bit at the beginning and bit at the end" description.
The consequence of your observation is that while the repeat does have a definite start, I can never determine that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of the data -- but my version of it may be rotated from the real thing.
I don't think that matters for my purpose, but it is good to know.
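To make the rotation point concrete, here is a minimal sketch. Both helper names (is_rotation, canonical_rotation) are hypothetical and not part of the code posted in this thread. The first relies on the fact that every rotation of a string occurs inside that string doubled; the second picks the lexicographically smallest rotation, so that two snapshots of the same repeat can be compared regardless of where each snapshot began.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $x is a rotation of $y.
# Every rotation of $y occurs as a substring of "$y$y".
sub is_rotation {
    my( $x, $y ) = @_;
    return length( $x ) == length( $y ) && index( $y . $y, $x ) >= 0;
}

# Hypothetical helper: the lexicographically smallest rotation of $unit,
# used as a canonical form. Brute force, O(n^2); for illustration only.
sub canonical_rotation {
    my( $unit ) = @_;
    my $n       = length $unit;
    my $doubled = $unit . $unit;
    my $best    = $unit;
    for my $i ( 1 .. $n - 1 ) {
        my $r = substr $doubled, $i, $n;
        $best = $r if $r lt $best;
    }
    return $best;
}

print is_rotation( 'bcda', 'abcd' ) ? "rotation\n" : "different\n";
print canonical_rotation( 'bcda' ), "\n";
```

Which rotation counts as canonical is arbitrary; the only point is that both snapshots reduce to the same string.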
The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.
Yes, the possibility of errors is the reason for needing a new approach.
And indeed, tye's notion has allowed me both to find the repeats very quickly in samples ranging from 11MB to 31MB, and to discover that samples of 3MB through 8MB are often not enough.
This is the code I used based on his idea:
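#! perl -slw
use strict;
use Data::Dump qw[ pp ];

# Slurp the whole sample as raw bytes.
open I, '<:raw', $ARGV[0] or die $!;
my $s = do{ local $/; <I> };
close I;

$|++;
print length $s;

# Count occurrences of each byte value.
# Note: without /s, '.' does not match newline (byte 10), so $c[10] stays undef.
my @c;
++$c[ ord $1 ] while $s =~ m[(.)]g;
pp \@c;
scalar <STDIN>;

# For each byte value, highest first: record the offset of every occurrence
# and print the gaps between successive occurrences.
for( my $i = $#c; $i; --$i ) {
    next unless $c[ $i ] > 2;
    my @p;
    $p[ @p ] = $-[0] while $s =~ m[${ \chr( $i )}]g;
    my @spacing = map{ $p[ $_ + 1 ] - $p[ $_ ] } 0 .. $#p-1;
    print ">>@spacing";
    scalar <STDIN>;
}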
Which produces (severely cut down for posting):
5644800
[
  undef,
  1455300,
  1455300,
  1656200,
  386120,
  429240,
  184240,
  56840,
  11760,
  7840,
  undef,
  1960,
]
>>2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 ... for 1960 values.
Use of uninitialized value within @c in numeric gt (>) at C:\
>>685 644 813 638 813 644 685 838 685 644 813 638 813 644 685 ... for 7840 values
>>618 26 717 739 8 739 717 26 618 775 2 775 618 26 717 739 8 739 717 26 618 775 2 775 ... for 11760 values.
Terminating on signal SIGINT(2)
The obvious repetition in the first set of positional differences (2134 + 3626) sums to 5760.
That allows me to see the repetition in the second set (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760;
And in the third set (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760.
And with 3 confirmations, I know the repetition size.
Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye.
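The by-eye check above could be automated along these lines. This is only a sketch: period_from_gaps is a hypothetical helper, and the three gap lists are built by repeating the patterns quoted in the output above rather than read from a real sample. It looks for the smallest number of gaps k after which the list repeats exactly, and returns the sum of one such group. It assumes clean gaps; a corrupted byte would shift the list and defeat the exact comparison, so real data would need something more tolerant.

```perl
use strict;
use warnings;
use List::Util qw[ sum ];

# Hypothetical helper: find the smallest k such that the gap list repeats
# every k entries, and return the sum of one group of k gaps (the repeat
# length in bytes). Returns undef if no repetition is visible.
sub period_from_gaps {
    my @gaps = @_;
    K: for my $k ( 1 .. int( @gaps / 2 ) ) {
        for my $i ( $k .. $#gaps ) {
            next K if $gaps[ $i ] != $gaps[ $i - $k ];
        }
        return sum @gaps[ 0 .. $k - 1 ];
    }
    return;
}

# Gap patterns taken from the output above, repeated a few times (assumed).
my @g1 = ( 2134, 3626 ) x 4;
my @g2 = ( 685, 644, 813, 638, 813, 644, 685, 838 ) x 3;
my @g3 = ( 618, 26, 717, 739, 8, 739, 717, 26, 618, 775, 2, 775 ) x 3;

print period_from_gaps( @$_ ), "\n" for \@g1, \@g2, \@g3;
```

If all byte values that yield a result agree on the same sum, that plays the role of the "3 confirmations"; a sample too short to contain two full repeats simply yields undef.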