in reply to String Compression Optimization (repeated concatenated subsequences)

If I've understood the problem, this is much less complex than the LZ case as your looking to replace sequential repeats rather sub-sequences anywhere in the string.

I won't swear the results of the below are optimal in all cases-- that's a huge task to verify, and damned hard to analyse--but on the various tests I've run it seems to a fair job pretty quickly.

#! perl -slw use strict; our $L ||= 200; our $SEED ||= 1; our $RANGE ||= "'a'..'c'"; #srand $SEED; my @chars = eval $RANGE; my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; my @rls; use constant { BASESEQ => 0, SAVED => 1, OFFSET => 2, LENGTH +=> 3, REPEATS => 4 }; ## Find runs an record them push @rls, [ $1, length $2, pos($s) - length( $1 ), length( $1 ) + length( $2 ), length( $2 ) / length( $1 ), ] while $s =~ m[(.+)(?=(\1{1,254}))]g; ## Eliminate overlaps deleting the ## lesser saving in each case. for my $p ( 0 .. $#rls ) { next unless defined $rls[ $p ]; for my $q ( $p+1 .. $#rls ) { next unless defined $rls[ $p ] and defined $rls[ $q ]; last if $rls[ $q ][ OFFSET ] > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] && $rls[ $q ][ OFFSET ] < $rls[ $p ][ OFFSET ] + $rls[ $p ][ +LENGTH ] and delete $rls[ $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] ? $p : $q ]; } } ## Remove the overlaps @rls = grep defined, @rls; ## Display the results print $s; print ' ' x $_->[ OFFSET ], uc( $_->[ BASESEQ ] ), $_->[ BASESEQ ] x $ +_->[ REPEATS ] for @rls; __END__

Some results (I used an input range of 'a'..'b' so as to produce a large number of overlapps for testing).

[ 2:44:41.87] P:\test>390333 -L=70 -RANGE='a'..'b' abaabbaaaabbaaaabaabaabaaaababaababbbbbaababbaabbaaaaabbbbaabbaababbba + AABBAAaabbaa AABaabaab AABABaabab Bbb ABBAabba Aaa BBAAbbaa Bbb [ 2:44:42.69] P:\test>390333 -L=70 -RANGE='a'..'b' abbbaaabbaabbbbbbbbbaaabbaaaaabbbabaaaabaaababaababbabbaaaaaabbbbbabbb + Bbb AABBaabb BBAAAbbaaa Aa Bbb AABABaabab Bb AAAaaa BBbb Bbb [ 2:44:43.55] P:\test>390333 -L=70 -RANGE='a'..'b' aabbbaaaabaabbbbbbbabaaabaaaaaabaabbbabaabbaabbaababbaaabaabaaabbababb + Aa Bbb AABaab BBbb BAAAbaaa AABaab BAABbaabbaab Bb AABaab Aaa BAba Bb [ 2:44:44.37] P:\test>390333 -L=70 -RANGE='a'..'b' aabbbbaaabbbabababbbabbbbbababababbaabaabaaaabaababbaabbabbabbbbaabbbb + Aa BBbb Aaa BAbaba BBbb BABAbaba BAAbaabaa AABaab ABBabbabb Bb Aa BBbb

UPDATED: Here's a somewhat improved version that recursively processes the string so that sub-runs rejected because they overlapped are reprocessed.

#! perl -slw use strict; use constant { BASESEQ => 0, SAVED => 1, OFFSET => 2, LENGTH => 3, REPEATS => 4 }; our $L ||= 200; our $RANGE ||= "'a'..'c'"; #srand $SEED if $SEED; sub findRuns { my $s = shift; my @rls; push @rls, [ $1, length $2, pos($s) - length( $1 ), length( $1 ) + length( $2 ), length( $2 ) / length( $1 ) + 1, ] while $s =~ m[ .*? ( [a-z]+ ) (?= ( \1{1,254} ) [^)]* (?: \( | $ ) ) ]gx; for my $p ( 0 .. $#rls ) { next unless defined $rls[ $p ]; for my $q ( $p+1 .. $#rls ) { next unless defined $rls[ $p ] and defined $rls[ $q ]; last if $rls[ $q ][ OFFSET ] > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; while( $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] && $rls[ $q ][ OFFSET ] < $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ] ) { if( $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] ) { last unless $rls[ $p ][ REPEATS ]-- > 2; $rls[ $p ][ REPEATS ] -= length $rls[ $p ][ BASESE +Q ]; } else { last unless $rls[ $q ][ REPEATS ]-- > 2; $rls[ $q ][ REPEATS ] -= length $rls[ $q ][ BASESE +Q ]; $rls[ $q ][ OFFSET ] += length $rls[ $q ][ BASESE +Q ]; } } $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] && $rls[ $q ][ OFFSET ] < $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ] and delete $rls[ $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] ? $p : $q ]; } } @rls = grep defined, @rls; for ( @rls ) { my $re = qr[(?:$_->[ BASESEQ ]){$_->[ &REPEATS ]}]; $s =~ s[ ( (?: ^ | \) ) [^(]*? ) $re] { ($1||'') . '(' . $_->[ REPEATS ] . ":$_->[ BASESEQ ]) +" }ex; } $s = findRuns( $s ) if $s =~ m[ (?: ^ | \) ) [^(]*? ( [a-z]+ ) \1 [^)]*? (?: \( | $ ) ]x; return $s; } my @chars = eval $RANGE; my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; print scalar localtime, ' : ', $s, $/; my $encoded = findRuns $s; print scalar localtime, ' : ', $encoded, $/; $encoded =~ s[\((\d+):([^)]+)\)][$2 x $1]eg; print scalar localtime, ' : ', $encoded, $/;

Some output showing the encoded and decoded versions match:

[ 6:10:08.93] P:\test>390333 -L=70 -RANGE='a'..'b' Sun Sep 12 06:11:42 2004 : aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab + Sun Sep 12 06:11:42 2004 : (2:a)(2:b)(2:baba)(2:a)b(3:a)(3:ab)ba(2:baaa)ba(2:abb)(3:ba)b(2:aabba) +a(3:b)ab(2:aa)b Sun Sep 12 06:11:42 2004 : aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab + [ 6:11:42.57] P:\test>390333 -L=70 -RANGE='a'..'b' Sun Sep 12 06:11:46 2004 : bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa + Sun Sep 12 06:11:46 2004 : (2:b)(3:a)(2:abb)b(3:ba)a(3:b)(2:a)(2:ababb)b(2:bbbba)b(2:a)(2:aab)ab( +2:a)(2:aabb)b(3:a) Sun Sep 12 06:11:46 2004 : bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa

A longer (10,000 random 'a's & 'b's) string processed in around 11 seconds.

Funky formatting courtesy of PM's codewrap "feature"!


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: String Compression Optimization (repeated concatenated subsequences)
by QM (Parson) on Sep 14, 2004 at 01:10 UTC
    Thanks for your code. It seems to work as advertised, but I'll have to run more tests ;)

    I've listed some bounds above, and an update to the original node on a detail I forgot to mention, which may complicate it further.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Those bounds (or perhaps the near lack of them:) will require some optimisations to be applied in order to achieve reasonable runtimes--depending upon what you designate as reasonable?

      One difficulty with looking at a problem that is as open spec'd as this is knowing where to concetrate your efforts. Do you have any "typical" samples of the datastreams that you could make accessible? It's an interesting problem that I'd enjoy pursuing a bit further.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        I'll try and post something here tomorrow, under appropriate readmore tags...

        Update: It's tomorrow, and here's a toy example:

        and a more realistic piece (200 lines):

        The full file is 400MB -- I don't know where I would upload that, or if it would matter.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of