Re: String Compression Optimization (repeated concatenated subsequences)

If I've understood the problem, this is much less complex than the LZ case as your looking to replace sequential repeats rather sub-sequences anywhere in the string.

I won't swear the results of the below are optimal in all cases-- that's a huge task to verify, and damned hard to analyse--but on the various tests I've run it seems to a fair job pretty quickly.

 
#! perl -slw 
use strict; 

our $L ||= 200; 
our $SEED ||= 1; 
our $RANGE ||= "'a'..'c'"; 

#srand $SEED; 

my @chars = eval $RANGE; 
my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; 
my @rls; use constant { BASESEQ => 0, SAVED => 1, OFFSET => 2, LENGTH 
+=> 3, REPEATS => 4 }; 

## Find runs an record them 
push @rls, [ 
    $1, 
    length $2, 
    pos($s) - length( $1 ), 
    length( $1 ) + length( $2 ), 
    length( $2 ) / length( $1 ), 
] while $s =~ m[(.+)(?=(\1{1,254}))]g; 

## Eliminate overlaps deleting the 
## lesser saving in each case. 

for my $p ( 0 .. $#rls ) { 
    next unless defined $rls[ $p ]; 
    for my $q ( $p+1 .. $#rls ) { 
        next unless defined $rls[ $p ] 
            and defined $rls[ $q ]; 
        last if $rls[ $q ][ OFFSET ] 
              > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; 
              
           $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
        && $rls[ $q ][ OFFSET ] <  $rls[ $p ][ OFFSET ] + $rls[ $p ][ 
+LENGTH ] 
        and delete $rls[ 
            $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] 
            ? $p 
            : $q 
        ]; 
    }
} 

## Remove the overlaps 
@rls = grep defined, @rls; 

## Display the results 
print $s; 
print ' ' x $_->[ OFFSET ], uc( $_->[ BASESEQ ] ), $_->[ BASESEQ ] x $
+_->[ REPEATS ] for @rls; 

__END__
[download]

Some results (I used an input range of 'a'..'b' so as to produce a large number of overlapps for testing).

 
[ 2:44:41.87] P:\test>390333 -L=70 -RANGE='a'..'b' 
abaabbaaaabbaaaabaabaabaaaababaababbbbbaababbaabbaaaaabbbbaabbaababbba
+ 
AABBAAaabbaa AABaabaab AABABaabab Bbb ABBAabba Aaa BBAAbbaa Bbb 

[ 2:44:42.69] P:\test>390333 -L=70 -RANGE='a'..'b' 
abbbaaabbaabbbbbbbbbaaabbaaaaabbbabaaaabaaababaababbabbaaaaaabbbbbabbb
+ 
Bbb AABBaabb BBAAAbbaaa Aa Bbb AABABaabab Bb AAAaaa BBbb Bbb 

[ 2:44:43.55] P:\test>390333 -L=70 -RANGE='a'..'b' 
aabbbaaaabaabbbbbbbabaaabaaaaaabaabbbabaabbaabbaababbaaabaabaaabbababb
+ 
Aa Bbb AABaab BBbb BAAAbaaa AABaab BAABbaabbaab Bb AABaab Aaa BAba Bb 

[ 2:44:44.37] P:\test>390333 -L=70 -RANGE='a'..'b' 
aabbbbaaabbbabababbbabbbbbababababbaabaabaaaabaababbaabbabbabbbbaabbbb
+ 
Aa BBbb Aaa BAbaba BBbb BABAbaba BAAbaabaa AABaab ABBabbabb Bb Aa BBbb
[download]

UPDATED: Here's a somewhat improved version that recursively processes the string so that sub-runs rejected because they overlapped are reprocessed.

 
#! perl -slw 
use strict; 
use constant { 
    BASESEQ => 0, SAVED => 1, OFFSET => 2, 
    LENGTH => 3, REPEATS => 4 
}; 

our $L ||= 200; 
our $RANGE ||= "'a'..'c'"; 

#srand $SEED if $SEED; 

sub findRuns { 
    my $s = shift; 
    my @rls; 
    push @rls, [ 
        $1, 
        length $2, 
        pos($s) - length( $1 ), 
        length( $1 ) + length( $2 ), 
        length( $2 ) / length( $1 ) + 1, 
    ] while $s =~ m[ 
        .*? 
        ( [a-z]+ ) 
        (?= ( \1{1,254} ) 
        [^)]* 
        (?: \( | $ ) ) 
    ]gx; 
    
    for my $p ( 0 .. $#rls ) { 
        next unless defined $rls[ $p ]; 
        for my $q ( $p+1 .. $#rls ) { 
            next unless defined $rls[ $p ] and defined $rls[ $q ]; 
            last if $rls[ $q ][ OFFSET ] 
                  > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; 
                  
            while( 
                $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
                && $rls[ $q ][ OFFSET ] 
                 < $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ] 
            ) { 
                if( $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] ) { 
                    last unless $rls[ $p ][ REPEATS ]-- > 2; 
                    $rls[ $p ][ REPEATS ] -= length $rls[ $p ][ BASESE
+Q ]; 
                } 
                else { 
                    last unless $rls[ $q ][ REPEATS ]-- > 2; 
                    $rls[ $q ][ REPEATS ] -= length $rls[ $q ][ BASESE
+Q ]; 
                    $rls[ $q ][ OFFSET  ] += length $rls[ $q ][ BASESE
+Q ]; 
                } 
            } 
            
               $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
            && $rls[ $q ][ OFFSET ] <  $rls[ $p ][ OFFSET ] 
                                    +  $rls[ $p ][ LENGTH ] 
            and delete $rls[ 
                $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] 
                ? $p 
                : $q 
            ]; 
        } 
    } 
    @rls = grep defined, @rls; 
    
    for ( @rls ) { 
        my $re = qr[(?:$_->[ BASESEQ ]){$_->[ &REPEATS ]}]; 
        $s =~ s[ ( (?: ^ | \) ) [^(]*? ) $re] 
               { ($1||'') . '(' . $_->[ REPEATS ] . ":$_->[ BASESEQ ])
+" }ex; 
    } 
               
    $s = findRuns( $s ) if $s =~ m[ 
        (?: ^ | \) ) [^(]*? ( [a-z]+ ) \1 [^)]*? (?: \( | $ ) 
    ]x; 
    return $s; 
} 

my @chars = eval $RANGE; 
my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; 
print scalar localtime, ' : ', $s, $/; 
my $encoded = findRuns $s; 
print scalar localtime, ' : ', $encoded, $/; 
$encoded =~ s[\((\d+):([^)]+)\)][$2 x $1]eg; 
print scalar localtime, ' : ', $encoded, $/;
[download]

Some output showing the encoded and decoded versions match:

 
[ 6:10:08.93] P:\test>390333 -L=70 -RANGE='a'..'b' 

Sun Sep 12 06:11:42 2004 : 
aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab
+ 

Sun Sep 12 06:11:42 2004 : 
(2:a)(2:b)(2:baba)(2:a)b(3:a)(3:ab)ba(2:baaa)ba(2:abb)(3:ba)b(2:aabba)
+a(3:b)ab(2:aa)b 

Sun Sep 12 06:11:42 2004 : 
aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab
+ 

[ 6:11:42.57] P:\test>390333 -L=70 -RANGE='a'..'b' 
Sun Sep 12 06:11:46 2004 : 
bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa
+ 

Sun Sep 12 06:11:46 2004 : 
(2:b)(3:a)(2:abb)b(3:ba)a(3:b)(2:a)(2:ababb)b(2:bbbba)b(2:a)(2:aab)ab(
+2:a)(2:aabb)b(3:a) 

Sun Sep 12 06:11:46 2004 : 
bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa
[download]

A longer (10,000 random 'a's & 'b's) string processed in around 11 seconds.

Funky formatting courtesy of PM's codewrap "feature"!

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Comment on Re: String Compression Optimization (repeated concatenated subsequences) Select or Download Code

Replies are listed 'Best First'.
Re^2: String Compression Optimization (repeated concatenated subsequences) by QM (Parson) on Sep 14, 2004 at 01:10 UTC
Thanks for your code. It seems to work as advertised, but I'll have to run more tests ;) I've listed some bounds above, and an update to the original node on a detail I forgot to mention, which may complicate it further. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^3: String Compression Optimization (repeated concatenated subsequences) by BrowserUk (Patriarch) on Sep 14, 2004 at 01:16 UTC
Those bounds (or perhaps the near lack of them:) will require some optimisations to be applied in order to achieve reasonable runtimes--depending upon what you designate as reasonable? One difficulty with looking at a problem that is as open spec'd as this is knowing where to concetrate your efforts. Do you have any "typical" samples of the datastreams that you could make accessible? It's an interesting problem that I'd enjoy pursuing a bit further. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^4: String Compression Optimization (repeated concatenated subsequences) by QM (Parson) on Sep 14, 2004 at 01:33 UTC
I'll try and post something here tomorrow, under appropriate readmore tags... Update: It's tomorrow, and here's a toy example: Read more... (1119 Bytes) and a more realistic piece (200 lines): Read more... (27 kB) The full file is 400MB -- I don't know where I would upload that, or if it would matter. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l] [select]
Re^5: String Compression Optimization (repeated concatenated subsequences) by BrowserUk (Patriarch) on Sep 14, 2004 at 05:59 UTC
Re^5: String Compression Optimization (repeated concatenated subsequences) by BrowserUk (Patriarch) on Sep 14, 2004 at 20:07 UTC
Re^5: String Compression Optimization (repeated concatenated subsequences) by Beechbone (Friar) on Sep 15, 2004 at 09:57 UTC