comment on

If I've understood the problem, this is much less complex than the LZ case as your looking to replace sequential repeats rather sub-sequences anywhere in the string.

I won't swear the results of the below are optimal in all cases-- that's a huge task to verify, and damned hard to analyse--but on the various tests I've run it seems to a fair job pretty quickly.

 
#! perl -slw 
use strict; 

our $L ||= 200; 
our $SEED ||= 1; 
our $RANGE ||= "'a'..'c'"; 

#srand $SEED; 

my @chars = eval $RANGE; 
my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; 
my @rls; use constant { BASESEQ => 0, SAVED => 1, OFFSET => 2, LENGTH 
+=> 3, REPEATS => 4 }; 

## Find runs an record them 
push @rls, [ 
    $1, 
    length $2, 
    pos($s) - length( $1 ), 
    length( $1 ) + length( $2 ), 
    length( $2 ) / length( $1 ), 
] while $s =~ m[(.+)(?=(\1{1,254}))]g; 

## Eliminate overlaps deleting the 
## lesser saving in each case. 

for my $p ( 0 .. $#rls ) { 
    next unless defined $rls[ $p ]; 
    for my $q ( $p+1 .. $#rls ) { 
        next unless defined $rls[ $p ] 
            and defined $rls[ $q ]; 
        last if $rls[ $q ][ OFFSET ] 
              > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; 
              
           $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
        && $rls[ $q ][ OFFSET ] <  $rls[ $p ][ OFFSET ] + $rls[ $p ][ 
+LENGTH ] 
        and delete $rls[ 
            $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] 
            ? $p 
            : $q 
        ]; 
    }
} 

## Remove the overlaps 
@rls = grep defined, @rls; 

## Display the results 
print $s; 
print ' ' x $_->[ OFFSET ], uc( $_->[ BASESEQ ] ), $_->[ BASESEQ ] x $
+_->[ REPEATS ] for @rls; 

__END__
[download]

Some results (I used an input range of 'a'..'b' so as to produce a large number of overlapps for testing).

 
[ 2:44:41.87] P:\test>390333 -L=70 -RANGE='a'..'b' 
abaabbaaaabbaaaabaabaabaaaababaababbbbbaababbaabbaaaaabbbbaabbaababbba
+ 
AABBAAaabbaa AABaabaab AABABaabab Bbb ABBAabba Aaa BBAAbbaa Bbb 

[ 2:44:42.69] P:\test>390333 -L=70 -RANGE='a'..'b' 
abbbaaabbaabbbbbbbbbaaabbaaaaabbbabaaaabaaababaababbabbaaaaaabbbbbabbb
+ 
Bbb AABBaabb BBAAAbbaaa Aa Bbb AABABaabab Bb AAAaaa BBbb Bbb 

[ 2:44:43.55] P:\test>390333 -L=70 -RANGE='a'..'b' 
aabbbaaaabaabbbbbbbabaaabaaaaaabaabbbabaabbaabbaababbaaabaabaaabbababb
+ 
Aa Bbb AABaab BBbb BAAAbaaa AABaab BAABbaabbaab Bb AABaab Aaa BAba Bb 

[ 2:44:44.37] P:\test>390333 -L=70 -RANGE='a'..'b' 
aabbbbaaabbbabababbbabbbbbababababbaabaabaaaabaababbaabbabbabbbbaabbbb
+ 
Aa BBbb Aaa BAbaba BBbb BABAbaba BAAbaabaa AABaab ABBabbabb Bb Aa BBbb
[download]

UPDATED: Here's a somewhat improved version that recursively processes the string so that sub-runs rejected because they overlapped are reprocessed.

 
#! perl -slw 
use strict; 
use constant { 
    BASESEQ => 0, SAVED => 1, OFFSET => 2, 
    LENGTH => 3, REPEATS => 4 
}; 

our $L ||= 200; 
our $RANGE ||= "'a'..'c'"; 

#srand $SEED if $SEED; 

sub findRuns { 
    my $s = shift; 
    my @rls; 
    push @rls, [ 
        $1, 
        length $2, 
        pos($s) - length( $1 ), 
        length( $1 ) + length( $2 ), 
        length( $2 ) / length( $1 ) + 1, 
    ] while $s =~ m[ 
        .*? 
        ( [a-z]+ ) 
        (?= ( \1{1,254} ) 
        [^)]* 
        (?: \( | $ ) ) 
    ]gx; 
    
    for my $p ( 0 .. $#rls ) { 
        next unless defined $rls[ $p ]; 
        for my $q ( $p+1 .. $#rls ) { 
            next unless defined $rls[ $p ] and defined $rls[ $q ]; 
            last if $rls[ $q ][ OFFSET ] 
                  > $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ]; 
                  
            while( 
                $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
                && $rls[ $q ][ OFFSET ] 
                 < $rls[ $p ][ OFFSET ] + $rls[ $p ][ LENGTH ] 
            ) { 
                if( $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] ) { 
                    last unless $rls[ $p ][ REPEATS ]-- > 2; 
                    $rls[ $p ][ REPEATS ] -= length $rls[ $p ][ BASESE
+Q ]; 
                } 
                else { 
                    last unless $rls[ $q ][ REPEATS ]-- > 2; 
                    $rls[ $q ][ REPEATS ] -= length $rls[ $q ][ BASESE
+Q ]; 
                    $rls[ $q ][ OFFSET  ] += length $rls[ $q ][ BASESE
+Q ]; 
                } 
            } 
            
               $rls[ $q ][ OFFSET ] >= $rls[ $p ][ OFFSET ] 
            && $rls[ $q ][ OFFSET ] <  $rls[ $p ][ OFFSET ] 
                                    +  $rls[ $p ][ LENGTH ] 
            and delete $rls[ 
                $rls[ $p ][ SAVED ] < $rls[ $q ][ SAVED ] 
                ? $p 
                : $q 
            ]; 
        } 
    } 
    @rls = grep defined, @rls; 
    
    for ( @rls ) { 
        my $re = qr[(?:$_->[ BASESEQ ]){$_->[ &REPEATS ]}]; 
        $s =~ s[ ( (?: ^ | \) ) [^(]*? ) $re] 
               { ($1||'') . '(' . $_->[ REPEATS ] . ":$_->[ BASESEQ ])
+" }ex; 
    } 
               
    $s = findRuns( $s ) if $s =~ m[ 
        (?: ^ | \) ) [^(]*? ( [a-z]+ ) \1 [^)]*? (?: \( | $ ) 
    ]x; 
    return $s; 
} 

my @chars = eval $RANGE; 
my $s = join'', map{ ( @chars )[ rand @chars ] } 1 .. $L; 
print scalar localtime, ' : ', $s, $/; 
my $encoded = findRuns $s; 
print scalar localtime, ' : ', $encoded, $/; 
$encoded =~ s[\((\d+):([^)]+)\)][$2 x $1]eg; 
print scalar localtime, ' : ', $encoded, $/;
[download]

Some output showing the encoded and decoded versions match:

 
[ 6:10:08.93] P:\test>390333 -L=70 -RANGE='a'..'b' 

Sun Sep 12 06:11:42 2004 : 
aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab
+ 

Sun Sep 12 06:11:42 2004 : 
(2:a)(2:b)(2:baba)(2:a)b(3:a)(3:ab)ba(2:baaa)ba(2:abb)(3:ba)b(2:aabba)
+a(3:b)ab(2:aa)b 

Sun Sep 12 06:11:42 2004 : 
aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab
+ 

[ 6:11:42.57] P:\test>390333 -L=70 -RANGE='a'..'b' 
Sun Sep 12 06:11:46 2004 : 
bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa
+ 

Sun Sep 12 06:11:46 2004 : 
(2:b)(3:a)(2:abb)b(3:ba)a(3:b)(2:a)(2:ababb)b(2:bbbba)b(2:a)(2:aab)ab(
+2:a)(2:aabb)b(3:a) 

Sun Sep 12 06:11:46 2004 : 
bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa
[download]

A longer (10,000 random 'a's & 'b's) string processed in around 11 seconds.

Funky formatting courtesy of PM's codewrap "feature"!

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: String Compression Optimization (repeated concatenated subsequences) by BrowserUk
in thread String Compression Optimization (repeated concatenated subsequences) by QM

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.