baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Given a set of strings S of fixed length |S|=l and alphabet = {a,b,c} such that:
aaaaaaaaabababacbbbbbbbaccaabc aaaaaaaaaaaaaaaaacbbbbbbbaaaca aaaaaaaaabaaaaabbaaaaabaaccccc
what transforms can i apply to homogenize them by either:

a) grouping same characters together within each string
b) grouping same characters together across all strings (e.g. have most aaa's in one, bbb's in the other, etc. - of course, sorting does not count as the number of character occurrences does not change per string)
c) transform (map) characters e.g. a-> b
d) something else

However, conditions are:
a) the size of the original input data cannot be smaller than its transform, meaning the index needs to be implicitly built into the data (in the same way BWT does it).
b) Types of queries that need to be supported:
What is the character on i-th position of the original string. where 0<=i<=l.

Some of the obvious solutions are:

a) BWT
b) BWT using all strings
c) replace a with b and record coordinates using either bitstrings or ints (but this violates the size condition unless there is a smart way to record positions within the strings itself somehow)
d) something else

Has anyone encountered this problem before? How did you solve it?

thnx

Replies are listed 'Best First'.
Re: Problems with strings
by QM (Parson) on May 20, 2019 at 10:55 UTC
    Define homogenize?

    Why is "Types of queries..." under "conditions"? Surely that's a feature?

    I (we?) have no idea what you have, what you want, or where you need to get to.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Example: Heterogenic string:
      ababababababaab
      Homogenic string (after BWT (ref: Perl impl of bwt))
      bbbbbbbaaaaaaaa
      same characters stick together! I have not added any extra information to the initial string and yet have an encoder that homogenizes the initial string. So what other transforms you now of, that can be used to achieve similar behaviour
Re: Problems with strings
by AnomalousMonk (Archbishop) on May 20, 2019 at 16:06 UTC
    What is the character on i-th position of the original string. where 0<=i<=l.

    I would have said this would be substr, except that i (0-based character position index(?)) may equal l (length of string(?)); I don't understand the last bit.


    Give a man a fish:  <%-{-{-{-<

      BWT = Burrows-Wheeler_transform is a way to provide lossless compression by using the redundancy of repeated passages (like syllables in speech)

      My guess is the OP is still trying to defy entropy and to improve the compression of his last problem.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Problems with strings
by thanos1983 (Parson) on May 20, 2019 at 13:26 UTC

    Hello baxy77bax,

    If I understand correctly your question and desired output one possible way is:

    Regarding your second question on how to repeat the process for multiple strings (simply concatenate the strings):

    Update: Last part (replace character in a string):

    Hope this helps, Thanos.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Problems with strings
by johngg (Canon) on May 20, 2019 at 15:53 UTC

    I'm not at all sure I understand your requirement. The query you mention suggests that the easiest thing to do would be to keep the original string along with any transformations. Here's my stab at it.

    use 5.026; use warnings; use Data::Dumper; my @strings = qw{ aaaaaaaaabababacbbbbbbbaccaabc aaaaaaaaaaaaaaaaacbbbbbbbaaaca aaaaaaaaabaaaaabbaaaaabaaccccc }; homogenise( \ @strings ); print Data::Dumper->Dumpxs( [ \ @strings ], [ qw{ *strings } ] ); sub homogenise { my $raStrings = shift; my $len = length $raStrings->[ 0 ]; foreach my $string ( @{ $raStrings } ) { $string = { original => $string }; $string->{ byLine } = join q{}, sort split m{}, $string->{ original }; } my $allSorted = join q{}, sort map { split m{}, $_->{ original } } @{ $raStrings }; my $idx = 0; $raStrings->[ $idx ++ ]->{ bySet } = $_ for unpack qq{(a$len)*}, $allSorted; }

    The output.

    @strings = ( { 'bySet' => 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'original' => 'aaaaaaaaabababacbbbbbbbaccaabc', 'byLine' => 'aaaaaaaaaaaaaaabbbbbbbbbbbcccc' }, { 'byLine' => 'aaaaaaaaaaaaaaaaaaaaabbbbbbbcc', 'original' => 'aaaaaaaaaaaaaaaaacbbbbbbbaaaca', 'bySet' => 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbb' }, { 'byLine' => 'aaaaaaaaaaaaaaaaaaaaabbbbccccc', 'bySet' => 'bbbbbbbbbbbbbbbbbbbccccccccccc', 'original' => 'aaaaaaaaabaaaaabbaaaaabaaccccc' } );

    I hope that this is vaguely in the direction you are wanting to go.

    Cheers,

    JohnGG