You got the output example backwards.
Line one
Line two
Line five
reduces to:
Line_o
Line_t
Line_f
and:
Lines end here not
Lines end here
Lines end here too
reduces to:
L1
L2
L3
The script is intended to create the shortest possible unique string, with a given maximum number of characters from the beginning of the string. When two strings are identical, a number must be appended and since the text is of no use in distinguishing those strings, and the shortest possible unique string contains just one (actually, zero) characters, that's how many characters are kept.
Whether sticking to that produces the most useful function is another issue, but it does produce the shortest string and puts value on the inclusion of text when the text isn't identical.
I will be adding an option to have identical text that's appended with a number contain the maximum number of characters instead of just one.
| [reply] |
You got the output example backwards.
So I did. Ack! But you got the jist of my question.
Whether sticking to that produces the most useful function is another issue, but it does produce the shortest string and puts value on the inclusion of text when the text isn't identical.
Ah. Interesting. It sounds like there is essentially a trade-off to be made, between shortest possible and inclusion of meaningful text. I guess I would tend to weight the latter more heavily than you did. Either that, or not at all, which is (I believe) what Text::Abbrev does.
Here's my solution which gives absolute priority to inclusion of meaningful text. It doesn't maximize compression of strings such as xxxxxxxxxx the way yours does, but I'm viewing those as unlikely input. For "meaningful" text such as Four score..., the results are satisfactory (IMHO).
my @values = <DATA>;
print "\nValues:\n", @values;
chomp @values;
sub max_len
{
my $len = -1;
for ( @_ )
{
$len < length $_ and
$len = length $_;
}
$len
}
sub min
{
$_[0] < $_[1] ? $_[0] : $_[1]
}
my $max_length = min( 10, max_len(@values) );
# normally, we only want to allow valid ID characters:
my @idents = map { s/\W+/_/g; $_ } @values;
print "\nIdentifiers (raw):\n", map "$_\n", @idents;
print "\n\n\n\n";
my %mapping; # contains the final result
{ # this is the part that does the real work:
my %h;
for ( @idents )
{
push @{ $h{ substr $_, 0, $max_length } }, $_;
}
for ( keys %h )
{
my $n = @{ $h{$_} };
my $n_width = length $n;
if ( $n == 1 )
{
$mapping{$_} = $h{$_}[0];
}
else
{
my $prefix = substr $_, 0, $max_length - $n_width;
my $oneup;
for ( @{ $h{$_} } )
{
my $k;
do {
$k = $prefix . sprintf '%0'.$n_width.'d', ++$oneup;
} while exists $mapping{$k};
$mapping{$k} = $_;
}
}
}
} # end of main algorithm.
for ( sort keys %mapping )
{
print "$_ => '$mapping{$_}'\n";
}
__DATA__
line one
line two
lines end here not
lines end here too
yyyyyyy
yyyyyyy
xxxxxxxxxx
xxxxxxxxxx
zzzzzzzzzzzzz
zzzzzzzzzzzzz
r
r
rr
rr
r1
r1
r12
r12
Four score and seven years ago, our fathers...
Four score and seven years ago, our mothers...
Four score and seven years ago, our brothers...
| [reply] [d/l] [select] |