in reply to Characters in disguise

Assumption: there is only one 'mapping' in each string.

Remove the leading characters that match, then remove the trailing characters that match. What's left is the difference. If my assumption is wrong, this won't do it for you.
use strict;
use warnings;

binmode(DATA, ":encoding(UTF-8)");

my %normalize;

while (<DATA>) {
    my ($string, $normalized) = split;

    # convert to arrays (avoid unicode issues?)
    my @string     = $string     =~ m/\X/g;
    my @normalized = $normalized =~ m/\X/g;

    # skip matching the beginning chars
    while (@string and @normalized and $string[0] eq $normalized[0]) {
        shift @string;
        shift @normalized;
    }

    # skip matching end chars
    while (@string and @normalized and $string[-1] eq $normalized[-1]) {
        pop @string;
        pop @normalized;
    }

    my $key = join("", @string);
    $normalize{$key} = join("", @normalized);
    print "'$key' => '$normalize{$key}'\n";
}

__DATA__
ABCÅD ABCD
ABCÄD ABCëëD
ABCááD ABCèD
Produces the results:
'Å' => ''
'Ä' => 'ëë'
'áá' => 'è'

Re^2: Characters in disguise
by Anonymous Monk on Jun 01, 2006 at 22:35 UTC
    Ah, a good idea. But unfortunately the strings can be more complicated. For example, a string could look like
    "ABCÅDEFÄGHI"
    or even tricky combinations like
    "ABCÅÄ"
    where the analyzer cannot be sure whether it should be interpreted as
    {'Å'=>'ë', 'Ä'=>'ë'} or 
    {'Å'=>'', 'Ä'=>'ëë'} or 
    {'Å'=>'ëë', 'Ä'=>''}
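
To make the ambiguity concrete, here is a minimal sketch that enumerates every way the leftover normalized text could be shared out between the odd characters. It assumes the normalized counterpart of "ABCÅÄ" is "ABCëë" (implied by the interpretations above, but not given), that each odd character maps on its own, and that order is preserved. The helper name `splits` is made up for this illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 source text
binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: return every way to cut @pieces into $n consecutive
# chunks, where a chunk may be empty. Each result is an array ref of $n strings.
sub splits {
    my ($n, @pieces) = @_;
    return ([ join("", @pieces) ]) if $n == 1;

    my @result;
    for my $take (0 .. scalar @pieces) {
        # the first chunk takes $take pieces, the rest is split recursively
        my $head = join("", @pieces[0 .. $take - 1]);
        my @rest = @pieces[$take .. $#pieces];
        push @result, [ $head, @$_ ] for splits($n - 1, @rest);
    }
    return @result;
}

# What is left after stripping the common prefix "ABC" (assumed data)
my @odd        = ('Å', 'Ä');
my @normalized = ('ë', 'ë');

for my $split (splits(scalar @odd, @normalized)) {
    my %map;
    @map{@odd} = @$split;
    print join(", ", map { "'$_' => '$map{$_}'" } @odd), "\n";
}
```

For two odd characters and two leftover characters this yields exactly the three interpretations listed above; the count grows quickly as either side gets longer.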
    

    I think this kind of ambiguity has to be resolved, if at all possible, when similar strings with just one "interpretation" are encountered.
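
One way that resolution might look, as a rough sketch only: keep the mappings learned from unambiguous strings in a hash, and throw away every candidate interpretation that contradicts it. The `%known` entry below reuses the 'Ä' => 'ëë' result from the parent post; the helper name `consistent` and the hard-coded candidate list are assumptions made for this illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 source text
binmode(STDOUT, ":encoding(UTF-8)");

# Mappings assumed to have been learned from strings with only one interpretation
my %known = ('Ä' => 'ëë');

# The candidate interpretations of the ambiguous string "ABCÅÄ"
my @candidates = (
    { 'Å' => 'ë',  'Ä' => 'ë'  },
    { 'Å' => '',   'Ä' => 'ëë' },
    { 'Å' => 'ëë', 'Ä' => ''   },
);

# Hypothetical helper: true unless some key of the candidate is already
# known with a different replacement.
sub consistent {
    my ($candidate, $known) = @_;
    for my $char (keys %$candidate) {
        next unless exists $known->{$char};
        return 0 if $known->{$char} ne $candidate->{$char};
    }
    return 1;
}

my @surviving = grep { consistent($_, \%known) } @candidates;

if (@surviving == 1) {
    # exactly one interpretation is left, so it can be recorded
    my $map = $surviving[0];
    print "'$_' => '$map->{$_}'\n" for sort keys %$map;
}
else {
    print scalar(@surviving), " interpretations remain\n";
}
```

With that single known mapping only the second candidate survives. If nothing is known yet about either character, all candidates remain and the string would have to be set aside until more evidence turns up.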

    /L