Characters in disguise

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have two columns in a database, let's call them STRING and NORMALIZED_STRING. The characters in STRING are processed by some to me unknown algorithm, to become the characters in the normalized form.

A character can consist of one or more bytes.
A character can be changed in one of several ways:
   1. it can simply be deleted
   2. a multi byte character can be mapped to a single byte character
   3. a single byte character can be mapped to a multibyte character

an example.

STRING   NORMALIZED_STRING
------   -----------------
ABCÅD    ABCD
ABCÄD    ABCëëD
ABCááD   ABCèD

What I want to do is to be able to recreate the same kind of normalization myself. Now, of course I could go through the thousands of rows manually and try to find out how it works. But I am hoping it could be possible to let Perl do the analyzing and by comparing the strings come up with some mapping scheme:

'Å'  => '',
'Ä'  => 'ëë',
'áá' => 'è'

I would be very happy if this was possible!
Thanks in advance for your insights,

Comment on Characters in disguise

Replies are listed 'Best First'.
Re: Characters in disguise by GrandFather (Saint) on Jun 01, 2006 at 22:57 UTC
I'm sure there are many nasty traps here, but you could take a look at Text::Diff. use strict; use warnings; use Algorithm::Diff qw(diff); my @strs = ( ['ABCÅD', 'ABCD'], ['ABCÄD', 'ABCëëD'], ['ABCááD', 'ABCèD'], ); my %xlate; for (@strs) { my ($org, $mod) = @$_; my @orgSeq = split //, $org; my @modSeq = split //, $mod; my (@diffs) = diff (\@orgSeq, \@modSeq); for my $diff (@diffs) { my @sub = grep {$_->[0] eq '-'} @$diff; my @add = grep {$_->[0] eq '+'} @$diff; next if ! @sub; # Bogus added characters $xlate{join '', map {$_->[2]} @sub} = (join '', map {$_->[2]} +@add) \|\| ''; } } print "'$_' => '$xlate{$_}'\n" for sort keys %xlate; [download] Prints: `'Ä' => 'ëë' 'Å' => '' 'áá' => 'è'` [download] Update: changed bogus `use Text::Diff` to `use Algorithm::Diff` and clean up code to suit. DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Characters in disguise by Anonymous Monk on Jun 01, 2006 at 23:24 UTC
Splendid! This seems to be the solution. If not a final solution, so very close. I notice we don't need Text::Diff as your code is just calling Algorithm::Diff::diff(). So I replaced use Text::Diff with use Algorithm::Diff. Thank you, and also thanks to tye for his reply and for the module Algorithm::Diff. /L	[reply]
Re^3: Characters in disguise by GrandFather (Saint) on Jun 01, 2006 at 23:28 UTC
Sigh. I started looking at Text::DIff, then decided it wasn't what was needed so I changed the code to use Algorithm::Diff without remembering to change the use :(. Glad it helped. I've learned something getting it going anyway. :) DWIM is Perl's answer to Gödel	[reply]
Re: Characters in disguise by ruzam (Curate) on Jun 01, 2006 at 22:18 UTC
Assumption: there is only one 'mapping' in each string. Remove the leading characters that match, remove the trailing characters that match. What's left is the difference. If my assumption is wrong, then this won't do it for you. use strict; use warnings; binmode(DATA, ":encoding(UTF-8)"); my %normalize; while (<DATA>) { my ($string, $normalized) = split; # convert to arrays (avoid unicode issues?) my @string = $string =~ m/\X/g; my @normalized = $normalized =~ m/\X/g; # skip matching the beginning chars while (@string and @normalized and $string[0] eq $normalized[0]) { shift @string; shift @normalized; } # skip matching end chars while (@string and @normalized and $string[-1] eq $normalized[-1]) { pop @string; pop @normalized; } my $key = join("", @string); $normalize{$key} = join("", @normalized); print "'$key' => '$normalize{$key}'\n"; } __DATA__ ABCÅD ABCD ABCÄD ABCëëD ABCááD ABCèD [download] Produces the results: `'Å' => '' 'Ä' => 'ëë' 'áá' => 'è'` [download]	[reply] [d/l] [select]
Re^2: Characters in disguise by Anonymous Monk on Jun 01, 2006 at 22:35 UTC
Ah, a good idea. But unfortunatly the strings can be more complicated. For example a string could look like "ABCÅDEFÄGHI" or even tricky combinations like "ABCÅÄ" where the analyzer can not be sure if it should be interpreted as {'Å'=>'ë', 'Ä'=>'ë'} or {'Å'=>'', 'Ä'=>'ëë'} or {'Å'=>'ëë', 'Ä'=>''} If at all possible I think this kind of ambiguity has to be resolved when encountering similar strings with just one "interpretation" /L	[reply]
Re: Characters in disguise (diff) by tye (Sage) on Jun 01, 2006 at 22:11 UTC
You could get that kind of data from Algorithm::Diff. Split each string into an array of bytes before passing it in. Watch for multiple adjacent replacements which will appear as single replacements, of course. - tye	[reply]
Re: Characters in disguise by samtregar (Abbot) on Jun 01, 2006 at 21:43 UTC
Sounds tough. You might try a genetic programming approach, breeding programs which apply transformations until one can reproduce the original STRING to NORMALIZED_STRING mapping. That's a lot easier said than done though, and it would be extra tough to do it with Perl. A code-is-data language like Scheme is a more natural fit. You might also look at how String::Approx works. It solves a similar problem, although I don't think you'll be able to use it directly. -sam	[reply]