Howdy!

I have a specialized character encoding scheme that I am trying to get working with Encode. I'm so close, and yet not quite there. The encoding is known as "Daud" by its users.

The encoding scheme

The objective is to represent accented characters (primarily from Latin-1, but also from a number of other code pages) in ASCII in a lossless manner. Each non-ASCII character is encoded as a one- or two-character string within braces. The characters (two in nearly every case) have mnemonic value.

For example, 'LATIN CAPITAL LETTER A WITH CIRCUMFLEX', U+00C2, is encoded as {A^}. The encoding is typically the base letter plus a modifier. Just to be different, 'LATIN SMALL LETTER DOTLESS I', U+0131, is encoded as {i}. Hold that thought.
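To make the scheme concrete, here is a minimal pure-Perl sketch of it. It does not use Encode or the compiled Daud module. The two table entries come from the examples above; the names %daud_for, %char_for, to_daud and from_daud are hypothetical, and replacing unmapped characters with '?' is an assumption borrowed from the <subchar> setting in the .ucm file.

```perl
use strict;
use warnings;

# Illustrative subset only: the real Daud table has several hundred entries.
# Keys are Unicode code points, values are the mnemonics without braces.
my %daud_for = (
    0x00C2 => 'A^',    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    0x0131 => 'i',     # LATIN SMALL LETTER DOTLESS I
);

# Reverse table: mnemonic -> Unicode character.
my %char_for = map { $daud_for{$_} => chr($_) } keys %daud_for;

# Unicode -> Daud. ASCII passes through unchanged, mapped characters
# become {mnemonic}, anything else becomes '?' (assumed substitution).
sub to_daud {
    my ($text) = @_;
    my $out = '';
    for my $ch (split //, $text) {
        my $cp = ord $ch;
        if    ($cp < 0x80)            { $out .= $ch }
        elsif (exists $daud_for{$cp}) { $out .= '{' . $daud_for{$cp} . '}' }
        else                          { $out .= '?' }
    }
    return $out;
}

# Daud -> Unicode. Known {x} and {xy} sequences are replaced;
# unknown ones are left untouched.
sub from_daud {
    my ($text) = @_;
    $text =~ s/\{([^{}]{1,2})\}/exists $char_for{$1} ? $char_for{$1} : '{' . $1 . '}'/ge;
    return $text;
}

my $unicode = "\x{C2}ge, \x{131}";
my $daud    = to_daud($unicode);
my $back    = from_daud($daud);

print "$daud\n";
print $back eq $unicode ? "round trip ok\n" : "round trip FAILED\n";
```

The sketch treats a literal '{' in the input as ordinary ASCII, so it only round-trips text that contains no bare braces.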

The lone problem

{i} does not properly get translated into U+0131, although it goes the other way just fine.

What I did

I generated a .ucm file. I started with 8859-1 and hacked. The file looks like:

    <code_set_name> "daud"
    <mb_cur_min> 1
    <mb_cur_max> 4
    <subchar> \x3F
    CHARMAP
    ...
    #<U007B> \x7B |0 # LEFT CURLY BRACKET
    <U007C> \x7C |0 # VERTICAL LINE
    #<U007D> \x7D |0 # RIGHT CURLY BRACKET
    ...
    <U00C2> \x7B\x41\x5E\x7D |0 # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    ...
    <U0131> \x7B\x69\x7D |0 # LATIN SMALL LETTER DOTLESS I
    ...
    END CHARMAP

I found that I had to comment out at least the left curly bracket to make this work at all. That's fine, as it's not an independently valid character in text in this encoding.

I wrote a test file that exercises each and every character, converting the Unicode character into the Daud equivalent, and then doing a round-trip Unicode -> Daud -> Unicode. The test file passes 665/666. The only test case that is failing is the round trip for {i}.

    my $string = decode("daud", encode("daud", $tests{$name}->{unicode}));

leaves $string empty.
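A round-trip check of this kind can be sketched as below. It is not the actual test file. The helper names round_trip_failures and code_points are hypothetical, and the %tests layout is only inferred from the one-liner above. Passing FB_CROAK is an assumption on my part, so that unmappable characters die loudly instead of being substituted. The demo uses 'UTF-8' as a stand-in because the compiled "daud" encoding is not available outside my setup.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render a string as a list of code points, e.g. "U+0131".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

# Hypothetical helper: round-trips every test string through $encoding
# and returns a list of failure descriptions (empty when all pass).
sub round_trip_failures {
    my ($encoding, $tests) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my @failures;
    for my $name (sort keys %$tests) {
        my $original = $tests->{$name}{unicode};
        my $result   = eval {
            my $bytes = encode($encoding, $original, $check);
            decode($encoding, $bytes, $check);
        };
        if (!defined $result) {
            push @failures, "$name: died: $@";
        }
        elsif ($result ne $original) {
            push @failures, sprintf '%s: expected [%s], got [%s]',
                $name, code_points($original), code_points($result);
        }
    }
    return @failures;
}

my %tests = (
    'LATIN CAPITAL LETTER A WITH CIRCUMFLEX' => { unicode => "\x{C2}" },
    'LATIN SMALL LETTER DOTLESS I'           => { unicode => "\x{131}" },
);

# Replace 'UTF-8' with "daud" once the custom encoding is installed.
my @failures = round_trip_failures('UTF-8', \%tests);
print @failures ? join("\n", @failures) . "\n" : "all round trips passed\n";
```

Reporting code points rather than the raw strings makes an empty result, like the one seen for {i}, show up as an empty pair of brackets.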

What else have I observed?

I used enc2xs to convert the UCM file into Daud_t.c. My examination of that C file leaves me even more puzzled. I can see how the data structures there appear to correctly map {i} to a Unicode character.

Does anyone have any useful insights or dope-slaps for the obvious thing I'm missing? Is there other information I should have provided that I held back because I didn't want to make a total dump of the stuff?

Update 24 hours later

Thank you graff. Moving the dotless i line higher in the .ucm file did the trick. I was also able to keep the tests passing when I uncommented the RIGHT CURLY BRACKET line, but the LEFT CURLY BRACKET line needed to stay out of circulation. So far, at least.

Encode comes with some utilities, including a sort utility, but that did not reorder things into the working order, so that was a bust.

By the time I got to writing a SOPW, I had run out of creative ideas. Now to move forward with the companion encodings that do lossy conversions to Latin-1 and ASCII. And play around with some fallback conversions. Wheee!

yours,
Michael

In reply to Encoding: my custom encoding fails on one character but works for everything else?! by herveus
