in reply to Re^2: Search & replace of UTF-8 characters ?
in thread Search & replace of UTF-8 characters ?

Although I think your elaboration is flawed.

It's definitely not accurate. At the same time, anyone can understand my model, and they should be able to use it to successfully distinguish between unicodes and encodings like utf-8, and to convert between them. Or they can read a tutorial on unicode, end up completely confused, and not be able to write any code at all.

Decoding is definitely not the process of going from 2660 to black spades suit as you claim.

Encoding = convert unicode integer to utf-8 character for output

Decoding = convert utf-8 character to unicode integer for input

That simple model will allow any unicode beginner to write a lot of code before having to adjust their mental model. For what it's worth, I've never read a single unicode tutorial that will actually allow you to write code.

Re^4: Search & replace of UTF-8 characters ?
by ikegami (Patriarch) on Feb 26, 2010 at 06:03 UTC

    At the same time, anyone can understand my model

    I don't see how that's relevant since the table you posted does not come close to representing encoding.

    Both the input side and the output side of your table are decoded, so it represents neither encoding nor decoding.

    Encoding = convert unicode integer to utf-8 character for output

    I've never heard of "unicode integer" before. Neither has Google. Most people say "Unicode character" or "Unicode code point".

    "UTF-8 characters" is commonly used to mean both "the character encoded using UTF-8" and "the bytes resulting from encoding a character using UTF-8". The former is the result of decoding, the latter is the result of encoding. As such, it's unclear which one you meant.

    That simple model will allow any unicode beginner

    I'm an expert and I have no idea what you mean by those two lines.

    Fixed terminology:

    • Encoding = Converting unicode characters (e.g. black spade suit) into utf-8 bytes (E2 99 A0) for storage or transmission.
    • Decoding = Converting utf-8 bytes (e.g. E2 99 A0) into the unicode characters (black spade suit) they represent in order to do string manipulations such as counting and comparing characters.
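
    The two definitions above can be checked with a few lines of code. This is only a minimal sketch, written in Python for illustration rather than in the Perl the thread is about; the variable names are made up for the example.

```python
# Illustrative sketch only (Python, not the Perl discussed in this thread).
spade = "\u2660"                    # one Unicode character: BLACK SPADE SUIT (U+2660)

encoded = spade.encode("utf-8")     # encoding: character -> UTF-8 bytes
decoded = encoded.decode("utf-8")   # decoding: UTF-8 bytes -> character

print(" ".join(f"{b:02X}" for b in encoded))  # E2 99 A0
print(len(encoded))                 # 3 (bytes)
print(len(decoded))                 # 1 (character)
print(decoded == spade)             # True
```

    Counting gives different answers before and after decoding (3 bytes versus 1 character), which is why string manipulation is done on the decoded form.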

    And now for something short and clear:

    • Encoding is necessary to store text into a file since files can only contain bytes.
    • Decoding gives you back the text you stored into the file.
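
    The same round trip through a file might look like the sketch below. Again this is Python used purely for illustration; the temporary file and the sample text are assumptions of the example, not something from the thread.

```python
# Illustrative sketch only: write encoded bytes to a file, read them back, decode.
import os
import tempfile

text = "\u2660 spades"              # 8 characters, the first is U+2660

fd, path = tempfile.mkstemp()       # throwaway file for the demonstration
os.close(fd)
try:
    with open(path, "wb") as f:     # binary mode: the file holds bytes only
        f.write(text.encode("utf-8"))
    with open(path, "rb") as f:
        raw = f.read()              # what is actually stored in the file
    restored = raw.decode("utf-8")  # decoding gives the text back
finally:
    os.remove(path)

print(len(raw))                     # 10: 3 bytes for the spade + 7 for " spades"
print(restored == text)             # True
```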