
While that may be an accurate statement, trying to decipher what it means is not easy

I didn't want to spend much time confirming something the OP appeared to already know, but thanks for elaborating.

Update: Although I think your elaboration is flawed.

a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:

Determining which character a value represents is unrelated to encoding/decoding.

Decoding from UTF-8:

...
01       => 01   START OF HEADING
...
30       => 30   DIGIT ZERO
...
E2 99 A0 => 2660 BLACK SPADE SUIT
...

Encoding is the reverse operation.

There is no difference between 2660 and black spade suit. Black spade suit is just a meaning assumed by 2660. Decoding is definitely not the process of going from 2660 to black spade suit as you claim.
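
To make the table above concrete, here is a minimal sketch in Python (the thread itself contains no code, so the language and the variable names are only illustrative assumptions). It decodes the three bytes E2 99 A0 and shows that the result is the single code point 2660, whose name is simply a property looked up separately from any decoding step.

```python
import unicodedata

# The three UTF-8 bytes from the table above.
raw = bytes([0xE2, 0x99, 0xA0])

# Decoding: UTF-8 bytes -> Unicode text (here a single code point).
text = raw.decode("utf-8")
code_point = ord(text)

print(len(raw), len(text))     # 3 1  (three bytes, one character)
print(f"{code_point:04X}")     # 2660
print(unicodedata.name(text))  # BLACK SPADE SUIT

# Encoding is the reverse operation: code point -> UTF-8 bytes.
assert chr(0x2660).encode("utf-8") == raw
```

Looking up the name with unicodedata.name() involves no bytes at all, which is the sense in which identifying the character is unrelated to encoding or decoding.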

Re^3: Search & replace of UTF-8 characters ?
by 7stud (Deacon) on Feb 26, 2010 at 01:40 UTC
    Although I think your elaboration is flawed.

    It's definitely not accurate. At the same time, anyone can understand my model, and they should be able to use it to successfully distinguish between unicodes and encodings like utf-8, and to convert between them. Or they can read a tutorial on unicode, be completely confused, and not be able to write any code at all.

    Decoding is definitely not the process of going from 2660 to black spade suit as you claim.

    Encoding = convert unicode integer to utf-8 character for output

    Decoding = convert utf-8 character to unicode integer for input

    That simple model will allow any unicode beginner to write a lot of code before having to adjust their mental model. For what it's worth, I've never read a single unicode tutorial that will actually allow you to write code.

      At the same time, anyone can understand my model

      I don't see how that's relevant since the table you posted does not come close to representing encoding.

      Both the input side and the output side of your table are decoded, so it represents neither encoding nor decoding.

      Encoding = convert unicode integer to utf-8 character for output

      I've never heard of "unicode integer" before. Neither has Google. Most people say "unicode character" or "unicode code point".

      "UTF-8 characters" is commonly used to mean both "the character encoded using UTF-8" and "the bytes resulting from encoding a character using UTF-8". The former is the result of decoding, the latter is the result of encoding. As such, the way you used it is meaningless/confusing/unclear.

      That simple model will allow any unicode beginner

      I'm an expert and I have no idea what you mean by those two lines.

      Fixed terminology:

      • Encoding = Converting unicode characters (e.g. black spade suit) into utf-8 bytes (E2 99 A0) for storage or transmission.
      • Decoding = Converting utf-8 bytes (e.g. E2 99 A0) into the unicode characters (black spade suit) they represent in order to do string manipulations such as counting and comparing characters.

      And now for something short and clear:

      • Encoding is necessary to store text into a file since files can only contain bytes.
      • Decoding gives you back the text you stored into the file.
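
The two lists above can be sketched as a round trip through a file. This is only an illustration in Python under stated assumptions: the sample string and the temporary scratch file are hypothetical and do not come from the thread.

```python
import os
import tempfile

# Two characters: LATIN CAPITAL LETTER A and BLACK SPADE SUIT.
text = "A\u2660"

# Encoding: characters -> bytes, because a file can only contain bytes.
data = text.encode("utf-8")

# Hypothetical scratch file, created only for this illustration.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    with open(path, "rb") as fh:
        stored = fh.read()
finally:
    os.remove(path)

# Decoding: bytes -> characters, giving back the text that was stored.
restored = stored.decode("utf-8")

print(len(stored), len(restored))  # 4 2  (four bytes, two characters)
print(restored == text)            # True
```

Counting on the bytes gives 4 while counting on the decoded text gives 2, which is why string manipulations such as counting and comparing characters are done after decoding.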