in reply to Search & replace of UTF-8 characters ?

$line is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character.

While that may be an accurate statement, trying to decipher what it means is not easy.

Here is how I would put it: a unicode character is not the same as a unicode character encoded in UTF-8. There are many encodings, and UTF-8 is only one of them. However, there is only one unicode character for the copyright symbol. Simply put, if you want to match UTF-8 characters in a string, then you need to use UTF-8 characters in your substitution--not unicode characters.

Here is a code example:

use strict;
use warnings;
use 5.010;
use Encode;

my $unicode_str = "\x{00a9}";
my $utf8_str = encode('utf-8', $unicode_str);
say $utf8_str;  #copyright symbol

my $line = "$utf8_str hello world";
$line =~ s/$utf8_str/\\textcopyright/;
say $line;  #\textcopyright hello world

#Or you can just start with the UTF-8 character
#for the copyright symbol:
$line = "\xC2\xA9 hello world";
say $line;  #copyright symbol followed by 'hello world'
$line =~ s/\xC2\xA9/\\textcopyright/;
say $line;  #\textcopyright hello world
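
The code above matches encoded bytes against encoded bytes. The other direction should work too: decode the line first, then match against the character itself. This is only a minimal sketch, and it assumes the line arrives as raw UTF-8 bytes. The sample input and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes, e.g. read from a file
# that was opened without a decoding layer.
my $bytes = "\xC2\xA9 hello world";

# Decode the bytes into a string of characters.
my $line = decode('UTF-8', $bytes);

# Now the copyright character itself (U+00A9) matches.
$line =~ s/\x{00A9}/\\textcopyright/;

# Encode again before printing, since output is made of bytes.
my $out = encode('UTF-8', $line);
say $out;  #\textcopyright hello world
```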

In my opinion, the easiest way to understand the whole unicode thing is this: a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:

1 => chinese character for the new year
2 => japanese character for fish
3 => happy face
...
...
60,000 => mongolian character for beef
...

So an encoding takes unicode integers and translates them into characters. Different encodings translate the unicode integers into different characters. UTF-8 is just one encoding, which is very popular.

Re^2: Search & replace of UTF-8 characters ?
by ikegami (Patriarch) on Feb 25, 2010 at 18:46 UTC

    While that may be an accurate statement, trying to decipher what it means is not easy

    I didn't want to spend much time confirming something the OP appeared to already know, but thanks for elaborating.

    Update: Although I think your elaboration is flawed.

    a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:

    Determining which character a value represents is unrelated to encoding/decoding.

    Decoding from UTF-8:

    ...
    01 => 01 START OF HEADING
    ...
    30 => 30 DIGIT ZERO
    ...
    E2 99 A0 => 2660 BLACK SPADE SUIT
    ...

    Encoding is the reverse operation.

    There is no difference between 2660 and black spade suit. Black spade suit is just a meaning assumed by 2660. Decoding is definitely not the process of going from 2660 to black spade suit as you claim.
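
    The last row of the decoding table can be checked with a short sketch. It assumes the core Encode and charnames modules; the variable names are only illustrative. Decoding turns the three bytes into one character, and the name is merely looked up from that character's code point afterwards.

    ```perl
    use strict;
    use warnings;
    use 5.010;
    use Encode qw(decode);
    use charnames ();

    # The three UTF-8 bytes from the table above.
    my $bytes = "\xE2\x99\xA0";

    # Decoding yields a single character.
    my $char = decode('UTF-8', $bytes);

    say length($bytes);                   # 3
    say length($char);                    # 1
    say sprintf('%04X', ord($char));      # 2660

    # The name is a property of the code point, not a product of decoding.
    say charnames::viacode(ord($char));   # BLACK SPADE SUIT
    ```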

      Although I think your elaboration is flawed.

      It's definitely not accurate. At the same time, anyone can understand my model, and they should be able to use it to successfully distinguish between unicodes and encodings like utf-8--and convert between them. Or they can read a tutorial on unicode and be completely confused, and not be able to write any code at all.

      Decoding is definitely not the process of going from 2660 to black spade suit as you claim.

      Encoding = convert unicode integer to utf-8 character for output

      Decoding = convert utf-8 character to unicode integer for input

      That simple model will allow any unicode beginner to write a lot of code before having to adjust their mental model. For what it's worth, I've never read a single unicode tutorial that will actually allow you to write code.

        At the same time, anyone can understand my model

        I don't see how that's relevant since the table you posted does not come close to representing encoding.

        Both the input side and the output side of your table are decoded, so it represents neither encoding nor decoding.

        Encoding = convert unicode integer to utf-8 character for output

        I've never heard of "unicode integer" before. Neither has Google. Most people say "unicode character" or "unicode code point".

        "UTF-8 characters" is commonly used to mean both "the character encoded using UTF-8" and "the bytes resulting from encoding a character using UTF-8". The former is the result of decoding, the latter is the result of encoding. As such, it's meaningless/confusing/unclear how you used it.

        That simple model will allow any unicode beginner

        I'm an expert and I have no idea what you mean by those two lines.

        Fixed terminology:

        • Encoding = Converting unicode characters (e.g. black spade suit) into utf-8 bytes (E2 99 A0) for storage or transmission.
        • Decoding = Converting utf-8 bytes (e.g. E2 99 A0) into the unicode characters (black spade suit) they represent in order to do string manipulations such as counting and comparing characters.
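
        The two definitions above can be sketched as a round trip, assuming the core Encode module; the variable names are illustrative only.

        ```perl
        use strict;
        use warnings;
        use 5.010;
        use Encode qw(encode decode);

        # One character: black spade suit.
        my $text = "\x{2660}";

        # Encoding: character -> bytes.
        my $bytes = encode('UTF-8', $text);
        my $hex   = join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
        say $hex;    # E2 99 A0

        # Decoding: bytes -> character, ready for counting and comparing.
        my $back = decode('UTF-8', $bytes);
        say length($back);                                   # 1
        say $back eq $text ? 'round trip ok' : 'mismatch';   # round trip ok
        ```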

        And now for something short and clear:

        • Encoding is necessary to store text into a file since files can only contain bytes.
        • Decoding gives you back the text you stored into the file.
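
        As a rough illustration of the file case, here is a sketch that lets an I/O layer do the encoding and decoding. It assumes the core File::Temp module and a throwaway scratch file; the names are hypothetical.

        ```perl
        use strict;
        use warnings;
        use 5.010;
        use File::Temp qw(tempfile);

        my $text = "\x{2660} spade";    # 7 characters

        # Hypothetical scratch file, removed when the program exits.
        my ($tmp_fh, $path) = tempfile(UNLINK => 1);
        close $tmp_fh;

        # Writing: the layer encodes the characters into bytes.
        open(my $out, '>:encoding(UTF-8)', $path) or die "open: $!";
        print {$out} $text;
        close $out or die "close: $!";

        my $size = -s $path;
        say $size;    # 9 (3 bytes for the spade, 6 for the rest)

        # Reading: the layer decodes the bytes back into characters.
        open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
        my $read = do { local $/; <$in> };
        close $in;

        say length($read);                                  # 7
        say $read eq $text ? 'same text' : 'different';     # same text
        ```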