The unicode / utf8 struggle, part 2: regexes

isync has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: The unicode / utf8 struggle, part 2: regexes by Joost (Canon) on May 17, 2007 at 12:03 UTC
You are confused. The only requirement for regexes or any other string operation is that the strings are correctly flagged as utf-8 (internal multi-byte format) or not (internal single-byte format). My guess is that your $internal_format_string isn't flagged as utf-8.

You can do `print utf8::is_utf8($internal_format_string)` to check for the utf-8 flag. Note: this says NOTHING about the actual encoding, since there is no way to reliably determine the encoding aside from reading this flag.

Also, the way to convert a string that's in utf-8 encoding but not flagged is to use `Encode::decode("utf8",$octets);`, NOT encode_utf8, since that does exactly the opposite of what you think.

Also also, if you're reading or writing unicode/utf8 data from a handle, you must* set the ":utf8" layer first, using binmode or open. This includes STDIN, STDOUT and STDERR, unless you're using the -C perlrun flag. If you make sure to set the IO layers correctly you shouldn't have to worry about anything else, though it might still help to upgrade to the latest perl (5.8.8, currently).

* update: well, ok, not MUST, but it does make life a whole lot easier.

update 2: in other words, perl relies on you, the programmer, to correctly identify the encoding of any incoming or outgoing string (via IO layers, i.e. binmode() or open() arguments) and of literal strings (using the utf8 pragma to signal utf8-encoded scripts). If you correctly specify those encodings, perl will internally convert those strings to either "utf8" (which is more or less identical to the UTF-8 unicode encoding, at least on non-EBCDIC systems) or whatever default single-byte (8-bit) encoding your system uses, and it will set a flag for each string to signal which of those two encodings is used. The intention is that the programmer should normally not have to care at all which of the two encodings is used.
All relevant string operations check that 1-bit flag to see how the string(s) in question should be interpreted, and return a correctly flagged result in one of the two encodings. Then, whenever the string is sent out to an IO handle, it gets converted to the requested output encoding (see the binmode/open remark above).

Now, the unicode support in perl is relatively new, so there are probably still bugs in it, but most bugs I've seen in real-world programs were due to misunderstanding of the above, directly and wrongly messing with the utf8 flag of strings (see Encode's _utf8_on and _utf8_off if you're curious), using the bytes pragma, or using old modules that try to handle utf-8 encoded text without setting the utf-8 flag.

Oh, and there are still a few unicode-related bugs in DBD::mysql, but it's getting better :-)

"What should it profit a man, if he should win a flame war, yet lose his cool?"
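A minimal sketch of the decode/encode distinction Joost describes. It assumes a hard-coded two-octet sample (`$octets`) standing in for data that arrives unflagged; the variable names are illustrative, and the strict `UTF-8` encoding name is used here instead of the lax `utf8` from the post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "é" as two UTF-8 octets, as it might arrive from a socket or an old module.
my $octets = "\xC3\xA9";

# decode: octets -> character string. This is the direction needed for
# UTF-8 data that is not yet flagged.
my $chars = decode('UTF-8', $octets);

printf "octets: length %d, flagged: %s\n",
    length($octets), utf8::is_utf8($octets) ? 'yes' : 'no';
printf "chars:  length %d, flagged: %s\n",
    length($chars), utf8::is_utf8($chars) ? 'yes' : 'no';

# encode goes the other way: characters -> octets, e.g. right before
# writing to a handle that has no encoding layer.
my $again = encode('UTF-8', $chars);

# Alternatively, let an IO layer do the encoding on output.
binmode(STDOUT, ':encoding(UTF-8)');
print "$chars\n";
```

The decoded string has length 1 where the octet string has length 2, which is the difference regexes and other string operations see.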
Re: The unicode / utf8 struggle, part 2: regexes by graff (Chancellor) on May 17, 2007 at 14:20 UTC
Technically, it's true that "perl's internal character format" is not exactly utf8, but the difference applies only to the unicode characters in the range U+0080 - U+00FF, which are supposed to be two-byte wide characters in "official" utf8, but are stored internally by Perl as single bytes. But you should not have to concern yourself with this technical detail. As far as most Perl programmers are concerned, perl's internal character format is utf8, and when you have to deal with input or output data that is not utf8, then everything else you need is provided by PerlIO and/or Encode.

"This is my procedure pipeline:
1. read a string from variously encoded sources --> decode it properly to get "perl's internal format"
2. do various things with the textual data
3. re-encode it to utf8 (effectively a transport/storage format) and write it to disk (in binmode)."

You got the first step right -- no problem there (but go ahead and use the term "utf8" in place of "perl's internal format" -- that's true enough).

The second step is fine, assuming "things" include any of the character-based functions (index, substr, length, split, regex matches, s///, tr///, and so on). It's all done with characters, and you just need to think about characters (not bytes). Something like `s/(\d{2})\D(\d{2})\D(\d{4})/$3-$1-$2/g` will rearrange digit characters, no matter whether they are ASCII, Arabic-Indic, Devanagari or other Unicode decimal digits.

The third step is a misunderstanding. If you have successfully "decode"d non-unicode input data to perl's utf8, and you have set your output file handle to use the :utf8 layer, the output will be valid utf8 character data (1 to 4 bytes per character, depending on which code points are involved).

The correct perl syntax for a literal unicode character code point is: `"\x{hhhh}"`; you can safely use that for all code points: `print "\x{0030}\n";` will print the ASCII ZERO character followed by linefeed.
(But for unicode characters below U+0100, you can also use just `"\xhh"`.)

UPDATE: Sorry, I should have noticed a few other things in your post...

"had the following regex: `$internal_format_string =~ s/\n//g;` and it removed some letters, spaces and a lot more!"

I'll bet that if you deleted carriage returns as well (`s/[\r\n]+//g`), things would look better. I think there may be some "uncharted territory" involving interactions among PerlIO layers -- when you set an encoding mode on a file handle, it might affect the choice of CRLF vs. "raw" (or LF) mode in some unexpected way. You may want to study PerlIO on that issue.
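To illustrate the point that character-based operations do not care which script the digits come from, here is a small sketch. The sample date (05/17/2007 written with Devanagari digits) and the variable names are made up for the example; the substitution itself is the one quoted above.

```perl
use strict;
use warnings;

# A hypothetical date, 05/17/2007, written with Devanagari digits
# (U+0966 is zero, U+096F is nine).
my $date = "\x{0966}\x{096B}/\x{0967}\x{096D}/\x{0968}\x{0966}\x{0966}\x{096D}";

# The substitution from the reply: MM?DD?YYYY becomes YYYY-MM-DD.
(my $iso = $date) =~ s/(\d{2})\D(\d{2})\D(\d{4})/$3-$1-$2/g;

# length() counts characters, not bytes.
printf "%d characters before, %d after\n", length($date), length($iso);

# Encode on the way out so the terminal or file receives UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$iso\n";
```

Both strings are ten characters long; only the order of the digit groups changes.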
Re: The unicode / utf8 struggle, part 2: regexes by isync (Hermit) on May 17, 2007 at 15:40 UTC
First, thanks A LOT for the insightful answers!

Both of you are talking about filehandles, I wish I were using some... In fact I am wrangling here with the output of various modules, in this case the LWP lib. "decode it properly to get 'perl's internal format'" means I use the $mess->decoded_content() function of HTTP::Message, about which the doc says: "Returns the content with any Content-Encoding undone and strings mapped to perl's Unicode strings." For me this means: "internal format", which is in fact a utf8-encoding-dialect (but I should forget about that anyway..)

The utf8 flag of the module's output is ON - and this is where the confusion happens: I thought that the utf8 flag is set although the data is really unicode octets.. But now I understand it as: decoded_content() returns utf8 encoded unicode (step 1), my perl script and its regexes should handle utf8 encoded unicode (step 2) - so everything is fine. And output should also be utf8 encoded unicode. Which it already is, so I modified the step to skip the wrong encode step (new step 3) - am I doing it right now?

For the interested reader: in fact I use Storable to serialize my resulting data structure as a whole, then I gzip the freeze'd data and write it to disk with a simple binmode (and thus not :utf8) filehandle. Any problems here? utf8 data and utf8-flag should stay intact over the pipeline.
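For readers who want to see what decoded_content() hands back, here is a minimal sketch. It assumes the HTTP::Response class from the LWP distribution is installed and builds a response by hand so that no network access is needed; the body and header values are invented for the example.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response; the body is "café" as five UTF-8 octets.
my $resp = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "caf\xC3\xA9",
);

my $raw  = $resp->content;          # octets exactly as received
my $text = $resp->decoded_content;  # character string, charset undone

printf "content:         length %d\n", length($raw);
printf "decoded_content: length %d\n", length($text);
```

The decoded string is one unit shorter than the raw content because the two octets of "é" have become a single character, so it should not be decoded a second time.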
Re^2: The unicode / utf8 struggle, part 2: regexes by graff (Chancellor) on May 17, 2007 at 18:59 UTC
"And output should also be utf8 encoded unicode. Which it already is so I modified the step to skip the wrong encode step (new step 3) - am I doing it right now?"

It would be easier to answer that if you showed us a relevant code snippet. And if you try the snippet yourself, that will probably answer the question. Check out this little unicode tool (shameless plug for a prog I posted recently), in case that helps to validate your data.

"For the interested reader: in fact I use storable to serialize my resulting data structure as whole, then I gzip the freeze'd data and write it to disk with a simple binmode (and thus not :utf8) filehandle. Any problems here? utf8 data and utf8-flag should stay intact over the pipeline."

The utf8 flag is strictly a perl-internal attribute of scalar values. Once data is written to any sort of file (including any pipe), it's just data, and what happens to it after that point depends on what sort of process is reading it, and how that process chooses to interpret what is being read.

There is a section of the Storable man page about utf8 (under the heading "FORWARD COMPATIBILITY"), which you should consult. It looks like it will "do the right thing" for you by default (retain the utf8 flag as part of the "freeze"d data structure so that a downstream "thaw" gets it), but it'll be worth testing to be sure. (I haven't used it, so I don't know.)
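The "worth testing" suggestion can be tried with a short round trip like the following. This is only a sketch: the hash and its contents are invented, the gzip step and the file write are left out, and the result may depend on the Storable and perl versions in use.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# A character string containing a code point above U+00FF.
my $data = { title => "caf\x{E9} \x{263A}" };

# freeze() yields a string of bytes that can be compressed and written
# to a handle in binmode; thaw() rebuilds the structure from it.
my $frozen = freeze($data);
my $copy   = thaw($frozen);

print "same characters: ",
    ($copy->{title} eq $data->{title} ? "yes" : "no"), "\n";
print "utf8 flag kept:  ",
    (utf8::is_utf8($copy->{title}) ? "yes" : "no"), "\n";
```

If both lines report "yes", the character data survived serialization without any extra encode or decode step.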
Re: The unicode / utf8 struggle, part 2: regexes by mattr (Curate) on May 22, 2007 at 09:41 UTC
Hi,

The above masterful comments are just that, but since I noticed this module in the CPAN Nodelet I thought I'd mention HTML::Encoding. Apparently it helps you figure out what encoding is coming in at you, using the function mentioned above. Might even work! But I haven't used it myself. Good luck!

HTML::Encoding helps to determine the encoding of HTML and XML/XHTML documents...

`use HTML::Encoding 'encoding_from_http_message';`
`use LWP::UserAgent;`
`use Encode;`
`my $resp = LWP::UserAgent->new->get('http://www.example.org');`
`my $enco = encoding_from_http_message($resp);`
`my $utf8 = decode($enco => $resp->content);`
Re: The unicode / utf8 struggle, part 2: regexes by Juerd (Abbot) on Jun 13, 2007 at 19:22 UTC
"`$internal_format_string =~ s/\n//g;` and it removed some letters, spaces and a lot more!"

Huh? I'd like to see an example of that.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
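No reproduction was posted in the thread, so the following is only a guess built on graff's carriage-return suggestion. It assumes CRLF line endings in the input (the sample text is invented) and shows that the regex from the question leaves the carriage returns behind; printed raw to a terminal, those would move the cursor back to the start of the line and make later text overwrite earlier text, which could look like missing letters.

```perl
use strict;
use warnings;

my $text = "first line\r\nsecond line\r\n";   # hypothetical CRLF input

(my $lf_only = $text) =~ s/\n//g;        # the regex from the question
(my $both    = $text) =~ s/[\r\n]+//g;   # graff's suggested variant

# Show leftover carriage returns as a visible \r instead of printing them raw.
for my $s ($lf_only, $both) {
    (my $visible = $s) =~ s/\r/\\r/g;
    print "$visible\n";
}
```

The first printed line still contains two carriage returns; the second contains none.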