in reply to Re^3: performance of length() in utf-8
in thread performance of length() in utf-8
> Does that help?

In some way, but not completely. :op
I am in Win1252 and the code 199 (= 0xc7) corresponds to the upper-case c-cedilla character. Okay.

    chcp
    Active code page: 1252

    perl -e "print chr 199"
    Ç

    perl -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', chr 199"
    c7
So if I encode the byte 199 to UTF-8 (I seem to understand "from the current console codepage"), I get the values c3 87, which correspond to the Unicode character U+00C7 "LATIN CAPITAL LETTER C WITH CEDILLA". I still follow.

    perl -MEncode -e "print Encode::encode_utf8 chr 199"
    Ç

    perl -MEncode -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', Encode::encode_utf8 chr 199"
    c3 87
If I decode a raw "c3 87" I get back my "Ç", so everything is how I suppose it to be.

    perl -MEncode -e "print Encode::decode_utf8 \"\xc3\x87\""
    Ç
> Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be queried to figure out how many characters are represented, thus the slow length.

Well... Not sure. Here is a simple utf8-1.xml file:
(To be sure: hex-editing the file actually shows C3 87 in place of the char 199.)

    <?xml version="1.0" encoding="utf-8"?>
    <root>Ç foo</root>
I can see:

    use strict;
    use warnings;
    use feature 'say';
    #~ use utf8;
    use XML::SAX::ParserFactory;

    $|++;

    #to force one kind of parser for ParserFactory->parser()
    #~ $XML::SAX::ParserPackage = "XML::SAX::PurePerl";
    #~ $XML::SAX::ParserPackage = "XML::SAX::Expat"; #no xml_decl
    #~ $XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
    #~ $XML::SAX::ParserPackage = "XML::LibXML::SAX";
    $XML::SAX::ParserPackage = "XML::LibXML::SAX::Parser";

    {
        package MySax;
        use feature 'say';
        use Devel::Peek;

        sub new {
            my $class = shift;
            return bless {}, $class;
        }

        sub hexprint {
            my ($self, $data) = @_;
            join ' ', map { sprintf '%02X', $_ } unpack 'C*', $data;
        }

        sub characters {
            my ($self, $data) = @_;
            my $content = $data->{Data};
            say "characters for elt: ". $content;
            say "bytes for elt: ". $self->hexprint($content);
            Dump($content);
        }
    }

    my $handler = new MySax;
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
    say "parser is " . ref $parser;
    say "file: " . $ARGV[0] if $ARGV[0];
    $parser->parse_file($ARGV[0] // *DATA);

    __DATA__
    <empty/>
Can I assume the following:

    perl sax_utf.pl utf8-1.xml
    parser is XML::LibXML::SAX::Parser
    file: utf8-1.xml
    characters for elt: Ç foo
    bytes for elt: C7 20 66 6F 6F
    SV = PV(0x288c658) at 0x233d2e8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
      PV = 0x2b28228 "\303\207 foo"\0 [UTF8 "\x{c7} foo"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
> Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes, and leaves it as some combination of high bit and low bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no introspection.

If the string is already in UTF-8, why process it with encode_utf8?
    sub characters {
        use Encode;
        my ($self, $data) = @_;
        my $content = Encode::encode_utf8 $data->{Data};
        say "characters for elt: ". $content;
        say "bytes for elt: ". $self->hexprint($content);
        Dump($content);
    }
So unpacking the string shows the expected C3 87 bytes for the char 199, confirmed by the octal dump, but the UTF8 flag has vanished? I'm puzzled!

    characters for elt: Ç foo
    bytes for elt: C3 87 20 66 6F 6F
    SV = PV(0x28ba328) at 0x236d2b8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK)
      PV = 0x2b548b8 "\303\207 foo"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
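The two dumps can be compared without any XML parser. The following is only a minimal sketch, not the parser's own code: `describe` is a hypothetical helper, and the test string is assumed to match the content of utf8-1.xml. It reports the internal UTF8 flag and what length() counts, before and after encode_utf8.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: report the two properties that differ between the dumps,
# the internal UTF8 flag and the number of units length() counts.
sub describe {
    my ($str) = @_;
    return sprintf 'flag=%s length=%d',
        (utf8::is_utf8($str) ? 'on' : 'off'), length $str;
}

# Character string, comparable to what the SAX handler receives:
# 5 characters, stored internally as 6 UTF-8 octets.
my $chars = decode_utf8("\xc3\x87 foo");

# Octet string: the same 6 octets, but each octet now counts as one unit.
my $octets = encode_utf8($chars);

print 'decoded: ', describe($chars),  "\n";   # decoded: flag=on length=5
print 'encoded: ', describe($octets), "\n";   # encoded: flag=off length=6
```

The stored octets are the same in both strings, as the PV lines of the dumps show. The flag only tells Perl whether to read them as UTF-8 encoded characters or as individual octets.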
Now I am not sure of the byte representation:

    parser is XML::LibXML::SAX::Parser
    file: utf8-2.xml
    Wide character in say at sax_utf.pl line 36.
    characters for elt: Ç foo €
    bytes for elt: C7 20 66 6F 6F 20 20AC
    SV = PV(0x2a61748) at 0x250ade8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
      PV = 0x2cefc98 "\303\207 foo \342\202\254"\0 [UTF8 "\x{c7} foo \x{20ac}"]
      CUR = 10
      LEN = 12
      COW_REFCNT = 1
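One way to look at the "20AC" entry is that the values listed are code points of a character string, not octets. The following is a hedged sketch with two hypothetical helpers, `code_points` and `utf8_octets`, that print both views of the same text. The binmode line assumes a terminal that expects UTF-8; a cp1252 console would need a different output layer.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: code points of a character string, one per character.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Hypothetical helper: octets of the UTF-8 encoding of the same string.
sub utf8_octets {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode_utf8($str);
}

# Assumed to match the text shown in the dump for utf8-2.xml.
my $text = "\x{c7} foo \x{20ac}";

# Encode on output, so printing characters above 0xFF does not trigger
# the "Wide character" warning. Assumption: the terminal expects UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

print "text:        $text\n";
print 'code points: ', code_points($text), "\n";
print 'octets:      ', utf8_octets($text), "\n";
```

With this text, the octet view gives C3 87 20 66 6F 6F 20 E2 82 AC, and the code point view ends in U+20AC.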
While I still do not understand the missing UTF8 flag...

    parser is XML::LibXML::SAX::Parser
    file: utf8-2.xml
    characters for elt: Ç foo €
    bytes for elt: C3 87 20 66 6F 6F 20 E2 82 AC
    SV = PV(0x2991768) at 0x243ade8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK)
      PV = 0x2c1fc98 "\303\207 foo \342\202\254"\0
      CUR = 10
      LEN = 12
      COW_REFCNT = 1
Re^5: performance of length() in utf-8
by kennethk (Abbot) on Mar 11, 2016 at 21:40 UTC
by hippo (Archbishop) on Mar 11, 2016 at 23:21 UTC