tod222 has asked for the wisdom of the Perl Monks concerning the following question:

HTML::Entities isn't correctly handling quotes as defined in this Unicode table.

This code:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

my $line = "This is a test of \xe2\x80\x9cquotes\xe2\x80\x9d\n";
print encode_entities($line, "\200-\377"); # looking for &ldquo; &rdquo; in the output
print $line;

produces

This is a test of &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
This is a test of “quotes”

instead of the desired

This is a test of &ldquo;quotes&rdquo;
This is a test of “quotes”

Any hints?

Re: HTML::Entities and Unicode quotes
by Your Mother (Archbishop) on Aug 20, 2011 at 02:06 UTC

    This will, I hope, explain what’s going on:

    use warnings;
    use strict;
    use Encode;
    use HTML::Entities;

    my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";
    print "Is $str UTF-8? ", Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

    $str = decode("UTF-8", $str, Encode::FB_CROAK);
    binmode STDOUT, ":encoding(UTF-8)";
    print "It's still $str... UTF-8 now? ", Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

    my $wide_chars = "\x{201C}quotes\x{201D}";
    print "How about this version: $wide_chars? ", Encode::is_utf8($wide_chars) ? "Yes!\n" : "No...\n";

    print "Entities: ", encode_entities($str), $/;

    __END__
    Is “quotes” UTF-8? No...
    It's still “quotes”... UTF-8 now? Yes!
    How about this version: “quotes”? Yes!
    Entities: &ldquo;quotes&rdquo;

    Update: changed $non_combining to $wide_chars as the name was misleading.

      The internal storage format (returned by is_utf8) has absolutely nothing to do with this.
      use Encode;
      use HTML::Entities;

      my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";

      utf8::downgrade($str);
      print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";

      utf8::upgrade($str);
      print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";

      0 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
      1 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;

        It was just to show what perl assumed the natural state of the strings to be. You artificially flipped the switch on and off, with no decoding or encoding. You also showed code using the functions of utf8, which is probably a bad example to set. You know exactly what you’re doing, but when someone who doesn’t sees a top monk using them, they think: oh, that must be a good idea, I’ll use upgrade and downgrade to “fix” my encodings too.

      Thank you for this excellent response. I found it quite illuminating.

      Before posting I'd spent about 30 minutes reading perlunifaq and searching here on Perlmonks without things getting much clearer. In fact, some of what I read here was a bit disconcerting; the complaints that Perl no longer 'just worked' seemed apropos.

      One source of my original confusion was that I had a file which, when examined using 'od -t x1 foo2', contained \xe2\x80\x9c and \xe2\x80\x9d sequences, yet displayed correctly on Ubuntu with 'cat' in gterm. Since the Unicode table I linked showed that the sequences were valid representations of “ and ”, I wondered why HTML::Entities wasn't handling them correctly, particularly when cat could.

      Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

      A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

      The other is that I'd like for Perl to 'just work' to whatever extent possible. Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

        Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

        Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

        A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

        It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.
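To make the bytes-versus-character relationship concrete, here is a minimal sketch using only Encode's encode and decode. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character: U+201D RIGHT DOUBLE QUOTATION MARK.
my $char = "\x{201D}";

# Encoding that character as UTF-8 yields three bytes.
my $bytes = encode("UTF-8", $char);
my $hex   = join " ", map { sprintf "%02x", ord } split //, $bytes;
print "UTF-8 bytes: $hex\n";    # e2 80 9d

# Decoding the same three bytes gives back a single character.
my $back = decode("UTF-8", "\xe2\x80\x9d");
printf "Decoded: U+%04X (%d character)\n", ord($back), length $back;
```

So \xe2\x80\x9d is three bytes on disk, but one character once it has been decoded.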

        Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?

        There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default for open.
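A rough sketch of what that could look like at the top of a script. The scratch file from File::Temp is only an assumption for the demonstration; the point is that an open() without an explicit layer picks up the pragma's default, while an explicit layer such as :raw still overrides it.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # STDIN/STDOUT/STDERR plus the default for open()
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# No layer is given here, so the pragma's default (UTF-8) applies.
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} "\x{201C}quotes\x{201D}\n";
close $out;

# Read the file back as raw bytes to see what was actually stored.
open my $in, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";
```

The pragma is lexically scoped, so it only affects open() calls in the file or block where it appears. Handles opened by other modules are not covered.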

Re: HTML::Entities and Unicode quotes
by ikegami (Patriarch) on Aug 20, 2011 at 06:35 UTC

    encode_entities expects a string of text as its argument. As far as it's concerned, you passed the following 12 characters:

    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    U+0080 An unnamed control character
    U+009C STRING TERMINATOR, a control character
    q
    u
    o
    t
    e
    s
    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    U+0080 An unnamed control character
    U+009D OPERATING SYSTEM COMMAND, a control character

    You want to pass

    U+201C LEFT DOUBLE QUOTATION MARK
    q
    u
    o
    t
    e
    s
    U+201D RIGHT DOUBLE QUOTATION MARK

    So instead of

    encode_entities("\xe2\x80\x9cquotes\xe2\x80\x9d")

    you should have used

    encode_entities("\x{201C}quotes\x{201D}")

    What you passed is the bytes resulting from encoding "\x{201C}quotes\x{201D}" using UTF-8, so you could also use

    encode_entities(decode("UTF-8", "\xe2\x80\x9cquotes\xe2\x80\x9d"))

    decode comes from Encode.
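Putting that together for the line from the original question, a minimal sketch might look like this. It assumes the input really is UTF-8, and it uses encode_entities with its default set of unsafe characters instead of the "\200-\377" range, since that range stops at U+00FF and would not match U+201C or U+201D.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The UTF-8 encoded bytes from the original question.
my $bytes = "This is a test of \xe2\x80\x9cquotes\xe2\x80\x9d\n";

# Turn bytes into characters before handing them to encode_entities.
my $text = decode("UTF-8", $bytes);

# With no second argument, the default unsafe set is used, which
# covers the curly quotes (and also <, &, >, ' and ").
my $html = encode_entities($text);

print $html;    # This is a test of &ldquo;quotes&rdquo;
```

If the text comes from a file, decoding on input with an :encoding(UTF-8) layer on the filehandle should give the same result without an explicit decode call.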

Re: HTML::Entities and Unicode quotes
by Anonymous Monk on Aug 20, 2011 at 02:08 UTC

    HTML::Entities isn't correctly handling quotes as defined in this Unicode table.

    A poor workman blames his tools :)

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use utf8;
    use HTML::Entities;
    binmode STDOUT, ':encoding(UTF-8)';

    {
        my $line = "This is a test of \xe2\x80\x9cquotes\xe2\x80\x9d\n";
        print encode_entities($line, "\200-\377"); # looking for &ldquo; &rdquo; in the output
        print $line;
    }
    print '#' x 11, "\n";
    {
        my $line = 'xThis is a test of '.chr(8220).'quotes'.chr(8222)."\n";
        print encode_entities($line, '\x{201c}\x{201e}');
        print $line;
    }
    print '#' x 11, "\n";
    {
        my $line = 'xThis is a test of '.chr(8220).'quotes'.chr(8222)."\n";
        print encode_entities($line, chr(8220).chr(8222));
        print $line;
    }
    __END__
    This is a test of &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
    This is a test of “quotes”
    ###########
    xThis is a test of &ldquo;quotes&bdquo;
    xThis is a test of “quotes„
    ###########
    xThis is a test of &ldquo;quotes&bdquo;
    xThis is a test of “quotes„