in reply to Re^2: Do I have a unicode problem, or is this something else?
in thread Do I have a unicode problem, or is this something else?

Hi ikegami,

Thanks for that. So I understand that this is a decimal code, although I'm not sure what U+00ED means.

a) Is there a function like the decode function which will parse a variable and replace these strings with the correct unicode characters?

b) What is this style of encoding called, so I can google it?

Regards

Steve

Re^4: Do I have a unicode problem, or is this something else?
by ikegami (Patriarch) on Jun 10, 2010 at 23:05 UTC

    although I'm not sure what U+00ED means.

    Unicode character 00ED hex.
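To make the relationship concrete: 237 in decimal and ED in hex are the same number, so &#237; and U+00ED name the same character (í, LATIN SMALL LETTER I WITH ACUTE). A minimal sketch using only Perl built-ins; the variable names are just for illustration:

```perl
use strict;
use warnings;

# &#237; carries a decimal code point; U+00ED is the same number in hex.
my $decimal = 237;
my $hex     = sprintf('%04X', $decimal);

# chr() turns a code point into a one-character string.
my $char = chr($decimal);

# Encode on output so the terminal receives UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
printf "&#%d; is U+%s (%s)\n", $decimal, $hex, $char;
```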

    What is this style of encoding called so I can do a google on it.

    XML. Specifically, it's an XML numeric character reference (often loosely called an entity).

    Is there a function like the decode function which will parse a variable and replace these strings with the correct unicode characters?

    It is the correct unicode character.

    But if you wish to expand the entities, an easy way is to use XML::LibXML since it doesn't use entities unless required.

    use strict;
    use warnings;
    use XML::LibXML qw( );

    my $xml = '<?xml version="1.0"?><root>&#237;</root>';

    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_string($xml);
    $doc->setEncoding('UTF-8');

    open(my $fh, '>:bytes', 'xml') or die;
    print($fh $doc->toString);

      Hi ikegami,

      Thanks very much for your reply. I didn't have XML::LibXML installed on my PC (Kubuntu), so I went into cpan and installed it, but cpan is complaining:

      It says there is no Makefile, and it's right. So I went into the build directory. There is a Makefile.PL, so I executed it and got "Makefile.PL: command not found":

      root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX# dir
      Av_CharPtrPtr.c  Changes  docs   dom.h    lib        LibXML.pod  LICENSE      MANIFEST  perl-libxml-mm.c  perl-libxml-sax.c  ppport.h  t     TODO     xpath.c  xpath.h
      Av_CharPtrPtr.h  debian   dom.c  example  LibXML.pm  LibXML.xs   Makefile.PL  META.yml  perl-libxml-mm.h  perl-libxml-sax.h  README    test  typemap  xpathcontext.h
      root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX# Makefile.PL
      Makefile.PL: command not found
      root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX#

      I'm now looking for another Parser - maybe I could just use a regular expression?

      Update: I've tried this regular expression and it seems to work.

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my $xml = '<?xml version="1.0"?><root>&#237;</root>';
      print($xml,"\n");
      $xml =~ s/\&\#(\d*);/chr($1)/gse;
      print($xml,"\n");

      So thanks again for pointing me in the right direction, ikegami, as always.

      Regards

      Steve

        Two major problems:

        • You didn't encode the character using the XML's encoding before inserting it in the XML. You didn't even check if the XML's encoding could encode the character.
        • You're decoding entities that were encoded because the character they represent would break the XML if present. (e.g. &#38;).
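The first point can be illustrated with the core Encode module: chr(237) yields a character, while the XML file holds bytes, and which bytes depends on the document's encoding. A rough sketch, assuming the document is UTF-8 or ISO-8859-1 (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char = chr(237);   # the character U+00ED, not yet bytes

# The same character becomes different bytes in different encodings.
my $utf8   = encode('UTF-8',      $char);
my $latin1 = encode('ISO-8859-1', $char);

print "UTF-8 bytes:      ", unpack('H*', $utf8),   "\n";
print "ISO-8859-1 bytes: ", unpack('H*', $latin1), "\n";

# With FB_CROAK, encode() dies when the target encoding cannot
# represent the character; US-ASCII has no U+00ED.
# LEAVE_SRC keeps encode() from modifying the source string.
my $copy = $char;
my $ok = eval {
    encode('US-ASCII', $copy, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "US-ASCII can", ($ok ? '' : 'not'), " encode U+00ED\n";
```

In that last case the character would have to stay as a reference such as &#237; rather than be expanded.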

        Your solution also has some potential bugs.

        • You convert what appears to be entities in CDATA sections. (Your XML generator might not produce these.)
        • You don't expand &iacute;. (Your XML generator might not produce these.)
        • You don't expand &#xED;. (Your XML generator might not produce these.)
        • The XML is encoded, but you try to match against it as if it were text. (You're matching ASCII characters, and your XML generator might always use an ASCII-derived encoding.)

        On the stylistic side,

        • \d is way too encompassing. You want [0-9]. (I'm listing this as stylistic since it won't be an issue with valid XML.)
        • You have a useless modifier (/s) on your substitution operator; the pattern contains no "." for it to affect.
        • You have useless escapes in your pattern (\& and \#).
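Several of the points above can be folded into one revision of the substitution. This is only a sketch, not a replacement for a real parser: it still works on the raw string, knows nothing about CDATA sections, named entities or the document's encoding, and the helper name expand_char_refs is made up for illustration.

```perl
use strict;
use warnings;

# Characters that must stay escaped, or the XML would break.
my %keep = map { $_ => 1 } ('<', '>', '&', '"', "'");

# Hypothetical helper: expand decimal (&#237;) and hex (&#xED;)
# character references, leaving XML-special characters alone.
sub expand_char_refs {
    my ($xml) = @_;
    $xml =~ s{&#(?:x([0-9A-Fa-f]+)|([0-9]+));}{
        my $char = chr(defined $1 ? hex($1) : $2);
        $keep{$char} ? $& : $char;   # $& is the whole matched reference
    }ge;
    return $xml;
}

my $xml = '<root>&#237; &#xED; &#38; &#60;</root>';
my $out = expand_char_refs($xml);

# The result is a character string; encode it on output.
binmode(STDOUT, ':encoding(UTF-8)');
print "$out\n";
```

Here &#38; and &#60; are left untouched, since expanding them to & and < would make the document ill-formed.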

        Update: Added second major problem.

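        Most likely, Makefile.PL is not executable, and in any case the current directory is not in your PATH, which is why the shell reports "command not found". You are supposed to run it like this: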

        perl -w Makefile.PL