although I'm not sure what U+00ED means.

The Unicode character whose code point is 00ED in hexadecimal (237 in decimal): í.

What is this style of encoding called so I can do a google on

XML. Specifically, it's an XML numeric character reference (loosely called an entity).

Is there a function like the decode function which will parse a variable and replace these strings with the correct unicode characters?

It already is the correct Unicode character; &#237; is simply another way of writing it in XML.
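
To make the equivalence concrete, here is a minimal sketch showing that the decimal reference &#237; and the notation U+00ED name the same code point. Only the numbers 237 and 00ED come from the thread; the variable names are illustrative.

```perl
use strict;
use warnings;

# &#237; is a decimal character reference; U+00ED is the same code point in hex.
my $from_decimal = chr(237);
my $from_hex     = chr(hex('00ED'));

printf("U+%04X\n", ord($from_decimal));                        # U+00ED
print($from_decimal eq $from_hex ? "same\n" : "different\n");  # same
```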

But if you wish to expand the entities, an easy way is to parse and re-serialize the document with XML::LibXML, since it doesn't emit entities on output unless they're required.

use strict;
use warnings;
use XML::LibXML qw( );

my $xml = '<?xml version="1.0"?><root>&#237;</root>';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);
$doc->setEncoding('UTF-8');
open(my $fh, '>:bytes', 'xml') or die;
print($fh $doc->toString);
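
If you want to confirm what the parser did without writing a file, something along these lines may help. It is only a sketch: it assumes XML::LibXML is installed and reuses the same sample document, and the checks it prints are illustrative rather than part of the snippet above.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

my $xml = '<?xml version="1.0"?><root>&#237;</root>';
my $doc = XML::LibXML->new()->parse_string($xml);

# textContent returns the element's text as a Perl character string,
# with character references already expanded by the parser.
my $text = $doc->documentElement()->textContent();
printf("U+%04X\n", ord($text));

# After setEncoding, toString returns bytes in that encoding, so the
# UTF-8 byte pair for U+00ED (C3 AD) should appear in the output.
$doc->setEncoding('UTF-8');
my $bytes = $doc->toString();
print(index($bytes, "\xC3\xAD") >= 0
    ? "UTF-8 bytes for U+00ED found\n"
    : "UTF-8 bytes for U+00ED not found\n");
```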

Re^5: Do I have a unicode problem, or is this something else?
by Steve_BZ (Chaplain) on Jun 11, 2010 at 18:01 UTC

    Hi ikegami,

    Thanks very much for your reply. I didn't have XML::LibXML installed on my PC (Kubuntu), so I went into cpan and tried to install it, but cpan is complaining:

    It says there is no Makefile, and it's right. So I went into the directory. There is a Makefile.PL, so I executed it and I got "Makefile.PL: command not found":

    root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX# dir
    Av_CharPtrPtr.c  Changes  docs   dom.h    lib        LibXML.pod  LICENSE      MANIFEST  perl-libxml-mm.c  perl-libxml-sax.c  ppport.h  t     TODO     xpath.c  xpath.h
    Av_CharPtrPtr.h  debian   dom.c  example  LibXML.pm  LibXML.xs   Makefile.PL  META.yml  perl-libxml-mm.h  perl-libxml-sax.h  README    test  typemap  xpathcontext.h
    root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX# Makefile.PL
    Makefile.PL: command not found
    root@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX#

    I'm now looking for another parser. Maybe I could just use a regular expression?

    Update: I've tried this regular expression and it seems to work.

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $xml = '<?xml version="1.0"?><root>&#237;</root>';
    print($xml,"\n");
    $xml =~ s/\&\#(\d*);/chr($1)/gse;
    print($xml,"\n");

    So thanks again for pointing me in the right direction, ikegami, as always.

    Regards

    Steve

      Two major problems:

      • You didn't encode the character using the XML's encoding before inserting it in the XML. You didn't even check if the XML's encoding could encode the character.
      • You're decoding entities that were encoded because the character they represent would break the XML if present (e.g. &#38;, which stands for &).

      Your solution also has some potential bugs.

      • You convert what appear to be entities in CDATA sections. (Your XML generator might not produce these.)
      • You don't expand &iacute;. (Your XML generator might not produce these.)
      • You don't expand &#xED;. (Your XML generator might not produce these.)
      • The XML is encoded, but you try to match against it as if it were text. (You're matching ASCII characters, and your XML generator might always use an ASCII-derived encoding.)

      On the stylistic side,

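      • \d is way too encompassing, since it matches any Unicode digit. You want [0-9]. (I'm listing this as stylistic since it won't be an issue with valid XML.)
      • You have a useless modifier (/s) on your substitution operator; it only changes what . matches, and your pattern has no dot.
      • You have useless escapes in your pattern. Neither & nor # needs a backslash.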

      Update: Added second major problem.
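
      For illustration only, here is a sketch of a substitution that addresses some of the points above: it uses [0-9], drops the useless escapes and modifier, accepts hex references such as &#xED;, and leaves references to markup-significant characters alone. The helper name expand_char_refs and the %keep table are hypothetical. It assumes $xml already holds decoded text and that the result will later be written in an encoding able to represent every expanded character. It still ignores CDATA sections and named entities such as &iacute;, so a real parser remains the robust route.

      ```perl
      use strict;
      use warnings;

      # Code points that must stay escaped so the markup remains well-formed.
      my %keep = map { ord($_) => 1 } ('&', '<', '>', '"', "'");

      # Hypothetical helper: expand decimal (&#237;) and hex (&#xED;) references.
      sub expand_char_refs {
          my ($xml) = @_;
          # $1 = whole reference, $2 = hex digits (if any), $3 = decimal digits (if any)
          $xml =~ s{(&#(?:x([0-9A-Fa-f]+)|([0-9]+));)}{
              my $cp = defined($2) ? hex($2) : $3 + 0;
              $keep{$cp} ? $1 : chr($cp);
          }ge;
          return $xml;
      }

      binmode(STDOUT, ':encoding(UTF-8)');
      print(expand_char_refs('<root>&#237; &#xED; &#38; &#x3C;</root>'), "\n");
      ```

      With the sample input, the two references to U+00ED are expanded while &#38; and &#x3C; are left untouched.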

        Hi ikegami,

        Thanks for this. I'll start with the end first.

        The style points: I'm never quite sure which characters need escape sequences and which don't, so thanks for the clarifications there. All good points and I'll incorporate them.

        Potential bugs: it's true. Not to mention '&amp;' etc. But I don't expect to see these here, and if I do, they'll come up in testing. I'm not sure I understand the last point.

        Intro: I don't have any control over the generation. It's done for me, so I can't do it any other way (unless you think I can).

        Have a good day.

        Regards

        Steve

      Most likely, Makefile.PL is not executable. You are supposed to run it like this:

      perl -w Makefile.PL

        Hi Corion,

        Nice to hear from you. I think that ikegami is probably right and I need to do this XML thing properly. I tried the command line you suggested and it gave me this error:

        steve@steve-desktop:~/.cpan/build/XML-LibXML-1.70-XzsnvX$ perl -w Makefile.PL
        Name "main::is_win32" used only once: possible typo at Makefile.PL line 263.
        enable native perl UTF8
        running xml2-config... using fallback values for LIBS and INC
        options:
          LIBS='-L/usr/local/lib -L/usr/lib -lxml2 -lm'
          INC='-I/usr/local/include -I/usr/include'
        If this is wrong, Re-run as:
          $ /usr/bin/perl Makefile.PL LIBS='-L/path/to/lib' INC='-I/path/to/include'
        looking for -lxml2... no
        looking for -llibxml2... no
        libxml2 not found
        Try setting LIBS and INC values on the command line
        Or get libxml2 from
          http://xmlsoft.org/
        If you install via RPMs, make sure you also install the -devel
        RPMs, as this is where the headers (.h files) are.
        Also, you may try to run perl Makefile.PL with the DEBUG=1 parameter
        to see the exact reason why the detection of libxml2 installation
        failed or why Makefile.PL was not able to compile a test program.

        Maybe this only runs on Windows? What do you think?

        Regards

        Steve