in reply to Problem with quotes, special characters and so on, reading a xml file

Hi

How are you reading the XML file? It looks like you're just reading it in, not processing it as XML.

XML allows characters to be encoded as numeric character references such as &#1234;. An XML parser converts these for you, but Perl does not understand them by itself.

I would suggest using an XML parser module, such as XML::LibXML; otherwise you are likely to run into similar little problems (for example: what if your hwAssetUserField3 element is split over multiple lines?).
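
As a rough sketch of what the parser route could look like (not run against your real file): the <asset> wrapper and the sample text below are made up for illustration, and I am assuming the reference in your data is the decimal form &#225;. It only uses load_xml and findvalue from XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: the element is split over two lines and
# contains a numeric character reference for "á".
my $xml = <<'XML';
<asset>
  <hwAssetUserField3 type="attrib">CENTRO DE APOYO
INFORM&#225;TICO </hwAssetUserField3>
</asset>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# findvalue() returns the element's text with references already decoded.
my $field = $doc->findvalue('//hwAssetUserField3');

# Collapse runs of whitespace (including the line break), then trim.
(my $clean = $field) =~ s/\s+/ /g;
$clean =~ s/^\s+|\s+$//g;

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```

The parser deals with both the line break inside the element and the character reference, which is exactly the kind of detail a line-by-line regex tends to miss.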

However, if you insist on doing it yourself, you could solve this particular problem with something like:

# not recommended! ... or tested ;-)
# the /e flag makes Perl evaluate chr($1) instead of inserting it literally
$vNombre =~ s{&#([0-9]+);}{chr($1)}ge;

I would really suggest finding an XML parser, or at least an XML character reference converter someone else has written, because you may also need to deal with hexadecimal references (&#x0A0A;) and named character entities (&aacute;).
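
To give an idea of how quickly the do-it-yourself version grows, here is a minimal sketch that handles decimal, hexadecimal and a few named references in one pass. The names decode_refs and %named are hypothetical, the entity table is a tiny illustrative subset, and it knows nothing about CDATA sections, comments or DTD-defined entities, so treat it as a demonstration rather than a replacement for a parser.

```perl
use strict;
use warnings;

# Illustrative subset only. XML itself predefines just amp, lt, gt,
# quot and apos; names such as "aacute" come from HTML/DTD entity sets.
my %named = (
    amp    => '&',
    lt     => '<',
    gt     => '>',
    quot   => '"',
    apos   => "'",
    aacute => chr(0xE1),
);

# Decode character references in a single pass over the text.
# Unknown entity names are left exactly as they were.
sub decode_refs {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
          defined $1        ? chr(hex $1)
        : defined $2        ? chr($2)
        : exists $named{$3} ? $named{$3}
        :                     "&$3;"
    }ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_refs('CENTRO DE APOYO INFORM&#225;TICO'), "\n";
```

Doing everything in one substitution matters: decoding &amp; first and numeric references afterwards in a second pass would wrongly turn &amp;#225; into á.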

FalseVinylShrub

Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

Re^2: Problems with XML encoding
by Sombrerero_loco (Beadle) on Dec 29, 2009 at 11:49 UTC
    Hi. I don't really need to read it as XML, because I only want to do some substitutions. This is the odd line as it appears in the XML file:
    <hwAssetUserField3 type="attrib">CENTRO DE APOYO INFORM&#225;TICO </hwAssetUserField3>
    As you can see, it looks valid in the XML file. I don't care about the encoding because I'm reading the file as a normal file, not as XML; that is, line by line, doing some "raw" operations and writing the result to another file. Thanks anyway.

      Hi

      Hmm, in that case I think I misunderstood your problem. Though I still think you should use some XML technology ;-) If you are doing simple substitutions, could you do it using XSLT?

      However, perhaps your problem is not with XML representations but with reading Unicode in. Assuming you're using Perl v5.8-v5.10, how are you opening the file? You need to tell Perl the encoding - presumably UTF-8.

      You can do this in a number of ways:
      # use binmode on the filehandle
      open my $fh, '<', "file" or die "... $!";
      binmode $fh, ':utf8';

      # open $fh for reading UTF-8
      open(my $fh, "<:encoding(UTF-8)", "file") or die "... $!";

      # Use the open pragma to open all input files as UTF-8
      # see http://perldoc.perl.org/open.html
      use open IN => ':utf8';

      # or you can manually use ...
      use Encode qw(decode_utf8);   # decode_utf8 comes from the Encode module
      $str = decode_utf8( $str ); # on each data item

      In your case, it is easiest to use binmode on the filehandle, at least to find out whether this is the problem.
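
      For the line-by-line "raw" approach, a minimal sketch could look like the following. The file names, the scratch directory and the substitution itself are made up for illustration, and it uses the :encoding(UTF-8) layer on both handles rather than binmode, on the assumption that the file really is UTF-8.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical file names in a scratch directory, for illustration only.
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = File::Spec->catfile($dir, 'assets.xml');
my $out_path = File::Spec->catfile($dir, 'assets.out.xml');

# Create a small UTF-8 input file so the example is self-contained.
open my $seed, '>:encoding(UTF-8)', $in_path or die "Cannot write $in_path: $!";
print {$seed} qq{<hwAssetUserField3 type="attrib">CENTRO DE APOYO INFORM\x{E1}TICO </hwAssetUserField3>\n};
close $seed or die "Cannot close $in_path: $!";

# Decode on input and encode on output, so the program works with characters.
open my $in,  '<:encoding(UTF-8)', $in_path  or die "Cannot read $in_path: $!";
open my $out, '>:encoding(UTF-8)', $out_path or die "Cannot write $out_path: $!";

while (my $line = <$in>) {
    # With decoded input, \x{E1} matches one character (á), not raw bytes.
    $line =~ s/INFORM\x{E1}TICO/INFORMATICO/g;
    print {$out} $line;
}

close $in;
close $out or die "Cannot close $out_path: $!";
```

      If the input handle is opened without a layer, the same pattern will not match, because á arrives as two separate bytes instead of one character; that is a quick way to check whether encoding is the problem.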

      There are many documents that try to explain Unicode in Perl. I quite like this one. Be aware that Unicode support and the surrounding issues have changed quite a lot between versions. v5.6 is completely different from the above, for example.

      FalseVinylShrub

      No matter whether you want to extract data or do some transformations, you should NOT attempt to do it without an XML parser. If XSLT looks incomprehensible to you (it does to me), and XML::LibXML::SAX as well, try for example XML::Twig or XML::Rules. Maybe one of them will make sense to you. There are examples on this site and elsewhere.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.