Sombrerero_loco has asked for the wisdom of the Perl Monks concerning the following question:

Hi there. I'm facing an encoding issue that I don't know how to fix. I'm reading records from one file and then reading another file. I try to match what I read from the second file against one of the records I read from the first; if a match is found, the code assigned to that record is used and written to the output file. When I read the data to be loaded into a hash with its code, ready to be compared against the data extracted from the files, the data is read correctly. I use a subroutine to convert every quote or accented letter to the same letter without the special character.
sub translator {
    $vNombre = $_[0];
    $vNombre = lc $vNombre;
    $vNombre =~ tr/áàèéìíòóùú/aaeeiioouu/;
    $vNombre =~ tr/ÀÁÈÉÌÍÒÓÙÚ/AAEEIIOOUU/;
    $vNombre =~ tr/"'/||/d;
    return $vNombre;
}
But when I read the XML as plain text (using a while (<FILEHANDLE>) loop), I send the data to the subroutine to format it, using this condition:
if ($vl_read =~ /<hwAssetUserField3 type="attrib">(.+?)<\/hwAssetUserField3>/) {
    $vNombre = $1;
    $vNombre = &translator($vNombre);
But when I print the data, it seems to treat the accents, quotes and so on not as an ó or á letter, or as "" or '', but as weird characters:
&apos;galicia vii&apos; when it should be "galicia vii"
or at least it should detect the " in the tr and change it to another character or (see the /d option) eliminate it, but it doesn't. Also, if it reads an accented vowel, it reads, for example:
centro de apoyo inform&#9500;ítico when it should read centro de apoyo informatico.
I'm using a Spanish computer, but it seems Perl is reading the XML file incorrectly, though not a txt file. Any ideas? Thanks!

Replies are listed 'Best First'.
Re: Problem with quotes, special characters and so on, reading a xml file
by almut (Canon) on Dec 29, 2009 at 11:01 UTC

    Not really sure, but maybe you're looking for something like HTML::Entities (or XML::Entities) to decode the entity representations of those special characters (such as &apos;), before applying your substitutions.

Re: Problems with XML encoding
by FalseVinylShrub (Chaplain) on Dec 29, 2009 at 11:03 UTC

    Hi

    How are you reading the XML file? It looks like you're just reading it in, not processing it as XML.

    XML allows characters to be encoded with &#1234; encoding. This will be converted by an XML Parser, but Perl does not understand these codes by itself.

    I would suggest using an XML parser module such as XML::LibXML; otherwise you are likely to run into similar little problems (for example, what if your hwAssetUserField3 element is split over multiple lines?).
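
    As a rough illustration of the parser route, here is a minimal sketch. It assumes XML::LibXML is installed from CPAN; the <assets> wrapper element and the sample values are made up for the example and are not taken from the original file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document. The <assets> wrapper is hypothetical; the second element
# is deliberately split over two lines.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<assets>
  <hwAssetUserField3 type="attrib">&apos;galicia vii&apos;</hwAssetUserField3>
  <hwAssetUserField3
      type="attrib">centro de apoyo inform&#225;tico</hwAssetUserField3>
</assets>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# textContent returns the element text with entities and numeric
# character references already decoded by the parser.
my @names = map { $_->textContent } $doc->findnodes('//hwAssetUserField3');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @names;
```

    The parser takes care of the entity decoding and of elements that span several lines, so the values in @names can go straight to the accent-stripping routine.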

    However, if you insist on doing it yourself, you could solve this particular problem with something like:

    # not recommended! ... or tested ;-)
    # (the /e flag is needed so that chr($1) is evaluated as code)
    $vNombre =~ s{&#([0-9]+);}{chr($1)}ge;

    I would really suggest finding an XML parser, or at least an XML character reference converter someone else has written, because you may also need to deal with hexadecimal (&#x0A0A;) and named character entities (&aacute;).
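
    To make the cases above concrete, here is a hedged, core-Perl-only sketch of such a converter. The name decode_xml_refs is hypothetical and not part of any module. It handles decimal and hexadecimal references plus the five entities predefined by XML; any other named entity (such as &aacute;, which is only valid in XML if a DTD declares it) is left untouched.

```perl
use strict;
use warnings;

# The five entities predefined by the XML specification.
my %XML_ENTITY = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
    apos => "'",
);

# decode_xml_refs is a hypothetical helper name used only for this sketch.
sub decode_xml_refs {
    my ($text) = @_;
    # One pass over the string, so "&amp;#225;" becomes "&#225;" and is
    # not decoded a second time.
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
        defined $1                ? chr(hex $1)       # hexadecimal reference
      : defined $2                ? chr($2)           # decimal reference
      : exists $XML_ENTITY{$3}    ? $XML_ENTITY{$3}   # predefined entity
      :                             "&$3;"            # unknown name: keep as is
    }ge;
    return $text;
}

my $decoded = decode_xml_refs('&apos;galicia vii&apos; inform&#225;tico');
binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

    This is still a sketch rather than a substitute for a parser: it knows nothing about CDATA sections, comments or entities declared in a DTD.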

    FalseVinylShrub

    Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

      Hi. I don't really need to read it as XML, because I only want to do some substitutions. This is the weird line as it appears in the XML file:
      <hwAssetUserField3 type="attrib">CENTRO DE APOYO INFORMáTICO </hwAssetUserField3>
      As you can see, in the XML file it seems to be in a valid format. I don't care about the encoding because I'm reading the file as a normal file, not as an XML file; that is, line by line, to do some "raw" operations and rewrite it to another file. Thanks anyway.

        Hi

        Hmm, in that case I think I misunderstood your problem. Though I still think you should use some XML technology ;-) If you are doing simple substitutions, could you do it using XSLT?

        However, perhaps your problem is not with XML representations but with reading Unicode in. Assuming you're using Perl v5.8-v5.10, how are you opening the file? You need to tell Perl the encoding - presumably UTF-8.

        You can do this in a number of ways:
        # use binmode on the filehandle
        open my $fh, '<', "file" or die "... $!";
        binmode $fh, ':utf8';

        # open $fh for reading UTF-8
        open(my $fh, "<:encoding(UTF-8)", "file") or die "... $!";

        # Use the open pragma to open all input files as UTF-8
        # see http://perldoc.perl.org/open.html
        use open IN => ':utf8';

        # or you can manually use ... (decode_utf8 comes from the Encode module)
        use Encode;
        $str = decode_utf8( $str ); # on each data item

        In your case it is easiest to use binmode on the filehandle, at least to find out whether this is the problem.
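
        A small self-contained sketch of why decoding on read matters for tr, using only core modules. Several things here are assumptions for illustration: an in-memory filehandle stands in for the real XML file, the translator shown is a variation on the one in the question (lexical variable, quotes deleted), and code page 850 is a guess at the console encoding on a Spanish Windows machine.

```perl
use strict;
use warnings;
use utf8;                        # accented literals in this source are UTF-8
use Encode qw(encode decode);

# The UTF-8 bytes a file on disk would hold for the sample value.
my $bytes = encode('UTF-8', "CENTRO DE APOYO INFORMáTICO\n");

# An in-memory filehandle stands in for the real file (assumption).
open my $fh, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Variation on the translator from the question: works on characters,
# so it needs decoded input; quotes are deleted with tr///d.
sub translator {
    my ($name) = @_;
    $name = lc $name;
    $name =~ tr/áàèéìíòóùú/aaeeiioouu/;
    $name =~ tr/"'//d;
    return $name;
}

my $clean = translator($line);   # "centro de apoyo informatico"

# What the two UTF-8 bytes of "á" look like when shown as CP850 text.
my $mojibake = decode('cp850', encode('UTF-8', 'á'));

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
printf "U+%04X U+%04X\n", map { ord } split //, $mojibake;
```

        The last line prints the code points U+251C and U+00ED, that is "├í". U+251C is 9500 in decimal, which matches the &#9500;í in the question, so the output there is at least consistent with undecoded UTF-8 bytes being displayed in a DOS code page.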

        There are many documents trying to explain Unicode in Perl. I quite like this one. Be aware that Unicode support and the surrounding issues have changed quite a lot between versions. v5.6 is completely different from the above, for example.

        FalseVinylShrub


        No matter whether you want to extract data or do some transformations you should NOT attempt to do it without an XML parser. If XSLT looks incomprehensible to you (it does to me) and XML::LibXML::SAX as well, try for example XML::Twig or XML::Rules. Maybe one of them will make sense to you. There are examples on this site and elsewhere.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.