in reply to [SOLVED] -Help encode_entities doesn't seem to work

In addition to what haukex wrote: there are some items in your code which you might want to inspect.

The character literal "àéèçûîùô" can be expressed in either UTF-8 or iso-latin-1. If you save your source file containing that literal as UTF-8, then you must also use utf8;, if you save it as iso-latin-1, you must not use utf8.

Applied to your problem: In your simple example, both the characters given to encode_entities and the string to encode are in the same file, so they have the same encoding. In your longer program, the list of characters to encode is in your source code, but the characters to encode are coming from your text file. Therefore there is a chance of a mismatch, and as a first guess which explains your symptoms I'd say that your source file is saved as with UTF-8 encoding but without use utf8;.

You also should be aware that Excel files in those "old" formats are usually not UTF-8 encoded. You don't need to take care for this because Spreadsheet::ParseExcel does it for you. But you need to take care for the format you are writing: By not specifying an encoding for both your text and XML files, you get Perl's default, iso-latin-1 encoding. That doesn't hurt for the text file, since for the characters in question you should not see warnings like "Wide character in print". On the other hand, XML files are, per default or in your case per explicit declaration, UTF-8 encoded. To get that right, you should open the XML file like this:

open (XML, ">:encoding(UTF8",  $xml) || die("Could not open file! $xml: '$!'");

BTW: In contrast to haukex, I don't think that Devel::Peek or the UTF-8-flag are very helpful in hunting down these problems unless you are programming on XS level.

Replies are listed 'Best First'.
Re^2: Help encode_entities doesn't seem to work
by Balawoo (Novice) on Feb 10, 2019 at 09:42 UTC

    Thanks for your help

    My need is to transform an excel file to XML. Like I have said and as I'm a newbie. I'm starting based on a script working for a non latin language and try to adapt it.


    Of course I have seen my library to parse Excel file is not the best one, but like it seems working I have continue to use is.
    Testlink it's my aim, I need to fill file to be able to upload some data. My target file for requirement is:

    <?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement> <docid><![CDATA[ENG-0001]]></docid> <title><![CDATA[Terrestrial Propulsor]]></title> <description><![CDATA[]]></description> </requirement> <requirement> <docid><![CDATA[ENG-0002]]></docid> <title><![CDATA[Main Deflector]]></title> <description><![CDATA[<p>Maindeflector bla, bla, bla.</p>]]> </description> </requirement> </requirements>

      Hello Balawoo,

      While writing this up, I saw that haukex beat me with a complete solution at Re^3: Help encode_entities doesn't seem to work. I agree that this is how the problem should be solved and therefore refrain from posting a copy of your script with minimal changes applied. Send me a message if you want this.

      The spreadsheet and code you have given in your response to haukex's article are very helpful to track that down. So here are the issues (many of which have already been pointed out by haukex in Re: Help encode_entities doesn't seem to work):

      • I downloaded your source file, and it turned out to be UTF-8. In that case, since you have non-ascii-characters in your source file, you must announce this to the Perl interpreter with use utf8;. But as the next point shows, you might get away without them anyway.
      • If you write an XML file as UTF-8, then you don't need to encode any characters to their entities (it doesn't work anyway, as haukex points out, because XML doesn't know about these named entities, nor does XML have a <br> element).
      • Now for the tricky part: The cells in your Excel sheet are encoded in a "native" character set ($cell->encoding returns 3), and it can be tricky to divine which native set. In your case, it seems that it is one of the encodings which is not detected and handled by Spreadsheet::ParseExcel. I got pretty far by assuming it is MacRoman, because then the cell F2 translates properly to Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624). Apparently you need to decode the values by yourself, fortunately the Encode module knows about MacRoman.
      • As I already wrote, you need to write your XML files in UTF-8-format if you declare it to be so, and I recommend to do the same for your intermediate text file though it isn't strictly necessary as long as all your data can be expressed in iso-latin-1 as well.
      • Finally, your resulting XML file is invalid. The reason is that you skip cell A1 (only this cell has (($cell_row_position == 0) and ($cell_col_position == 0)) while your comment says that you want to Skip cells from Row1 and Column A - reserved for Header and comments. I doubt about column A which contains the docid, so you probably just want to skip the first row.

      After applying all of these changes, I end up with the following XML file:

      <?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement><docid ><![CDATA[PP10-RG-010]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>1</node_order> <description><![CDATA[Le format et le contenu des 2 documents sont dé +crits dans la SFD XXX (JIRA 624).]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-020]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-030]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà<> aqwzsx]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> </requirements>
        Hello superdoc,

        Thanks for your reply.
        I have update the code to use .xlsx file with

        # STEP1: The data from XLS file is stored in temp TXT file my $parser = Spreadsheet::ParseXLSX->new();

        I haven't change anything
        In my text file, I have also updated the code like
        $mac = encode("utf-8", $cell_unformatted); print_txt "$row;;$col;;", $mac ,"\n";
        I'm sticked about the decode part. I don't see how to solve it.
        On my XML like I said, I have rearrange a script provide for another object. I would like to skip my first line, but I don't underwent how to do it.
        For my text, I need to encode accent like é on & eacute I have found how.
        Thanks for all
        Balawoo