in reply to Re^2: Help encode_entities doesn't seem to work
in thread [SOLVED] -Help encode_entities doesn't seem to work

Hello Balawoo,

While writing this up, I saw that haukex beat me with a complete solution at Re^3: Help encode_entities doesn't seem to work. I agree that this is how the problem should be solved and therefore refrain from posting a copy of your script with minimal changes applied. Send me a message if you want this.

The spreadsheet and code you posted in your response to haukex's article were very helpful in tracking the problem down. So here are the issues (many of which have already been pointed out by haukex in Re: Help encode_entities doesn't seem to work):

After applying all of these changes, I end up with the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<requirements>
  <requirement>
    <docid ><![CDATA[PP10-RG-010]]></docid>
    <title><![CDATA[MASTER DATA]]></title>
    <version>1</version>
    <revision> 1 </revision>
    <node_order>1</node_order>
    <description><![CDATA[Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624).]]></description>
    <status><![CDATA[V]]></status>
    <type><![CDATA[3]]></type>
    <expected_coverage><![CDATA[1]]></expected_coverage>
  </requirement>
  <requirement>
    <docid ><![CDATA[PP10-RG-020]]></docid>
    <title><![CDATA[MASTER DATA]]></title>
    <version>1</version>
    <revision> 1 </revision>
    <node_order>2</node_order>
    <description><![CDATA[éiùûôêçà]]></description>
    <status><![CDATA[V]]></status>
    <type><![CDATA[3]]></type>
    <expected_coverage><![CDATA[1]]></expected_coverage>
  </requirement>
  <requirement>
    <docid ><![CDATA[PP10-RG-030]]></docid>
    <title><![CDATA[MASTER DATA]]></title>
    <version>1</version>
    <revision> 1 </revision>
    <node_order>2</node_order>
    <description><![CDATA[éiùûôêçà<> aqwzsx]]></description>
    <status><![CDATA[V]]></status>
    <type><![CDATA[3]]></type>
    <expected_coverage><![CDATA[1]]></expected_coverage>
  </requirement>
</requirements>

Re^4: Help encode_entities doesn't seem to work
by Balawoo (Novice) on Feb 10, 2019 at 16:50 UTC
    Hello superdoc,

    Thanks for your reply.
    I have updated the code to use an .xlsx file with

    # STEP1: The data from XLS file is stored in temp TXT file
    my $parser = Spreadsheet::ParseXLSX->new();

    I haven't changed anything else.
    For my text file, I have also updated the code like this:
    $mac = encode("utf-8", $cell_unformatted);
    print_txt "$row;;$col;;", $mac, "\n";
    I'm stuck on the decode part; I don't see how to solve it.
    As for my XML, as I said, I have rearranged a script that was provided for another purpose. I would like to skip my first line, but I don't understand how to do it.
    For my text, I need to encode accents like é as &eacute; and I have found out how.
    Thanks for all
    Balawoo

      Hello Balawoo,

      I admit that I'm having some difficulties relating your attempts to my recommendations.

      If you change the format to XLSX files, then there'll be no more MacRoman encoding: all strings in XLSX files are stored as UTF-8. Furthermore, you don't need to decode anything, because Spreadsheet::ParseXLSX will do that for you. So you've found another way to get rid of that problem.

      Your method of creating the text file in UTF-8 (encoding the individual cells and then writing them with Perl's default encoding) sort of works, but I would really recommend that you open the file with a UTF-8 encoding layer like this:

      open(TXT, ">:encoding(UTF-8)", $txt) or die "Could not open file $txt: $!";

      Of course, you need to read this file as UTF-8 as well:

      open(SOURCE, "<:encoding(UTF-8)", $txt) or die "Could not open file $txt: $!";
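
      To illustrate the round trip, here is a minimal sketch, not your script: it assumes a temporary file in place of your TXT file, lexical filehandles instead of TXT and SOURCE, and a made-up cell value. The point is that with the encoding layer on both sides, the string you read back is the same character string you wrote, with no manual encode or decode calls.

```perl
use strict;
use warnings;
use utf8;                         # string literals in this source file are UTF-8
use File::Temp qw(tempfile);

# Assumption: a temporary file stands in for the intermediate TXT file.
my ($tmp_fh, $txt) = tempfile(UNLINK => 1);
close $tmp_fh;

# A decoded character string, as the spreadsheet parser would return it.
my $cell = "décrits";

# Writing: the :encoding(UTF-8) layer turns characters into bytes.
open(my $out, ">:encoding(UTF-8)", $txt) or die "Could not open file $txt: $!";
print $out "0;;0;;$cell\n";
close $out;

# Reading: the same layer turns the bytes back into characters.
open(my $in, "<:encoding(UTF-8)", $txt) or die "Could not open file $txt: $!";
my $line = <$in>;
close $in;
chomp $line;

# Split the "row;;col;;value" record back into its fields.
my (undef, undef, $value) = split /;;/, $line;
```

      After this, $value holds characters again, so it can be handed to the XML-writing step unchanged, provided the XML file is also opened with an encoding layer.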

      You still haven't convinced me that you need to encode accents like é to &eacute;. If you write &eacute; to an XML file, you get an invalid XML file, because XML does not predefine that entity. If you want to have the string &eacute; as literal content of the XML element, then you need to encode twice: once to convert é to &eacute;, and a second time (use encode_entities without a second parameter for this) to convert the & character to &amp;. In the XML file you'll then see &amp;eacute;, but an XML processor will read it as &eacute;. Note that you still need to get use utf8; right (and save the source file as UTF-8) if you want to pass your string literal as the second parameter to encode_entities.
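
      The two passes could look roughly like the following sketch. It assumes HTML::Entities from CPAN is installed; the sample text and the list of accented characters are invented for illustration and are not taken from your script.

```perl
use strict;
use warnings;
use utf8;                               # needed so the literals below are character strings
use HTML::Entities qw(encode_entities);

# Assumption: sample text made up for this example.
my $text = "décrits <JIRA 624>";

# Pass 1: encode only the characters listed in the second parameter.
# The markup characters < and > are left alone here.
my $pass1 = encode_entities($text, 'éèêàùûôç');

# Pass 2: no second parameter, so the default set is encoded,
# which includes &, < and >.
my $pass2 = encode_entities($pass1);

# Make sure the accented output is printed as UTF-8.
binmode(STDOUT, ":encoding(UTF-8)");
print "$pass1\n$pass2\n";
```

      The second line printed is what would go into the XML file; an XML processor reading it gets the first line back as literal text.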