in reply to Re: Help encode_entities doesn't seem to work
in thread [SOLVED] -Help encode_entities doesn't seem to work

Thanks for your help

My need is to transform an excel file to XML. Like I have said and as I'm a newbie. I'm starting based on a script working for a non latin language and try to adapt it.


Of course I have seen my library to parse Excel file is not the best one, but like it seems working I have continue to use is.
Testlink it's my aim, I need to fill file to be able to upload some data. My target file for requirement is:

<?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement> <docid><![CDATA[ENG-0001]]></docid> <title><![CDATA[Terrestrial Propulsor]]></title> <description><![CDATA[]]></description> </requirement> <requirement> <docid><![CDATA[ENG-0002]]></docid> <title><![CDATA[Main Deflector]]></title> <description><![CDATA[<p>Maindeflector bla, bla, bla.</p>]]> </description> </requirement> </requirements>

Replies are listed 'Best First'.
Re^3: Help encode_entities doesn't seem to work
by haj (Vicar) on Feb 10, 2019 at 11:38 UTC

    Hello Balawoo,

    While writing this up, I saw that haukex beat me with a complete solution at Re^3: Help encode_entities doesn't seem to work. I agree that this is how the problem should be solved and therefore refrain from posting a copy of your script with minimal changes applied. Send me a message if you want this.

    The spreadsheet and code you have given in your response to haukex's article are very helpful to track that down. So here are the issues (many of which have already been pointed out by haukex in Re: Help encode_entities doesn't seem to work):

    • I downloaded your source file, and it turned out to be UTF-8. In that case, since you have non-ascii-characters in your source file, you must announce this to the Perl interpreter with use utf8;. But as the next point shows, you might get away without them anyway.
    • If you write an XML file as UTF-8, then you don't need to encode any characters to their entities (it doesn't work anyway, as haukex points out, because XML doesn't know about these named entities, nor does XML have a <br> element).
    • Now for the tricky part: The cells in your Excel sheet are encoded in a "native" character set ($cell->encoding returns 3), and it can be tricky to divine which native set. In your case, it seems that it is one of the encodings which is not detected and handled by Spreadsheet::ParseExcel. I got pretty far by assuming it is MacRoman, because then the cell F2 translates properly to Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624). Apparently you need to decode the values by yourself, fortunately the Encode module knows about MacRoman.
    • As I already wrote, you need to write your XML files in UTF-8-format if you declare it to be so, and I recommend to do the same for your intermediate text file though it isn't strictly necessary as long as all your data can be expressed in iso-latin-1 as well.
    • Finally, your resulting XML file is invalid. The reason is that you skip cell A1 (only this cell has (($cell_row_position == 0) and ($cell_col_position == 0)) while your comment says that you want to Skip cells from Row1 and Column A - reserved for Header and comments. I doubt about column A which contains the docid, so you probably just want to skip the first row.

    After applying all of these changes, I end up with the following XML file:

    <?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement><docid ><![CDATA[PP10-RG-010]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>1</node_order> <description><![CDATA[Le format et le contenu des 2 documents sont dé +crits dans la SFD XXX (JIRA 624).]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-020]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-030]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà<> aqwzsx]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> </requirements>
      Hello superdoc,

      Thanks for your reply.
      I have update the code to use .xlsx file with

      # STEP1: The data from XLS file is stored in temp TXT file my $parser = Spreadsheet::ParseXLSX->new();

      I haven't change anything
      In my text file, I have also updated the code like
      $mac = encode("utf-8", $cell_unformatted); print_txt "$row;;$col;;", $mac ,"\n";
      I'm sticked about the decode part. I don't see how to solve it.
      On my XML like I said, I have rearrange a script provide for another object. I would like to skip my first line, but I don't underwent how to do it.
      For my text, I need to encode accent like é on & eacute I have found how.
      Thanks for all
      Balawoo

        Hello Balawoo,

        I admit that I'm having some difficulties relating your attempts to my recommendations.

        If you change the format to XSLX files, then there'll be no more MacRoman encoding: All strings in XLSX files are formatted in UTF-8. Furthermore, you don't need to decode anything, because Spreadsheet::ParseXSLX will do that for you. So, you've found another way to get rid of that problem.

        Your method to create the text file in UTF-8 (encoding the individual cells and then write with Perl's default encoding) sort of works, but I would really recommend that you open the file for UTF-8 encoding like this:

        open (TXT, ">:encoding(UTF-8)", $txt) || die("Could not open file! $txt");

        Of course, you need to read this file as UTF-8 as well:

        open (SOURCE, "<:encoding(UTF-8)", $txt) || die ("Could not open file! $txt");

        You still haven't convinced me that you need to encode accents like é to &eacute. If you write &eacute to a XML file, you get an invalid XML file. If you want to have the string &eacute as literal content of the XML element, then you need to encode twice: Once to convert é to &eacute, and a second time (use encode_entities without a second parameter for this) to convert the & character to &amp;. In the XML file you'll then see &amp;eacute, but an XML processor will read it as &eacute. Note that you still need to get the use utf8; thing right if you want to pass your string literal as a second parameter to encode_entities.