Hello Balawoo,
While writing this up, I saw that haukex beat me to it with a complete solution at Re^3: Help encode_entities doesn't seem to work. I agree that this is how the problem should be solved, so I will refrain from posting a copy of your script with minimal changes applied. Send me a message if you want it.
The spreadsheet and code you gave in your response to haukex's article were very helpful in tracking this down.
So here are the issues (many of which have already been pointed out by haukex in Re: Help encode_entities doesn't seem to work):
- I downloaded your source file, and it turned out to be UTF-8. Since it contains non-ASCII characters, you must announce this to the Perl interpreter with use utf8;. But as the next point shows, you might get away without those characters anyway.
- If you write an XML file as UTF-8, then you don't need to encode any characters to their entities (it doesn't work anyway, as haukex points out, because XML doesn't know about these named entities, nor does XML have a <br> element).
- Now for the tricky part: The cells in your Excel sheet are encoded in a "native" character set ($cell->encoding returns 3), and it can be tricky to divine which native set. In your case, it seems to be one of the encodings that Spreadsheet::ParseExcel does not detect and handle. I got pretty far by assuming it is MacRoman, because then cell F2 translates properly to Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624). Apparently you need to decode the values yourself. Fortunately, the Encode module knows about MacRoman.
- As I already wrote, you need to write your XML files in UTF-8 if you declare them to be so. I recommend doing the same for your intermediate text file, though it isn't strictly necessary as long as all your data can be expressed in ISO Latin-1 as well.
- Finally, your resulting XML file is invalid. The reason is that you skip only cell A1 (only this cell satisfies (($cell_row_position == 0) and ($cell_col_position == 0))), while your comment says that you want to Skip cells from Row1 and Column A - reserved for Header and comments. I doubt you want to skip column A, which contains the docid, so you probably just want to skip the first row.
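To make the points above concrete, here is a minimal sketch rather than a drop-in replacement for your script. It assumes the cell bytes really are MacRoman (my guess from cell F2) and uses a hard-coded array in place of the parsed worksheet. The helpers cell_to_text and cdata, and the sample rows, are names I made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the raw cell bytes are MacRoman (in MacRoman, byte 0x8E is "é").
sub cell_to_text {
    my ($raw_bytes) = @_;
    return decode('MacRoman', $raw_bytes);   # bytes -> Perl character string
}

# Wrap a value in a CDATA section, splitting any "]]>" it may contain.
sub cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return "<![CDATA[$text]]>";
}

# Hypothetical stand-in for the worksheet: row 0 is the header row.
my @rows = (
    [ 'docid',       'description' ],
    [ 'PP10-RG-010', "d\x8Ecrits"  ],
);

# Encode on output so the file matches its encoding="UTF-8" declaration.
binmode STDOUT, ':encoding(UTF-8)';
print qq{<?xml version="1.0" encoding="UTF-8"?>\n<requirements>\n};

for my $row_idx (0 .. $#rows) {
    next if $row_idx == 0;    # skip the header row only; column A (docid) is kept
    my ($docid, $desc) = map { cell_to_text($_) } @{ $rows[$row_idx] };
    print "<requirement><docid>", cdata($docid), "</docid>\n";
    print "<description>", cdata($desc), "</description>\n";
    print "</requirement>\n";
}

print "</requirements>\n";
```

No entity encoding is needed here: the accented characters go out as plain UTF-8, and the CDATA sections take care of < and & in the cell text.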
After applying all of these changes, I end up with the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<requirements>
<requirement><docid ><![CDATA[PP10-RG-010]]></docid>
<title><![CDATA[MASTER DATA]]></title>
<version>1</version>
<revision> 1 </revision>
<node_order>1</node_order>
<description><![CDATA[Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624).]]></description>
<status><![CDATA[V]]></status>
<type><![CDATA[3]]></type>
<expected_coverage><![CDATA[1]]></expected_coverage>
</requirement>
<requirement><docid ><![CDATA[PP10-RG-020]]></docid>
<title><![CDATA[MASTER DATA]]></title>
<version>1</version>
<revision> 1 </revision>
<node_order>2</node_order>
<description><![CDATA[éiùûôêçà]]></description>
<status><![CDATA[V]]></status>
<type><![CDATA[3]]></type>
<expected_coverage><![CDATA[1]]></expected_coverage>
</requirement>
<requirement><docid ><![CDATA[PP10-RG-030]]></docid>
<title><![CDATA[MASTER DATA]]></title>
<version>1</version>
<revision> 1 </revision>
<node_order>2</node_order>
<description><![CDATA[éiùûôêçà<>
aqwzsx]]></description>
<status><![CDATA[V]]></status>
<type><![CDATA[3]]></type>
<expected_coverage><![CDATA[1]]></expected_coverage>
</requirement>
</requirements>