Balawoo has asked for the wisdom of the Perl Monks concerning the following question:

hello I'm a newbie and French. I need to convert and Excel file in XML file with text in HTML. So I need to convert accent in HTML. For information, this post is double in a French website

https://www.developpez.net/forums/d1941017/autres-langages/perl/langage/impossible-convertir-cellule-html-accents-convertis/ When I do a test script like this one, it's working

#!/usr/bin/perl use HTML::Entities; use warnings; #use utf8; #binmode(stdout => utf8); my $unsafe_chars = "< & >"; my $string="<br>àéèçûîùô<>"; print $string, "\n"; print encode_entities($string, "àéèçûîùô");

When I do the same based on an excel file

#!/usr/bin/perl -w ####################################################### # # Author: Polina Antipova aka Kitsune # # What it does: # This script converts XLS files to XML format compatible with Tes +tlink. # # More info at https://bitbucket.org/kitsuneo/xls2testlink-script/wiki # Suggestion/bugs are also accepted at polina.antipova@gmail.com # # !! Script is very data sensible. Please, be careful with input XLS f +ile !! # ######################################################## use warnings; use Encode qw/encode decode/; use Spreadsheet::ParseExcel; use File::Basename; use feature qw(say); use feature "switch"; #use Text::EtText::EtText2HTML; #use HTML::TextToHTML; use HTML::Entities; sub log_and_print { print (scalar(localtime()), " ", @_, "\n"); print LOGFILE (scalar(localtime()), " ", @_, "\n"); } sub print_log { print LOGFILE (scalar(localtime()), " ", @_, "\n"); } sub print_txt { print TXT (@_); } sub print_xml { print XML (@_); } # Source file and path are taken as script parameter if (@ARGV) { $full_source_file_path = $ARGV[0]; } else { die (" ! Please, enter file name as script parameter. \n See READM +E.txt for mor details"); } # Parse @ARGV, get path and file from script parameter ($source_file_name, $source_file_path, $source_file_suffix) = filepars +e($full_source_file_path, qr/\.[^.]*/); # Define all files and paths $source_file = "$source_file_name$source_file_suffix"; $logs_dir = "logs"; $converted_dir = "converted_files"; # Create folders for logs and converted files chdir ($source_file_path); mkdir ($logs_dir); mkdir ($converted_dir); # Output files get names after source file $txt_file = ("$source_file_name\_parsed.txt"); $xml_file = ("$source_file_name\_resulted.xml"); $file_log = ("$source_file_name\_debug.log"); # Files + folders knocked into variables $txt = "$converted_dir/$txt_file"; $xml = "$converted_dir/$xml_file"; $log = "$logs_dir/$file_log"; open (LOGFILE, "> $log") || die("Could not open file! $log"); open (TXT, "> $txt") || die("Could not open file! $txt"); log_and_print ("**************"); log_and_print ("Start working"); log_and_print ("STEP1: converting $source_file to TXT \n"); # see http://search.cpan.org/~jmcnamara/Spreadsheet-ParseExcel-0.59/li +b/Spreadsheet/ParseExcel.pm#SYNOPSIS # STEP1: The data from XLS file is stored in temp TXT file my $parser = Spreadsheet::ParseExcel->new(); my $workbook = $parser->parse($source_file); if ( !defined $workbook ) { die $parser->error(), ".\n"; } for my $worksheet ( $workbook->worksheets() ) { my ( $row_min, $row_max ) = $worksheet->row_range(); my ( $col_min, $col_max ) = $worksheet->col_range(); for my $row ( $row_min .. $row_max ) { for my $col ( $col_min .. $col_max ) { my $cell = $worksheet->get_cell( $row, $col ); next unless $cell; # Every cell is readed as row in .txt. Output format "R +ow, Col, value" $cell_unformatted = $cell->unformatted(); $cell_unformatted =~ s/\n/<br>/g; print_txt "$row;;$col;;", $cell_unformatted ,"\n"; } } } close(TXT); log_and_print ("STEP1: OK \n Result: file $txt is created \n"); log_and_print ("STEP2: converting $txt to XML \n"); # HERE THE KINGDOM OF CONVERTING BEGINS open (XML, "> $xml") || die("Could not open file! $xml"); # Read TXT to an array open (SOURCE, "$txt") || die ("Could not open file! $txt"); @parsed_data = <SOURCE>; close (SOURCE); # Initialize auxiliary variables $source_row_counter = 0; $commented = "//"; $header_of_xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<req +uirements>\n"; $end_of_kw = "</requirements>"; # Print first row of XML file print_xml $header_of_xml; # Read every row from .txt array. foreach $file_row (@parsed_data){ chomp $file_row; # Cell is processed according to its positions and value @splited_row = split(/;;/, $file_row); $cell_row_position = $splited_row[0]; $cell_col_position = $splited_row[1]; $cell_value = $splited_row[2]; print_log ("Debug: Current position: row - $cell_row_position, col + - $cell_col_position"); # First cell in excel file is expected to be mark of new test case + (one test case - one excel row) # If cell is first, all test case variable are cleaned up if ($cell_col_position == 0){ $test_requirement = ""; $tc_docid_tag = ""; $tc_title_tag = ""; $tc_version_tag = ""; $tc_revision_tag = ""; $tc_node_order_tag = ""; $tc_description_tag = ""; $tc_status_tag = ""; $tc_type_tag = ""; $tc_coverage_tag = ""; # TODO shift positions in arrays: in excel rows starts from 1, + in parsed excel from 0 print_log ("Debug: Start accumulate test case at row position +$cell_row_position"); } # Each value is checked if it is commented or not # Commented if cell is empty or first cell symbol is double slash +"//" no warnings; $if_commented = substr $cell_value, 0, 2; # Skip cells from Row1 and Column A - reserved for Header and comm +ents if (($cell_row_position == 0) and ($cell_col_position == 0)) { print_log ("File: Skip \"Header or Comments\". Cell position: +row - $cell_row_position, col - $cell_col_position"); } # Skip empty cells # Check that 1st test case has a test suite name (!mandatory) # Skip cells started with "//" elsif ($if_commented eq $commented) { print_log ("File: Skip empty cell. Cell begins with \"$if_comm +ented\", cell position: row - $cell_row_position, col - $cell_col_pos +ition"); log_and_print ("! File: Test suite (row2) must not be commente +d, otherwise .xml will be broken!") if (($cell_row_position == 1)&& ( +$cell_col_position ==1)); log_and_print ("! File: Test case name (row", $cell_row_positi +on+1, ") must not be commented, otherwise .xml will be broken!") if ( +$cell_col_position == 2); } # Accumulate test case record from all other values # Script use hard-coded values (columns numbers) and expects preci +se work with source file else { given ($cell_col_position) { # Block <requirement> # Block <docid> when (0) {; my $tc_docid = "$cell_value"; $tc_docid_tag = "<requirement><docid ><![CDATA +[$tc_docid]]></docid>\n"; print_log("Ligne\n $cell_col_position\n $tc_docid_ +tag") } # Block <title> when (1) { my $tc_title = "$cell_value"; $tc_title_tag = "<title><![CDATA[$tc_title]]>< +/title>\n"; } # Block <version> when (2) { my $tc_version = "$cell_value"; $tc_version_tag = "<version>$tc_version</versi +on>\n"; } # Block <revision> when (3) { my $tc_revision = "$cell_value"; $tc_revision_tag = "<revision> $tc_revision </ +revision>\n"; } # Block <node_order> when (4) { my $tc_node_order = "$cell_value"; $tc_node_order_tag = "<node_order>$tc_node +_order</node_order>\n"; } # Block <description> when (5) { my $tc_description = shift; my $tc_description = "$cell_value"; $t = encode_entities($tc_description, "àéèçûîùôâ"); $tc_description_tag = "<description><![CDATA[$ +t]]></description>\n"; } # Block <status> when (6) { my $tc_status = "$cell_value"; $tc_status_tag = "<status><![CDATA[V]]></status>\n +"; } # Block <type> when (7) { my $tc_type = "$cell_value"; $tc_type_tag = "<type><![CDATA[3]]></type>\n"; } # Block <excepted_coverage> when (8) { my $tc_coverage = "$cell_value"; $tc_coverage_tag = "<expected_coverage><![CDATA[1] +]></expected_coverage>\n"; # Final accumulated requirement $test_requirement = "$tc_docid_tag $tc_title_tag $tc_version_t +ag $tc_revision_tag $tc_node_order_tag $tc_description_tag $tc_status +_tag $tc_type_tag $tc_coverage_tag </requirement>\n"; print_xml $test_requirement; } # When cell is last in the excel row - call to print test case to +file if ($cell_col_position == 8) { $source_row_counter++; print_log ("File: Print accumulated keyword at row position $c +ell_row_position"); print_log ("Fichier $test_requirement"); print_xml $test_requirement; } } } } # Close XML file - double </testsuite> print_xml ($end_of_kw, "\n"); # deduct one row from row counter (reserved for headers) for get numb +er of processed Excel test cases $source_row_counter = $source_row_counter -1; log_and_print ("STEP2: OK \n Result: file $xml is created \n"); log_and_print ("Processed $source_row_counter rows from $source_file +\n See debug log for more details.\n\n"); close(XML); close(LOGFILE);
Block Description, the result is wrong, it' like that
éiùûôêçà<><br>aqwzsx
Thanks for your help

Replies are listed 'Best First'.
Re: Help encode_entities doesn't seem to work
by haukex (Archbishop) on Feb 09, 2019 at 22:38 UTC

    Hello and welcome to Perl and the Monastery, Balawoo.

    In your first piece of code, you have non-ASCII characters, but have commented out the use utf8;, why? utf8 is required to tell Perl that the source code is encoded in UTF-8, which is IMHO the best way to get non-ASCII characters into your Perl source. The following works fine for me when I save the source file as UTF-8:

    use warnings;
    use strict;
    use utf8;
    use HTML::Entities;
    
    print encode_entities("<br>àéèçûîùô<>"), "\n";
    
    __END__
    
    &lt;br&gt;&agrave;&eacute;&egrave;&ccedil;&ucirc;&icirc;&ugrave;&ocirc;&lt;&gt;
    

    As for your second piece of code, at the moment the formatting is broken, please fix your <code> tags. You've posted quite a long script, without input data. Please try to reduce this down to a Short, Self-Contained, Correct Example, that is, a piece of code that is as short as possible but still reproduces the problem, as well as some short sample input data. For example, it seems that the code is first converting the Excel file to a text file, and then turning that into an XML file - the whole Excel-to-text conversion could probably be removed from the question.

    Since it's quite difficult to test your code at the moment, I can only guess what might be going wrong in the script you showed. It looks like the code is using Spreadsheet::ParseExcel to parse an Excel file and is writing an XML file. I'm going to guess that there is some encoding issue, that is, maybe the strings you're getting from the Excel file are not encoded properly, or something is going wrong when writing to and reading from the intermediate text file.

    However, instead of trying to fix this, I have to say there are two general issues with the approach in this script: First, XML is not HTML, and XML by default does not know about any entities other than &quot;, &apos;, &lt;, &gt;, and &amp;, that's it. If you want to put non-ASCII characters in an XML file, then instead of using entities, IMHO the better way to do it is to make sure the file is written with the correct encoding, such as UTF-8, and make sure this is properly declared in the <?xml?> processing instruction at the top of the file. Second, I would not recommend writing the XML file manually like this. If you use a real XML module like XML::LibXML, it will do all the necessary escaping of special characters for you, and its methods for reading from and writing to files will take care of the encoding issues for you.

    If you invest the time into looking into how to use XML::LibXML, then I'm sure you'll be much happier in the long run than if you try to patch together XML like in the script you showed.

    use warnings; use strict; use XML::LibXML; my $doc = XML::LibXML::Document->createDocument('1.0', 'UTF-8'); my $req = $doc->createElement('requirements'); $doc->setDocumentElement($req); my $desc = $doc->createElement('description'); my $str = "\N{U+E0}\N{U+E9}\N{U+E8}\N{U+E7}\N{U+FB}\N{U+EE}\N{U+F9}\N{ +U+F4}"; $desc->appendText($str); $req->appendChild($desc); $doc->toFile('out.xml',1);

    Produces this XML file, correctly encoded as UTF-8:

    <?xml version="1.0" encoding="UTF-8"?>
    <requirements>
      <description>àéèçûîùô</description>
    </requirements>
    

    For debugging encoding issues, I usually use two tools: my own script enctool to check what encoding is being used in the input files and the Perl source code, and inside of Perl, Dump from Devel::Peek to see exactly what bytes are being stored and whether Perl's internal UTF8 flag is set. This is information that you should also post here, so that we can also know exactly what data you're dealing with.

        Normally, PerlMonks is not a coding service, but this one happened to be interesting to me. It appears that the Excel file is encoded in one of the Mac formats, I'm guessing MacRoman. I think this does what you want:

        use warnings; use strict; use Spreadsheet::ParseExcel (); use Spreadsheet::Read 'ReadData'; use Encode 'decode'; use XML::LibXML; my $INFILE = 'TestPGR.xls'; my $ENCODING = 'MacRoman'; my $OUTFILE = 'TestPGR.xml'; my %FIELDS = ( 1=>'docid', 2=>'title', 3=>'version', 4=>'revision', 5=>'node_order', 6=>'description', 7=>'status', 8=>'type', 9=>'expected_coverage', ); my $book = ReadData($INFILE, rc=>1, cells=>0); my $sheet = $book->[1] or die "Book doesn't have a sheet 1"; my $doc = XML::LibXML::Document->createDocument('1.0', 'UTF-8'); my $reqs = $doc->createElement('requirements'); $doc->setDocumentElement($reqs); for my $r ( $sheet->{minrow}+1 .. $sheet->{maxrow} ) { my $req = $doc->createElement('requirement'); for my $c ( $sheet->{mincol} .. $sheet->{maxcol} ) { next unless exists $FIELDS{$c}; my $val = decode($ENCODING, $sheet->{cell}[$c][$r], Encode::FB_CROAK); my $node = $doc->createElement($FIELDS{$c}); $node->appendText($val); $req->appendChild($node); } $reqs->appendChild($req); } $doc->toFile($OUTFILE,1);

        Output (a UTF-8 encoded file):

        <?xml version="1.0" encoding="UTF-8"?>
        <requirements>
          <requirement>
            <docid>PP10-RG-010</docid>
            <title>MASTER DATA</title>
            <version>1</version>
            <revision>1</revision>
            <node_order>1</node_order>
            <description>Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624).</description>
            <status>V</status>
            <type>3</type>
            <expected_coverage>1</expected_coverage>
          </requirement>
          <requirement>
            <docid>PP10-RG-020</docid>
            <title>MASTER DATA</title>
            <version>1</version>
            <revision>1</revision>
            <node_order>2</node_order>
            <description>éiùûôêçà</description>
            <status>V</status>
            <type>3</type>
            <expected_coverage>1</expected_coverage>
          </requirement>
          <requirement>
            <docid>PP10-RG-030</docid>
            <title>MASTER DATA</title>
            <version>1</version>
            <revision>1</revision>
            <node_order>2</node_order>
            <description>éiùûôêçà&lt;&gt;
        aqwzsx</description>
            <status>V</status>
            <type>3</type>
            <expected_coverage>1</expected_coverage>
          </requirement>
        </requirements>
        
Re: Help encode_entities doesn't seem to work
by haj (Vicar) on Feb 10, 2019 at 01:01 UTC

    In addition to what haukex wrote: there are some items in your code which you might want to inspect.

    The character literal "àéèçûîùô" can be expressed in either UTF-8 or iso-latin-1. If you save your source file containing that literal as UTF-8, then you must also use utf8;, if you save it as iso-latin-1, you must not use utf8.

    Applied to your problem: In your simple example, both the characters given to encode_entities and the string to encode are in the same file, so they have the same encoding. In your longer program, the list of characters to encode is in your source code, but the characters to encode are coming from your text file. Therefore there is a chance of a mismatch, and as a first guess which explains your symptoms I'd say that your source file is saved as with UTF-8 encoding but without use utf8;.

    You also should be aware that Excel files in those "old" formats are usually not UTF-8 encoded. You don't need to take care for this because Spreadsheet::ParseExcel does it for you. But you need to take care for the format you are writing: By not specifying an encoding for both your text and XML files, you get Perl's default, iso-latin-1 encoding. That doesn't hurt for the text file, since for the characters in question you should not see warnings like "Wide character in print". On the other hand, XML files are, per default or in your case per explicit declaration, UTF-8 encoded. To get that right, you should open the XML file like this:

    open (XML, ">:encoding(UTF8",  $xml) || die("Could not open file! $xml: '$!'");

    BTW: In contrast to haukex, I don't think that Devel::Peek or the UTF-8-flag are very helpful in hunting down these problems unless you are programming on XS level.

      Thanks for your help

      My need is to transform an excel file to XML. Like I have said and as I'm a newbie. I'm starting based on a script working for a non latin language and try to adapt it.


      Of course I have seen my library to parse Excel file is not the best one, but like it seems working I have continue to use is.
      Testlink it's my aim, I need to fill file to be able to upload some data. My target file for requirement is:

      <?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement> <docid><![CDATA[ENG-0001]]></docid> <title><![CDATA[Terrestrial Propulsor]]></title> <description><![CDATA[]]></description> </requirement> <requirement> <docid><![CDATA[ENG-0002]]></docid> <title><![CDATA[Main Deflector]]></title> <description><![CDATA[<p>Maindeflector bla, bla, bla.</p>]]> </description> </requirement> </requirements>

        Hello Balawoo,

        While writing this up, I saw that haukex beat me with a complete solution at Re^3: Help encode_entities doesn't seem to work. I agree that this is how the problem should be solved and therefore refrain from posting a copy of your script with minimal changes applied. Send me a message if you want this.

        The spreadsheet and code you have given in your response to haukex's article are very helpful to track that down. So here are the issues (many of which have already been pointed out by haukex in Re: Help encode_entities doesn't seem to work):

        • I downloaded your source file, and it turned out to be UTF-8. In that case, since you have non-ascii-characters in your source file, you must announce this to the Perl interpreter with use utf8;. But as the next point shows, you might get away without them anyway.
        • If you write an XML file as UTF-8, then you don't need to encode any characters to their entities (it doesn't work anyway, as haukex points out, because XML doesn't know about these named entities, nor does XML have a <br> element).
        • Now for the tricky part: The cells in your Excel sheet are encoded in a "native" character set ($cell->encoding returns 3), and it can be tricky to divine which native set. In your case, it seems that it is one of the encodings which is not detected and handled by Spreadsheet::ParseExcel. I got pretty far by assuming it is MacRoman, because then the cell F2 translates properly to Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624). Apparently you need to decode the values by yourself, fortunately the Encode module knows about MacRoman.
        • As I already wrote, you need to write your XML files in UTF-8-format if you declare it to be so, and I recommend to do the same for your intermediate text file though it isn't strictly necessary as long as all your data can be expressed in iso-latin-1 as well.
        • Finally, your resulting XML file is invalid. The reason is that you skip cell A1 (only this cell has (($cell_row_position == 0) and ($cell_col_position == 0)) while your comment says that you want to Skip cells from Row1 and Column A - reserved for Header and comments. I doubt about column A which contains the docid, so you probably just want to skip the first row.

        After applying all of these changes, I end up with the following XML file:

        <?xml version="1.0" encoding="UTF-8"?> <requirements> <requirement><docid ><![CDATA[PP10-RG-010]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>1</node_order> <description><![CDATA[Le format et le contenu des 2 documents sont dé +crits dans la SFD XXX (JIRA 624).]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-020]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> <requirement><docid ><![CDATA[PP10-RG-030]]></docid> <title><![CDATA[MASTER DATA]]></title> <version>1</version> <revision> 1 </revision> <node_order>2</node_order> <description><![CDATA[éiùûôêçà<> aqwzsx]]></description> <status><![CDATA[V]]></status> <type><![CDATA[3]]></type> <expected_coverage><![CDATA[1]]></expected_coverage> </requirement> </requirements>
Re: Help encode_entities doesn't seem to work
by haukex (Archbishop) on Feb 10, 2019 at 10:49 UTC
      I have updated the both posts