in reply to Re: Help encode_entities doesn't seem to work
in thread [SOLVED] -Help encode_entities doesn't seem to work

Thanks for your comment. I'll look into the XML libraries, since I need specific data enclosed in brackets. As I said, I'm not a development expert (I'm a controller); I'm trying to learn by starting from code provided by other people. For an example of the data, you can download this file

https://drive.google.com/file/d/15UQuhSP28qDPZfSFRglCHu9GIYf0DuG2/view?usp=sharing

and the code

https://drive.google.com/file/d/1wy7SXxUlEjmK0DlroEHY14DP9FH49euk/view?usp=sharing

thanks

Re^3: Help encode_entities doesn't seem to work
by haukex (Archbishop) on Feb 10, 2019 at 10:55 UTC

    Normally, PerlMonks is not a coding service, but this one happened to be interesting to me. It appears that the Excel file is encoded in one of the Mac formats, I'm guessing MacRoman. I think this does what you want:

    use warnings;
    use strict;
    use Spreadsheet::ParseExcel ();
    use Spreadsheet::Read 'ReadData';
    use Encode 'decode';
    use XML::LibXML;

    my $INFILE   = 'TestPGR.xls';
    my $ENCODING = 'MacRoman';
    my $OUTFILE  = 'TestPGR.xml';
    # Map spreadsheet column numbers to XML element names
    my %FIELDS = (
        1=>'docid', 2=>'title', 3=>'version', 4=>'revision',
        5=>'node_order', 6=>'description', 7=>'status', 8=>'type',
        9=>'expected_coverage',
    );

    my $book  = ReadData($INFILE, rc=>1, cells=>0);
    my $sheet = $book->[1] or die "Book doesn't have a sheet 1";

    my $doc  = XML::LibXML::Document->createDocument('1.0', 'UTF-8');
    my $reqs = $doc->createElement('requirements');
    $doc->setDocumentElement($reqs);

    # One <requirement> element per row, skipping the first row
    for my $r ( $sheet->{minrow}+1 .. $sheet->{maxrow} ) {
        my $req = $doc->createElement('requirement');
        for my $c ( $sheet->{mincol} .. $sheet->{maxcol} ) {
            next unless exists $FIELDS{$c};
            # Turn the raw cell bytes into a Perl character string
            my $val = decode($ENCODING, $sheet->{cell}[$c][$r], Encode::FB_CROAK);
            my $node = $doc->createElement($FIELDS{$c});
            $node->appendText($val);
            $req->appendChild($node);
        }
        $reqs->appendChild($req);
    }
    $doc->toFile($OUTFILE,1);

    Output (a UTF-8 encoded file):

    <?xml version="1.0" encoding="UTF-8"?>
    <requirements>
      <requirement>
        <docid>PP10-RG-010</docid>
        <title>MASTER DATA</title>
        <version>1</version>
        <revision>1</revision>
        <node_order>1</node_order>
        <description>Le format et le contenu des 2 documents sont décrits dans la SFD XXX (JIRA 624).</description>
        <status>V</status>
        <type>3</type>
        <expected_coverage>1</expected_coverage>
      </requirement>
      <requirement>
        <docid>PP10-RG-020</docid>
        <title>MASTER DATA</title>
        <version>1</version>
        <revision>1</revision>
        <node_order>2</node_order>
        <description>éiùûôêçà</description>
        <status>V</status>
        <type>3</type>
        <expected_coverage>1</expected_coverage>
      </requirement>
      <requirement>
        <docid>PP10-RG-030</docid>
        <title>MASTER DATA</title>
        <version>1</version>
        <revision>1</revision>
        <node_order>2</node_order>
        <description>éiùûôêçà&lt;&gt;
    aqwzsx</description>
        <status>V</status>
        <type>3</type>
        <expected_coverage>1</expected_coverage>
      </requirement>
    </requirements>
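
    To illustrate what the decode step in the code above does, here is a minimal sketch. It assumes the cells really are MacRoman, which is only my guess, and the two bytes are made-up stand-ins for a cell value, not data taken from the file:

    ```perl
    use warnings;
    use strict;
    use Encode qw(decode encode);

    # Assumption: these bytes stand in for a raw cell value from the .xls file.
    # In MacRoman, byte 0x8E is "é" and byte 0x8D is "ç".
    my $bytes = "\x8E\x8D";

    # decode() turns the bytes into a Perl character string.
    # Note: with FB_CROAK, decode() consumes $bytes, so don't reuse it afterwards.
    my $text = decode('MacRoman', $bytes, Encode::FB_CROAK);

    # Encoding the characters as UTF-8 gives the bytes that end up in the XML file.
    my $utf8 = encode('UTF-8', $text);
    ```

    XML::LibXML does the final UTF-8 encoding itself when the document is written, so the script only needs the decode half.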
    
      Many thanks for your reply.
      I would like to convert the MacRoman characters to HTML entities, e.g. é --> & eacute; (I added a space so that the HTML is not interpreted).
      I don't know how to do that.

      Many thanks, haukex.

      It's strange: if I run the same code on an XLSX file, I get an error on accented characters. I'm working to understand what is wrong.

        It's strange: if I run the same code on an XLSX file, I get an error on accented characters.

        XLSX files contain XML files, and assuming Excel is writing them correctly, these encoding issues should hopefully not arise. In other words, the strings you get should not need the extra decode step, and that line in my code could be changed to my $val = $sheet->{cell}[$c][$r];.
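
        If the same script has to handle both kinds of file, one option is to decode only for the old format. This is a rough sketch, not tested against your files: cell_text is a hypothetical helper name, and it assumes that .xlsx values already arrive as character strings while .xls values are still raw bytes:

        ```perl
        use warnings;
        use strict;
        use Encode 'decode';

        # Hypothetical helper: return the cell value as a character string.
        # Assumption: only values from legacy .xls files need decoding.
        sub cell_text {
            my ($raw, $infile, $encoding) = @_;
            return $raw unless defined $raw;           # empty cell
            return $raw if $infile =~ /\.xlsx\z/i;     # assumed to be decoded already
            return decode($encoding, $raw, Encode::FB_CROAK);
        }
        ```

        In the loop, the line would then read my $val = cell_text($sheet->{cell}[$c][$r], $INFILE, $ENCODING);.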

        I would like to convert the MacRoman characters to HTML entities, e.g. é --> & eacute;

        As I said and as haj explained some more, this doesn't really make sense in the context of XML, only in HTML, but you haven't explained how you want to integrate the two. I think the best thing would be if you could show an example of exactly what the output file should look like. Note that even in HTML, if the encoding is declared correctly, you don't even need escapes like &eacute;.
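
        If you do end up needing entity escapes for an HTML target, the usual tool is HTML::Entities, applied to the decoded string rather than to the raw bytes. A small sketch, again assuming MacRoman input and using made-up bytes:

        ```perl
        use warnings;
        use strict;
        use Encode 'decode';
        use HTML::Entities qw(encode_entities encode_entities_numeric);

        # Assumption: these bytes stand in for a MacRoman cell value ("é" then "<").
        my $bytes = "\x8E<";
        my $text  = decode('MacRoman', $bytes, Encode::FB_CROAK);

        # Named entities: defined for HTML, not for plain XML.
        my $named = encode_entities($text);            # "&eacute;&lt;"

        # Numeric character references are understood by HTML and XML alike.
        my $numeric = encode_entities_numeric($text);
        ```

        Keep in mind that passing either result to appendText would escape the & once more (giving &amp;eacute;), which is one more reason to let XML::LibXML handle the escaping in the XML file.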