Hello and welcome to Perl and the Monastery, Balawoo.
In your first piece of code, you have non-ASCII characters, but you've commented out the use utf8; - why? The utf8 pragma is required to tell Perl that the source code is encoded in UTF-8, which is IMHO the best way to get non-ASCII characters into your Perl source. The following works fine for me when I save the source file as UTF-8:
use warnings;
use strict;
use utf8;
use HTML::Entities;
print encode_entities("<br>àéèçûîùô<>"), "\n";
__END__
<br>àéèçûîùô<>
As for your second piece of code, at the moment the formatting is broken, please fix your <code> tags. You've posted quite a long script, without input data. Please try to reduce this down to a Short, Self-Contained, Correct Example, that is, a piece of code that is as short as possible but still reproduces the problem, as well as some short sample input data. For example, it seems that the code is first converting the Excel file to a text file, and then turning that into an XML file - the whole Excel-to-text conversion could probably be removed from the question.
Since it's quite difficult to test your code at the moment, I can only guess what might be going wrong in the script you showed. It looks like the code is using Spreadsheet::ParseExcel to parse an Excel file and is writing an XML file. I'm going to guess that there is some encoding issue, that is, maybe the strings you're getting from the Excel file are not encoded properly, or something is going wrong when writing to and reading from the intermediate text file.
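To illustrate what I mean about the intermediate text file, here is a minimal sketch using only core modules. The temporary file is a hypothetical stand-in for your text file, and the sample string is my own; the point is only that the same encoding layer should be given both when writing and when reading:

```perl
use warnings;
use strict;
use File::Temp qw(tempfile);

my $text = "\N{U+E0}\N{U+E9}\N{U+E8}\N{U+E7}";   # four characters: a-grave, e-acute, e-grave, c-cedilla

# Hypothetical stand-in for the intermediate text file
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write characters, let the layer encode them as UTF-8
open my $out, '>:encoding(UTF-8)', $filename or die "$filename: $!";
print $out $text, "\n";
close $out or die "$filename: $!";

# Read with the same layer to get characters back
open my $in, '<:encoding(UTF-8)', $filename or die "$filename: $!";
chomp( my $read = <$in> );
close $in;

# Read without decoding: the same file yields raw bytes
open my $raw, '<:raw', $filename or die "$filename: $!";
chomp( my $bytes = <$raw> );
close $raw;

printf "characters: %d, bytes: %d\n", length $read, length $bytes;
```

If the string you read back is twice as long as you expect, or the accented characters come out as two garbage characters each, a missing or mismatched layer on one side is a likely suspect.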
However, instead of trying to fix this, I have to say there are two general issues with the approach in this script:
First, XML is not HTML, and XML by default does not know about any entities other than &quot;, &apos;, &lt;, &gt;, and &amp;, that's it. If you want to put non-ASCII characters in an XML file, then instead of using entities, IMHO the better way to do it is to make sure the file is written with the correct encoding, such as UTF-8, and make sure this is properly declared in the <?xml ...?> declaration at the top of the file.
Second, I would not recommend writing the XML file manually like this. If you use a real XML module like XML::LibXML, it will do all the necessary escaping of special characters for you, and its methods for reading from and writing to files will take care of the encoding issues for you.
If you invest the time into looking into how to use XML::LibXML, then I'm sure you'll be much happier in the long run than if you try to patch together XML like in the script you showed.
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML::Document->createDocument('1.0', 'UTF-8');
my $req = $doc->createElement('requirements');
$doc->setDocumentElement($req);
my $desc = $doc->createElement('description');
my $str = "\N{U+E0}\N{U+E9}\N{U+E8}\N{U+E7}\N{U+FB}\N{U+EE}\N{U+F9}\N{U+F4}";
$desc->appendText($str);
$req->appendChild($desc);
$doc->toFile('out.xml',1);
Produces this XML file, correctly encoded as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
<requirements>
  <description>àéèçûîùô</description>
</requirements>
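As for the escaping, here is a small sketch of what I mean, assuming XML::LibXML is installed. The sample text is made up by me, and I'm serializing to a string instead of a file only to keep the example self-contained:

```perl
use warnings;
use strict;
use XML::LibXML;

my $doc  = XML::LibXML::Document->createDocument('1.0', 'UTF-8');
my $root = $doc->createElement('requirements');
$doc->setDocumentElement($root);

# Hypothetical sample text with markup characters and non-ASCII characters
my $text = "<br> & \N{U+E0}\N{U+E9}";
my $desc = $doc->createElement('description');
$desc->appendText($text);   # the module escapes < and & on output
$root->appendChild($desc);

# toString on a document returns bytes in the document's encoding
my $xml = $doc->toString(1);

# Parsing the result back should give the original string again
my $back = XML::LibXML->load_xml(string => $xml)
    ->documentElement->findvalue('description');

print $back eq $text ? "roundtrip ok\n" : "roundtrip FAILED\n";
```

Note that I never call encode_entities or do any manual escaping here; the text goes in as a plain Perl string and comes back out the same way.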
For debugging encoding issues, I usually use two tools: my own script enctool to check what encoding is being used in the input files and the Perl source code, and inside of Perl, Dump from Devel::Peek to see exactly what bytes are being stored and whether Perl's internal UTF8 flag is set. This is information that you should also post here, so that we can also know exactly what data you're dealing with.
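Here is a short sketch of the kind of Devel::Peek check I mean. The two variables are just illustrative examples of a character string and its UTF-8 encoded bytes, not anything taken from your script:

```perl
use warnings;
use strict;
use Devel::Peek qw(Dump);
use Encode qw(encode);

my $chars = "\N{U+E0}\N{U+E9}";        # two characters
my $bytes = encode('UTF-8', $chars);   # the same text as four UTF-8 bytes

# Dump writes to STDERR; compare the PV and FLAGS lines of the two dumps
Dump($chars);
Dump($bytes);

printf "chars: length %d, bytes: length %d\n", length $chars, length $bytes;
```

In the Dump output, look at the FLAGS line to see whether UTF8 is listed, and at the PV line to see which bytes are actually stored. Calling Dump on a string right after reading it from the Excel file, and again just before writing it out, should show at which step things go wrong.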
In reply to Re: Help encode_entities doesn't seem to work
by haukex
in thread [SOLVED] -Help encode_entities doesn't seem to work
by Balawoo