http://qs1969.pair.com?node_id=11133090

derion has asked for the wisdom of the Perl Monks concerning the following question:

I have some XML code that I get with LWP::UserAgent.
The XML is declared as, and seems to contain, ISO-8859-1 data; its header is:
<?xml version="1.0" encoding="ISO-8859-1"?>...
I used the following code to get the data:
my $url = 'https://some.url/file.xml';
my $LWP_Data;
use LWP::UserAgent;
$LWP_Data->{ua} = LWP::UserAgent->new;
$LWP_Data->{ua}->timeout(7);
$LWP_Data->{feed} = $LWP_Data->{ua}->get($url);
if ($LWP_Data->{feed}->is_success) {
    my $xml = $LWP_Data->{feed}->content;
    use XML::Simple qw(:strict);
    my $ref = XMLin( $xml, ForceArray => 1, KeyAttr => [ ]);
    ...

The result in $ref shows corrupted characters, e.g. ü instead of ü, which made me think that the original data might not be ISO-8859-1 as declared. After some testing I assume that XML::Simple simply does not care about the declared encoding because the following works.

my $url = 'https://some.url/file.xml';
my $LWP_Data;
use LWP::UserAgent;
$LWP_Data->{ua} = LWP::UserAgent->new;
$LWP_Data->{ua}->timeout(7);
$LWP_Data->{feed} = $LWP_Data->{ua}->get($url);
if ($LWP_Data->{feed}->is_success) {
    my $xml = $LWP_Data->{feed}->content;
    use Encode;
    my $encoded_xml = Encode::encode_utf8($xml);
    use XML::Simple qw(:strict);
    my $ref = XMLin( $encoded_xml, ForceArray => 1, KeyAttr => [ ]);
    # for a not nested hash ref
    foreach my $key (keys %{$ref}) {
        Encode::from_to($ref->{$key}, "UTF-8", "iso-8859-1");
    }
    ...

What I found in the forum is Character Conversion Conundrum (https://perlmonks.org/?node_id=416914), which seems to imply that XML::Simple converts to UTF-8. However, it does not work when I only convert the results from UTF-8 to ISO-8859-1.

I am curious whether my assumption is correct, whether what I found by trial and error is documented somewhere, or whether I am on the wrong track.

Thanks for your feedback.
Regards
derion

Re: XML::Simple and ISO-8859-1 encoding buggy?
by GrandFather (Saint) on May 26, 2021 at 22:03 UTC

    From the XML::Simple documentation:

    SYNOPSIS

    PLEASE DO NOT USE THIS MODULE IN NEW CODE. If you ignore this warning and use it anyway, the qw(:strict) mode will save you a little pain.

    That doesn't directly address your conversion issue, but it does speak to the (low) likelihood of the module author fixing the problem. The last update to the module was over three years ago.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      While that is generally true, especially for generating XML files, i have yet to find another XML parser that is as simple and easy to use. XML::Rules is somewhat promising, but still adds a lot of extra complications.

      When it comes to parsing config files, i generally don't give an f### if it validates against some XSD schema (and i usually don't bother to even create one). I don't need "sequential parsing" or "streaming" because of the file size. All i want to do is turn that file into a hashref. And i don't care about all the "advanced" XML stuff like namespaces. The only reason i'm not using JSON is because i find XML easier to read and write by hand.

      For some time now i've been thinking of either forking XML::Simple (and ripping out all the "write XML files" stuff) or writing a wrapper for something like XML::Rules that emulates the XML::Simple behaviour as closely as possible. In my humble opinion, we really need a module that's as simple as XML::Simple, at least for reading XML config files. Basically the equivalent of simplicity to parsing a JSON file.

      perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
        > Basically the equivalent of simplicity to parsing a JSON file.

        The problem is the DOM tree of XML doesn't correspond 1:1 to common data structures (scalars, arrays, hashes).

        <r ch="1"> <ch>2</ch> <ch>3</ch> <!-- <ch>4</ch> --> </r>
        What data structure do you expect?
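
To make the ambiguity concrete, here is a small sketch of two plain Perl structures one might reasonably expect for that snippet. Both shapes, and the `attributes`, `children`, `name` and `text` keys, are hypothetical illustrations; neither is claimed to be what XML::Simple or any other module actually returns.

```perl
use strict;
use warnings;

# Candidate A: the attribute and the child elements are merged under one key.
# Compact, but the attribute value can no longer be told apart from the elements.
my %merged = ( ch => [ 1, 2, 3 ] );

# Candidate B (hypothetical layout): attributes and children are kept apart.
# Unambiguous, but no longer a "simple" hash to walk.
my %separate = (
    attributes => { ch => 1 },
    children   => [
        { name => 'ch', text => 2 },
        { name => 'ch', text => 3 },
    ],
);

# The commented-out <ch>4</ch> appears in neither structure.
printf "merged: %d values under 'ch'\n", scalar @{ $merged{ch} };
printf "separate: %d attribute(s), %d child element(s)\n",
    scalar keys %{ $separate{attributes} }, scalar @{ $separate{children} };
```

Neither candidate is wrong, which is the point: the mapping from an XML tree to hashes and arrays is a design decision rather than something that falls out of the format.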

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        While I personally do think XML::Simple is sometimes (but rarely!) ok to use, for reading very simple XML files that happen to be structured in a way that the module does handle well (which one can make sure of via a Schema), IMHO the major problem of XML::Simple is that it does not handle the structure of the XML file changing very well at all. Someone may start out a project with a simple XML file that the module can handle, but as the project grows and the XML's structure becomes more complex, one begins jumping through hoops to bend the data structure back into shape. Another problem is that the module's name lends itself to the misunderstanding that it's a simple way to read arbitrary XML files, which is most certainly not the case. Both of these cases are well-represented across various threads on this site, and in most cases one is left arguing with the wisdom seekers who don't necessarily want to move away from a module they've already invested in. Hence the recommendation against the module in general makes sense - it's one of those "only use this if you know what you're doing and why" things.

        XML::Rules is somewhat promising, but still adds a lot of extra complications.

        Really? I personally don't think so; I have several XML::Rules examples on my scratchpad.

        True, for that reason I do not use XML::Twig for small and simple files but prefer XML::Simple.
        So having a solution to simply get XML data and put it in a hashref is exactly why I ended up with XML::Simple.
        I did not yet have any problems with the parsing side of XML::Simple and did not realize the "PLEASE DO NOT USE THIS MODULE IN NEW CODE." part.
Re: XML::Simple and ISO-8859-1 encoding buggy?
by cavac (Parson) on May 27, 2021 at 07:11 UTC

    After some testing I assume that XML::Simple simply does not care about the declared encoding because the following works.

    Frankly, i have seen enough XML files where the declared encoding is not matching the actual encoding. While your case might or might not be an XML::Simple problem, trusting the declared encoding is not much better...

    perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'

      I did not yet find a way to automatically determine the encoding of a file/data but would be very happy to have something for at least a good guess.

      In my case I downloaded the file and opened it with Notepad++; the special characters were displayed as they should be, and Notepad++ reported that the file is ISO-8859-1.
      Nevertheless I dare not say that the encoding could not be something else.

        I did not yet find a way to automatically determine the encoding of a file/data but would be very happy to have something for at least a good guess.

        While sadly files with unknown encodings are not rare, IMHO guessing encodings should be a matter of last resort, much better is to find out how the files were generated and get their encoding from there. But if you really don't know the encoding of the input files, you could use a module like Encode::Guess, or I've written a tool that tries to be a little smarter: enctool - it allows you to narrow down the guesses by specifying what characters are expected to appear in the input file using e.g. the --one-of='\xFC' option (which in this case will look for U+00FC LATIN SMALL LETTER U WITH DIAERESIS ("ü"); you can specify multiple possible characters).
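
As a last-resort illustration of guessing, here is a minimal sketch that uses only the core Encode module. It is neither Encode::Guess nor enctool, but a simpler swapped-in heuristic: try a strict UTF-8 decode first and fall back to ISO-8859-1 if that fails. The helper name `decode_utf8_or_latin1` is made up for this example, and the approach assumes the input can only be one of those two encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (decoded text, name of the encoding assumed).
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;

    # decode() with a CHECK value may modify its argument, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, 'UTF-8') if defined $text;

    # Every byte sequence is valid ISO-8859-1, so this fallback cannot fail.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

for my $sample ("\xFC", "\xC3\xBC") {
    my ($text, $encoding) = decode_utf8_or_latin1($sample);
    printf "%vX decoded as %s gives U+%04X\n", $sample, $encoding, ord $text;
}
```

The weakness is the usual one for guessing: ISO-8859-1 data that happens to form valid UTF-8 sequences will be misidentified, which is why knowing how the file was produced remains the better option.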

        And I must second the opinion that XML::Simple shouldn't be used. For full-fledged XML support, I'd suggest XML::LibXML, or for a XML::Simple replacement, have a look at XML::Rules.

Re: XML::Simple and ISO-8859-1 encoding buggy?
by ikegami (Patriarch) on May 27, 2021 at 21:21 UTC

    XML::Simple's design is extremely problematic. So much so that the module's own documentation tells you not to use it. wtf are you doing using this module?!


    XML::Simple and ISO-8859-1 encoding buggy?

    Decoding is handled by the XML parser. You didn't specify which XML parser you are using. (No, XML::Simple is not an XML parser.) XML::Parser is commonly used by XML::Simple, and XML::Parser handles iso-8859-1 just fine.

    use 5.014;
    use warnings;

    use XML::Simple qw( :strict );  # Taken from OP.
    use File::Slurper qw( read_binary );

    my $xml = read_binary($ARGV[0]);

    # Make sure we know which parser is being used.
    local $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

    # Taken from OP.
    my $doc = XMLin($xml, ForceArray => 1, KeyAttr => [ ]);

    say sprintf "%vX", $doc;

    $ perl a.pl a_latin1.xml
    E9
    $ perl a.pl a_utf8.xml
    E9

    On a terminal expecting UTF-8:

    $ cat a_utf8.xml
    <?xml version="1.0"?><root>é</root>
    
    $ cat a_latin1.xml | iconv -f iso-8859-1
    <?xml version="1.0" encoding="ISO-8859-1"?><root>é</root>
    

    Seeking work! You can reach me at ikegami@adaelis.com
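
For readers unfamiliar with the `%vX` format used in the test script above, here is a short sketch of what it reveals. It uses only the core Encode module; the variable names are illustrative. The format prints the ordinal of every element of a string in hex, joined by dots, so it shows whether a string holds decoded characters or still-encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_bytes = "\xC3\xA9";                     # "é" as UTF-8 bytes, not yet decoded
my $characters = decode('UTF-8', $utf8_bytes);   # one character, U+00E9

printf "undecoded: %vX\n", $utf8_bytes;   # prints C3.A9
printf "decoded:   %vX\n", $characters;   # prints E9
```

Seeing `E9` for both input files therefore means the parser produced the same single character in both cases, regardless of how the file was encoded on disk.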

      Wow, thank you very much, that was exactly the answer to the question.

      wtf are you doing using this module?!

      You are right, I started using it more than ten years ago and did not really look at the warning, constant amateur behaviour.

      No, XML::Simple is not an XML parser.

      Something I did not realize

      local $XML::Simple::PREFERRED_PARSER = 'XML::Parser';

      Defining the Parser leads to the expected behaviour and solves the task.

      I will have to try to move on to something else and will start with XML::Rules. Nevertheless the not recommended module works as it should with the above.

        Nevertheless the not recommended module works as it should with the above.

        You can "never" be sure that an XML::Simple solution works. That's the problem with it.

        Seeking work! You can reach me at ikegami@adaelis.com

Re: XML::Simple and ISO-8859-1 encoding buggy?
by Anonymous Monk on May 27, 2021 at 15:19 UTC

    Off-the-wall suggestion: what happens if, instead of hand-decoding $xml, you use

    my $xml = $LWP_Data->{feed}->decoded_content;

    I do not use XML::Simple, but a quick read seems to say that decoding (if any) is done by the back end. Maybe the back end (whatever it is) assumes that if you hand it a string that string has already been decoded?

    Of course, this only works if the HTTP::Response object contains the encoding. You can check using lwp-request -m HEAD https://some.url/file.xml.

      That would be the opposite of what you want. You want the undecoded content.
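
A minimal sketch of why the undecoded bytes matter, using only the core Encode module. The sample document and variable names are made up for illustration, and decoding by hand with the declared encoding merely stands in for a reader that trusts the XML declaration. It is not a claim about what happened in the original poster's setup.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive over HTTP: "ü" is the single byte FC.
my $raw = qq{<?xml version="1.0" encoding="ISO-8859-1"?><root>\xFC</root>};

# Decoding to characters and re-encoding as UTF-8 changes the payload bytes,
# but the declaration inside the document still says ISO-8859-1.
my $reencoded = encode('UTF-8', decode('ISO-8859-1', $raw));

# What a reader that trusts the declaration gets from each byte string:
my $from_raw       = decode('ISO-8859-1', $raw);
my $from_reencoded = decode('ISO-8859-1', $reencoded);

my ($good) = $from_raw       =~ m{<root>(.*)</root>};
my ($bad)  = $from_reencoded =~ m{<root>(.*)</root>};

printf "raw bytes:  %vX\n", $good;   # prints FC
printf "re-encoded: %vX\n", $bad;    # prints C3.BC
```

Once the bytes no longer match the declaration, a declaration-trusting reader yields two wrong characters instead of one correct one, so the parser should be given the bytes exactly as received.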

      Seeking work! You can reach me at ikegami@adaelis.com