ä, ö, ü and XML::Simple

by BioHazard (Pilgrim)
on May 02, 2002 at 19:14 UTC (#163641=perlquestion)

BioHazard has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I am using XML::Simple on my Windows machine and it works quite well. There is just one thing I do not understand: XML::Simple does not seem to cope with ä, ö and ü. I need these characters for German-language HTML(!) output. I have taken a closer look at the module and searched for an escape method for ", & etc. I found one, but only for XMLout(). I could at least change that method to deal with ä, ö and ü for XML output, but the more important thing is XMLin(). For example, it does read an ä from the XML file, but not &auml;, and that is exactly what I need for the HTML output. Is there a way to escape these values without making changes throughout the hash-reference tree produced by XMLin()?

Thank you for your help

P.S.: I have to apologize for my English...

Replies are listed 'Best First'.
Re: ä, ö, ü and XML::Simple
by choocroot (Friar) on May 02, 2002 at 19:52 UTC
Re: ä, ö, ü and XML::Simple
by mojotoad (Monsignor) on May 02, 2002 at 22:23 UTC
    I have not used XML::Simple, but shouldn't it pay attention to the encoding specified in the XML declaration? For German, you'd want the Latin-2 alphabet:

    <?xml version="1.0" encoding="ISO-8859-2"?>

    I gather that XML::Simple can ride on top of XML::Parser. From the XML::Parser docs:

    Expat has built-in encodings for: UTF-8, ISO-8859-1, UTF-16, and US-ASCII. Encodings are set either through the XML declaration encoding attribute or through the ProtocolEncoding option to XML::Parser or XML::Parser::Expat. For encodings other than the built-ins, expat calls the function load_encoding in the Expat package with the encoding name. This function looks for a file in the path list @XML::Parser::Expat::Encoding_Path, that matches the lower-cased name with a '.enc' extension. The first one it finds, it loads.
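
To make the quoted passage concrete, here is a minimal sketch of the XML-declaration route using XML::Parser directly. It assumes ISO-8859-1 rather than ISO-8859-2, only because the passage above lists ISO-8859-1 among expat's built-in encodings and it also covers ä, ö and ü; the sample document and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# A byte string in ISO-8859-1: \xE4 is the single byte for "ä" in that encoding.
# (Hypothetical sample document, not taken from the thread.)
my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
        . qq{<doc>M\xE4dchen</doc>};

# Collect character data as the parser reports it.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $chunk) = @_; $text .= $chunk },
    },
);
$parser->parse($xml);

# The parser hands back Perl character strings, so the umlaut
# should arrive as the single code point U+00E4.
printf "parsed %d characters, second one is U+%04X\n",
    length($text), ord(substr($text, 1, 1));
```

If the file cannot carry a declaration, the ProtocolEncoding option mentioned in the quoted docs is the other place to state the encoding.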


Re: ä, ö, ü and XML::Simple
by ChemBoy (Priest) on May 03, 2002 at 07:21 UTC

    I've dealt with various aspects of this problem at different times, so let me take a stab here...

    The first option that comes to mind is this: if XML::Simple can handle your character set, and your character set is an acceptable one for web browsers (such as ISO-Latin-1), why not just use the raw characters? Most browsers that can display the characters correctly at all can handle that character set, as far as I know.

    However, I'll try to answer the opposite question as well (can't hurt, and might just be helpful).

    The problem you're having is that XML::Simple does not recognise the entities you're passing it in your XML source. This is entirely appropriate--as far as I know, XML::Simple only understands basic XML entities, of which (again, as far as I know) there are very few: only & < and > (&amp; &lt; and &gt;) spring to mind. Therefore, when it encounters something like &auml;, which is unquestionably an entity but not one it's familiar with, it does what every good XML parser does when it finds something unexpected: die.

    The obvious solution to this is to tell the parser to recognize your entities, but there are two objections:

    1. that could easily get rather un-simple
    2. that would definitely defeat your original purpose

    Why this last? Well, by the time XML::Simple spits out your parsed data, it has already translated the entities in its input to the corresponding character data (much as a web browser does with HTML entities). Which leaves us right where we started, really: if you can handle outputting the raw characters to the browser, then just put them in your XML source to begin with.

    However, this suggests the solution that I personally have used for this problem the few times I've encountered it: double-escape the data going into your XML source. That is, if you want to parse your XML and have it contain the string "&eacute;", arrange for your XML source file to contain the string "&amp;eacute;". The alternative is to enclose the relevant sections in CDATA tags, which is acceptable for some things (including wholesale HTML markup in XML files) but generally overkill, in my opinion.
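
A small sketch of the double-escaping idea on the reading side, assuming XML::Simple is installed with a working parser behind it. The element names and sample word are invented for illustration. The first call is wrapped in eval because, as described above, an undeclared entity such as &auml; is expected to make the parser die.

```perl
use strict;
use warnings;
use XML::Simple;

# &auml; is an HTML entity, not one declared in this XML document,
# so the parser is expected to reject it.
my $bad = eval { XMLin('<doc><word>M&auml;dchen</word></doc>') };
my $rejected = defined $bad ? 'no' : 'yes';

# Double-escaped: &amp; is predefined in XML and decodes to "&",
# which leaves the literal text "&auml;" in the parsed data.
my $data = XMLin('<doc><word>M&amp;auml;dchen</word></doc>');

print "undeclared entity rejected: $rejected\n";
print "parsed value: $data->{word}\n";
```

The parsed value can then be printed into an HTML page as-is, and the browser does the final entity decoding.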

    To actually do this programmatically (assuming you're dealing with input that includes the literal characters you're trying to escape), you're probably best off with HTML::Entities, as mentioned above: it's distributed with HTML::Parser but does not partake of the weightiness of that module (or its need for compilation). If you have it installed, then something along these general lines should do the trick:

    use HTML::Entities;
    use XML::Simple;
    while (<TEXT_FILE>) {
        encode_entities $_;
        encode_entities $_;   # yes, really twice
        do_stuff($_);
    }
    print XMLout($foo);   # the data structure built by do_stuff()

    Possibly the lamest code example I've ever posted, that... I do suggest that comment, though, for the benefit of your associates and successors. If that doesn't encode all the characters you need encoded, check out the other parameters to that function--it can do what you need done.

    Good luck!

    Update: added print line to snippet, in a possibly doomed attempt to make it resemble actual code.

    Update: doh! Working too hard and thinking too little--XMLout does, of course, escape XML entities, so only one round of HTML escaping is called for (if you're using XMLout). Thanks to ajt for the catch!
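
A sketch of the corrected, single-round approach from the update, assuming both HTML::Entities and XML::Simple are available. The hash, the RootName value and the sample word are hypothetical; the point is only that XMLout escapes the ampersand itself, so the XML source ends up double-escaped after one explicit round of HTML escaping.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple;

# One round of HTML escaping: the literal "ä" (U+00E4) becomes "&auml;".
my $html_ready = encode_entities("M\x{E4}dchen");

# XMLout escapes the "&" on its own while writing the XML.
my $xml = XMLout({ word => $html_ready }, RootName => 'doc', NoAttr => 1);

# Reading the XML back undoes only the XML layer of escaping.
my $round_trip = XMLin($xml);

print $xml;
print "after XMLin: $round_trip->{word}\n";
```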

    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders

      Thank you very much!

      What I needed was a combination of mojotoad's and ChemBoy's suggestions. I had not realized that UTF-8 does not accept ä etc. With ISO-8859-2 the script does not die. And with the double encoding, like "&amp;auml;", the browser prints the string I actually wanted.

      Again, thank you for helping me!


Node Type: perlquestion [id://163641]
Approved by VSarkiss