PerlMonks  

Re^2: Converting HTML special entities to XML

by iburrell (Chaplain)
on Sep 01, 2004 at 22:49 UTC


in reply to Re: Converting HTML special entities to XML
in thread Converting HTML special entities to XML

I think it is better to translate them to character references. The entities can't be represented accurately other than with Unicode. The HTML entity resolver would need to produce UTF-8 strings.

This assumes that the HTML to XML process is converting escaped text to escaped text. If the text is being unescaped for other reasons, then the entities should be expanded to UTF-8 and escaped on output.
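
As a rough illustration of the escaped-text-to-escaped-text case, here is a minimal Python sketch that rewrites HTML named entities as numeric character references. The function name `entities_to_char_refs` is made up for this example, the lookup table is Python's HTML 4 entity list (`html.entities.name2codepoint`), and unknown names are simply left alone; treat it as one possible approach under those assumptions, not a complete converter.

```python
import re
from html.entities import name2codepoint

# The five entities that XML itself predefines; they can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &eacute; (numeric refs are not matched).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_char_refs(text):
    """Replace HTML named entities with decimal character references.

    Hypothetical helper: entities XML already knows, and names missing
    from the HTML 4 table, are passed through unchanged.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY.sub(replace, text)


print(entities_to_char_refs("caf&eacute; &amp; &copy;"))
```

Because the output contains only ASCII character references, it does not matter which charset the surrounding document is later stored in.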

Re^3: Converting HTML special entities to XML
by Aristotle (Chancellor) on Sep 01, 2004 at 23:36 UTC

    They should always be expanded to UTF-8 and escaped on output. Your HTML parser should just give you Unicode, and whatever XML generator you use should be escaping it automatically for you as appropriate for the target encoding.

    Don't attempt to transcode entities and whatnot manually in order to insert literal bytes into the output XML stream. That way lies madness (and a lot of buggy code; most code dealing with XML out there is quite broken with regard to encodings).

    Makeshifts last the longest.

      It really depends on what kind of processing you are doing. Dealing with the unescaped characters is the safest approach, but it requires dealing with charset issues and making sure the output is escaped properly.

      Dealing with the escaped text, in its native charset, is simpler. Character references can help because you don't need to worry about character sets for them; they are always Unicode. In fact, they are the safest way to get Unicode characters into a document with all the charset mangling that goes on.

        All that is why I'm giving the recommendation that I'm giving. :-)

        You can work with escaped characters and avoid going through Unicode, if you wish; but it is hard to get that really right and most people don't.

        That's why I assert that you should not work with the HTML directly and should not work with the XML directly. It's safest to think of HTML and XML not as data formats, but as opaque serializations of a data structure. You ask one deserializer for the data structure, and get something unambiguous (i.e., Unicode) that you can work with to your heart's content; then you give the still unambiguous result to another serializer that produces conforming output for you.

        As I said, you can do it differently. Just as you can avoid using strict. It's just much easier to not shoot yourself in the foot if you stick to that practice.

        Makeshifts last the longest.
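
The practice recommended in this thread (ask a deserializer for Unicode, then hand that Unicode to a serializer) might look roughly like the Python sketch below. It is an illustration only: `html.unescape` stands in for a real HTML parser, the element name `p` and the function name `html_text_to_xml` are invented for the example, and `xml.etree.ElementTree` plays the part of the XML generator that escapes for the target encoding.

```python
from html import unescape
import xml.etree.ElementTree as ET


def html_text_to_xml(html_text, encoding="us-ascii"):
    """Decode HTML-escaped text to Unicode, then let a serializer escape it.

    Hypothetical helper: no entity is transcoded by hand at any point.
    """
    # Deserialize: every entity becomes a plain Unicode character.
    unicode_text = unescape(html_text)

    # Work on the unambiguous Unicode string here, if needed.

    # Serialize: the XML library escapes markup characters and turns
    # anything the target encoding cannot hold into character references.
    element = ET.Element("p")  # element name chosen only for the example
    element.text = unicode_text
    return ET.tostring(element, encoding=encoding)


xml_bytes = html_text_to_xml("caf&eacute; &amp; co &lt;b&gt;")
print(xml_bytes)

# Parsing the result gives back the same Unicode text.
print(ET.fromstring(xml_bytes).text)
```

With an ASCII target the accented character comes out as a character reference; with a wider target encoding the same code can emit it literally, and in neither case does the calling code need to know which.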