Re^2: UTF-8 entities in XML/HTML?

It does help, thanks! Your answer is relevant to my data migration phase, which I haven't tackled yet. At the moment I am trying to get the script working in a clean way. Phase two will be converting the old messages to UTF-8, where your hints will come in handy.

Since you have dealt with German characters before: I have heard there are three ways to encode them (at least in HTML). One is a named entity, like &auml;. Another is a one-byte numeric code in the range 128..255. The third is the two-byte form I am trying to achieve (as I thought it is the more generic way to encode things). Any idea which should be preferred?

Jot
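For reference, the three forms mentioned above can be put side by side. This is only a minimal sketch in Python (the thread does not show the actual script, so the language and the variable names are assumptions). It also illustrates that the numeric form is a reference to the Unicode code point, which for 128..255 happens to coincide with the ISO-8859-1 byte value, while the two-byte form is the UTF-8 encoding of the same character.

```python
import html

ch = "\u00e4"  # lowercase a-umlaut, U+00E4

# 1) Named entity (HTML only; not predefined in plain XML).
named = "&auml;"

# 2) Numeric character reference: the decimal Unicode code point.
numeric = f"&#{ord(ch)};"

# 3) The character itself, stored as UTF-8 (two bytes for this character).
utf8_bytes = ch.encode("utf-8")

# Both entity forms decode back to the same character.
assert html.unescape(named) == ch
assert html.unescape(numeric) == ch
assert utf8_bytes.decode("utf-8") == ch
```

All three describe the same character; they differ only in whether the document carries an ASCII-only reference or the encoded character itself.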


Replies are listed 'Best First'.
Re^3: UTF-8 entities in XML/HTML? by pat_mc (Pilgrim) on Sep 04, 2008 at 16:29 UTC
Hi, Jot - I was working with the SALSA corpus of syntactically and semantically annotated German newspaper sentences. The corpus follows the TIGER annotation standards. In the corpus, a UTF-8 encoded lowercase German a-umlaut ('ä'), for example, shows up as `Ã¤` when rendered as ISO-8859-1. I am not sure, however, which of the encoding variants you mention this corresponds to. Hope this helps anyway. Pat
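The `Ã¤` rendering can be reproduced with a few lines. This is a rough sketch in Python, not taken from the corpus tooling, and the variable names are made up for illustration. It suggests that what shows up is the two-byte UTF-8 form, with each byte read as a separate ISO-8859-1 character.

```python
ch = "\u00e4"  # lowercase a-umlaut, U+00E4

# The UTF-8 encoding of the character is the two bytes C3 A4.
utf8_bytes = ch.encode("utf-8")

# Reading those two bytes as ISO-8859-1 gives two separate characters:
# 0xC3 is A-tilde and 0xA4 is the currency sign.
misread = utf8_bytes.decode("iso-8859-1")

# The misreading is reversible as long as the bytes were not altered.
repaired = misread.encode("iso-8859-1").decode("utf-8")
```

If old messages contain such sequences, the last step is one way to recover the original text, provided the data really went through exactly this misreading.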
