in reply to Re: UTF-8 entities in XML/HTML?
in thread UTF-8 entities in XML/HTML?

It does help, thanks! Your answer is relevant to the data migration phase, which I haven't tackled yet. At the moment I am trying to get the script working cleanly. Phase two will be converting the old messages to UTF-8, and that is where your hints will come in handy.

Since you have worked with German entities before: I have heard there are three ways to encode them (at least in HTML). One is a named entity, like &auml;. Another is a one-byte numeric code in the range 128..255. The third is the two-byte form I am trying to produce (I thought it was the more generic way to encode things). Any idea which should be preferred?
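For concreteness, here is a minimal sketch of the three forms for the a-umlaut. Python is used purely for illustration (the script discussed in this thread is not shown), and the variable names are made up for the example. It only demonstrates that all three spellings denote the same character, U+00E4; it does not settle which one is preferable.

```python
import html

ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS (a-umlaut), code point 228

# 1. Named entity (defined by HTML).
named = "&auml;"

# 2. Numeric character reference: the number is the Unicode code point,
#    which for this character coincides with its ISO-8859-1 byte value.
numeric = f"&#{ord(ch)};"

# 3. The character itself, stored as UTF-8: two bytes for this character.
utf8_bytes = ch.encode("utf-8")

# All three decode back to the same character.
decoded_named = html.unescape(named)
decoded_numeric = html.unescape(numeric)
decoded_utf8 = utf8_bytes.decode("utf-8")

print(numeric, utf8_bytes, decoded_named == decoded_numeric == decoded_utf8)
```

One point that may matter for the XML side: XML itself predefines only five named entities (amp, lt, gt, apos, quot), so a name like auml needs a DTD to be valid there, whereas a numeric reference or the raw UTF-8 character works in both HTML and XML.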

Jot

Re^3: UTF-8 entities in XML/HTML?
by pat_mc (Pilgrim) on Sep 04, 2008 at 16:29 UTC
    Hi, Jot -

    I was working with the SALSA corpus of syntactically and semantically annotated German newspaper sentences. The corpus follows the TIGER annotation standards.

    In the corpus, a UTF-8 encoded lowercase German a-umlaut ('ä'), for example, shows up as 'Ã¤' when rendered as ISO-8859-1. I am not sure, however, which of the encoding variants you mention this corresponds to.
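
The effect of reading UTF-8 bytes as ISO-8859-1 can be reproduced with a short sketch. Python is used for illustration only, and the variable names are hypothetical. The sketch assumes the data really is valid UTF-8 that was merely decoded with the wrong charset.

```python
ch = "\u00e4"  # a-umlaut

# Encode as UTF-8: the character becomes the two bytes 0xC3 0xA4.
utf8_bytes = ch.encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps each byte to one character,
# giving U+00C3 (A with tilde) followed by U+00A4 (currency sign).
misread = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong step recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(utf8_bytes, [hex(ord(c)) for c in misread], repaired == ch)
```

Seen this way, the two-character rendering corresponds to the two-byte UTF-8 form of the character, not to a named or numeric entity.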

    Hope this helps anyway.

    Pat