in reply to Re: Re: Re: Re: Re: Re: regex for utf-8
in thread regex for utf-8
UTF-16 is the encoding that Windows functions taking "Unicode" expect. Well, almost... UCS-2 is 16 bits per code point, period. Full UTF-16 uses a block of 2048 reserved code points (the surrogates, U+D800 through U+DFFF) in pairs to represent values above U+FFFF.
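To make the pairing concrete, here is a minimal sketch of the surrogate arithmetic in Python. The function name `to_surrogate_pair` is my own, purely for illustration; the constants are the standard surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


# Cross-check against Python's own UTF-16 encoder (big endian, no BOM).
high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
encoded = chr(0x1D11E).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

There are 1024 high and 1024 low surrogates, which is where the 2048 comes from. Code that assumes UCS-2 will see such a character as two separate 16-bit units.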
LE is "little endian": the low-order byte of each 16-bit unit comes first. In my experience, Notepad doesn't work any other way.
BOM is the "byte order mark", also known as "zero-width no-break space", which is basically a no-op character. Its code is U+FEFF, and U+FFFE is guaranteed never to be a character. So read the first two bytes of the file, and you can tell whether it's little endian or big endian: FF FE means LE, FE FF means BE.
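The two-byte check can be sketched like this in Python. The function name `utf16_byte_order` is hypothetical, and the sketch assumes the data is UTF-16 if it has a BOM at all (a UTF-32 LE file also starts with FF FE, so this is not a general encoding detector).

```python
def utf16_byte_order(data):
    """Return 'little', 'big', or None based on a leading UTF-16 BOM."""
    head = data[:2]
    if head == b"\xff\xfe":   # U+FEFF stored low byte first
        return "little"
    if head == b"\xfe\xff":   # U+FEFF stored high byte first
        return "big"
    return None               # no BOM: the byte order has to be known some other way


sample = "\ufeffhello".encode("utf-16-le")
print(utf16_byte_order(sample))
```

On a real file you would pass in the first couple of bytes read in binary mode, for example `open(path, "rb").read(2)`.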
That character also has a particular encoding in UTF-8: the three bytes EF BB BF. That can be used as a signature to identify UTF-8 files, too.
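If you would rather have the machine figure it out, a short Python sketch does it. The helper name `has_utf8_signature` is made up for this example; `codecs.BOM_UTF8` is the standard library's constant for the same bytes.

```python
import codecs

# Encode U+FEFF as UTF-8 to get the three-byte signature.
UTF8_SIGNATURE = "\ufeff".encode("utf-8")
assert UTF8_SIGNATURE == b"\xef\xbb\xbf" == codecs.BOM_UTF8


def has_utf8_signature(data):
    """True if the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(UTF8_SIGNATURE)


print(has_utf8_signature(b"\xef\xbb\xbfhello"))
```

Keep in mind that the signature is optional: plenty of valid UTF-8 files don't start with it, so its absence proves nothing.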
Check out Unipad. It can save and load any format or variety. Playing with it might be enlightening.
Also check out the Unicode.org site.
—John
Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by jjohhn (Scribe) on Mar 01, 2003 at 02:10 UTC | |
by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 04:16 UTC | |
by jjohhn (Scribe) on Mar 01, 2003 at 05:32 UTC | |
by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 05:52 UTC | |
by jjohhn (Scribe) on Mar 02, 2003 at 19:48 UTC | |
by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 06:00 UTC |