Re: Re: Re: Re: Re: Re: regex for utf-8

Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 01:32 UTC
Must be a new Notepad. The Notepad I know from Windows NT only handles the current ANSI code page or "Unicode", which it saves as UTF-16LE with a BOM and CRLFs for line endings. In my experience, Notepad doesn't work any other way. UTF-16 is the form that Windows functions taking "Unicode" expect. Well, almost... UCS-2 is 16 bits per code point, period. Full UTF-16 uses a reserved group of 2048 special code points (the surrogates, U+D800 through U+DFFF) in pairs to represent values above U+FFFF. LE is "little endian". BOM is the "byte order mark", officially named "zero width no-break space", which is basically a no-op character. Its code is U+FEFF, and there is no character U+FFFE. So read the first two bytes of the file, and you can tell whether it's little endian (FF FE) or big endian (FE FF). That character also has a particular encoding in UTF-8 (the three bytes EF BB BF), which can be used as a signature to identify UTF-8 files, too. Check out Unipad. It can save and load any format or variety. Playing with it might be enlightening. Also check out the Unicode.org site. -- John
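For anyone who wants to poke at this, here is a minimal sketch of the two ideas above: sniffing a leading BOM and splitting a code point above U+FFFF into a surrogate pair. It is in Python purely for illustration; the function names `sniff_bom` and `to_surrogates` are made up for this example, and it assumes you only care about UTF-8 and UTF-16 (UTF-32 is left out on purpose).

```python
import codecs

# Signatures this sketch knows about. UTF-32 is deliberately ignored,
# so a UTF-32LE file (FF FE 00 00) would be misreported as UTF-16LE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def to_surrogates(cp: int):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low
```

A file with no BOM tells you nothing either way, so a `(None, 0)` result only means "no signature found", not "this is ANSI".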
Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn (Scribe) on Mar 01, 2003 at 02:10 UTC
Thanks, John. This whole thing (dealing with non-ASCII characters in a file) started as a little detail and is growing to consume me. I know a whole lot more about UTF-8 and code pages than I knew two days ago, but I still can't search for and count the non-unicode characters in the file. I'm trying to get a general solution, but the higher-ups are satisfied with "Pour it through MS Access and it will come out converted". We have an international product distributed as flat tab-delimited text files, and I don't think the MS Access pouring approach will work for everybody unless they are only using the Windows ANSI code page.
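One possible reading of the counting problem, sketched in Python for illustration only: treat the file as UTF-8, count the characters outside ASCII, and separately count the bytes that do not decode as UTF-8 at all (typically stray code-page bytes). The name `scan_utf8` is hypothetical, and the assumption that the data is meant to be UTF-8 is mine, not something the file itself can confirm.

```python
def scan_utf8(data: bytes):
    """Return (non_ascii_chars, bad_bytes) for data assumed to be UTF-8.

    Bytes that are not valid UTF-8 are mapped by the 'surrogateescape'
    error handler to lone surrogates U+DC80..U+DCFF, which can never
    come out of valid UTF-8, so they are easy to tell apart.
    """
    text = data.decode("utf-8", errors="surrogateescape")
    bad = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    non_ascii = sum(1 for ch in text if ord(ch) > 0x7F) - bad
    return non_ascii, bad
```

To use it on a real file, read it in binary mode, for example `scan_utf8(open(path, "rb").read())`, where `path` is whatever file you are checking. A non-zero second number suggests the file is not pure UTF-8 and was probably written in some code page instead.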
Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 04:16 UTC
What's a "non-unicode character" in a file? Perl has modules for extensive manipulation in this area, and Perl reads UTF-8 natively.
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn (Scribe) on Mar 01, 2003 at 05:32 UTC
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 05:52 UTC
Quick script? by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 06:00 UTC