See the UTF-8 and Unicode FAQ.
In reply to Re: Extracting MS Word text and encoding HTML entities by Joost in thread Extracting MS Word text and encoding HTML entities by wfsp