I have a large repository of documents that were authored in Microsoft Word and converted to XML.
My problem is that a lot of characters won't display properly: `back-ticks` look like this â~@~X and â~@~Y, and GB pound (£) signs look like this £ when viewed in vi, for instance.
I can get rid of most of my woes by setting the content type to UTF-8; however, in some cases these unwanted characters just display as a single ?
Any ideas how I can crunch through the files and get rid of the rest of the crap?
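
For illustration, here is a rough Python sketch of the kind of batch clean-up I have in mind. It rests on a few assumptions: that the files are a mix of UTF-8 and Windows-1252 (which would explain why some characters still show up as a single ?), that the curly quotes and dashes Word inserts can safely be swapped for plain ASCII, and that the files match `*.xml`. The names `decode_bytes`, `clean_text`, `clean_file` and `clean_tree` are made up for this example.

```python
from pathlib import Path

# Typographic characters Word tends to insert, mapped to plain ASCII.
# This table is an assumption; extend it to suit the actual documents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}
TRANSLATION = str.maketrans(REPLACEMENTS)


def decode_bytes(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are undefined in cp1252 become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


def clean_text(text: str) -> str:
    """Replace Word's typographic punctuation with ASCII equivalents."""
    return text.translate(TRANSLATION)


def clean_file(path: Path) -> None:
    """Rewrite one file in place as UTF-8 with the punctuation normalised."""
    cleaned = clean_text(decode_bytes(path.read_bytes()))
    path.write_bytes(cleaned.encode("utf-8"))


def clean_tree(root: str, pattern: str = "*.xml") -> None:
    """Apply clean_file to every matching file below root."""
    for path in Path(root).rglob(pattern):
        clean_file(path)
```

Something like `clean_tree("path/to/repository")` would then walk the whole tree; it overwrites files in place, so it should be tried on a copy first. Characters such as £ are left alone and simply end up as valid UTF-8. If the XML declarations name a different encoding, they would need updating to match.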