Cleaning Up Text Files

stew has asked for the wisdom of the Perl Monks concerning the following question:

I have a large repository of documents that were authored using M***%$$t Word and converted to XML.

My problem is that a lot of characters won't display properly: `back-ticks` look like this â~@~X and â~@~Y, and GB pound (£) signs look like this Â£ when viewed in vi, for instance.

I can get rid of most of my woes by setting the content type to UTF-8; however, in some cases these unwanted characters just display as a single ?

I can get rid of ^M no problem using

perl -pi.bak -e 's/^M//g' *.xml

but any ideas how I can crunch thru the files and get rid of the rest of the crap????


Re: Cleaning Up Text Files by thinker (Parson) on Oct 25, 2002 at 11:23 UTC
Hi stew, You might be interested in looking at the Demoroniser. The author claims: "This page describes, in Unix manual page style, a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications. The demoroniser keeps you from looking dumber than a bag of dirt when your Web page is viewed by a user on a non-Microsoft platform." Perhaps this will help you. cheers thinker
Re: Re: Cleaning Up Text Files by stew (Scribe) on Oct 25, 2002 at 13:16 UTC
That solves a few problems, cheers, but there is still a lot of weirdness going on.
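For the characters that remain, one possible approach is to decode the text and replace whatever is left. This is only a sketch, not something proposed in this thread. It assumes the files really are UTF-8, which is what the samples in the question appear to be: the bytes E2 80 98 and E2 80 99 encode the curly quotes U+2018 and U+2019, and C2 A3 encodes the pound sign. `asciify_xml_text` is a hypothetical helper name; `Encode` is a core module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns pure-ASCII text
# that is still safe to use as XML character data.
sub asciify_xml_text {
    my ($bytes) = @_;

    # Assumption: the input is valid UTF-8. Malformed sequences would be
    # replaced by U+FFFD here rather than repaired.
    my $text = decode('UTF-8', $bytes);

    # Typographic quotes inserted by the word processor -> plain ASCII quotes.
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;

    # Every other non-ASCII character (for example the pound sign, U+00A3)
    # becomes a numeric character reference.
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;

    return $text;
}

my $sample = "\xE2\x80\x98quoted\xE2\x80\x99 costs \xC2\xA3" . "5";
print asciify_xml_text($sample), "\n";   # prints: 'quoted' costs &#163;5
```

Numeric character references keep the pound sign intact for any XML parser while leaving the file itself plain ASCII, so it displays the same whatever encoding the viewer assumes.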
Re: Cleaning Up Text Files by kryberg (Pilgrim) on Oct 25, 2002 at 12:34 UTC
I too have some legacy Microsoft files that need to be cleaned up and I just haven't taken the time to sit down and do it yet. I was happy to see your simple code `perl -pi.bak -e 's/^M//g' *.xml` and I tried doing it as `perl -pi.bak -e 's/^M//g' testfile.html`. It created the backup file, but did not remove the ^M's. Are there any other tricks to this? Thanks.
Re: Re: Cleaning Up Text Files by roik (Scribe) on Oct 25, 2002 at 12:41 UTC
I think I have used \c instead of ^ in a regex in the past to do a similar thing. In your example: `perl -pi.bak -e 's/\cM//g' testfile.html`
Re: Re: Re: Cleaning Up Text Files by kryberg (Pilgrim) on Oct 25, 2002 at 12:51 UTC
Cool! It worked. That little script made my day. I even used `perl -pi.bak -e 's/\cM//g' *.html` and my files were cleaned up in a couple of seconds! Thanks for the help.
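The difference between the two one-liners can be sketched as follows. The likely explanation, assumed here, is that the first version was typed as a caret followed by a capital M rather than as a real control character (Ctrl-V Ctrl-M in most shells). The function names are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \cM is the control-M character, i.e. the carriage return, chr(13).
# This removes every carriage return from the string.
sub strip_cr {
    my ($text) = @_;
    (my $clean = $text) =~ s/\cM//g;
    return $clean;
}

# The pattern as it behaves when "^M" is two ordinary characters:
# ^ anchors at the start of the string and M is a literal letter,
# so only a leading capital M is removed and carriage returns stay.
sub strip_caret_m {
    my ($text) = @_;
    (my $clean = $text) =~ s/^M//g;
    return $clean;
}

my $line = "Mary had a little lamb\cM\n";
printf "strip_cr:      %d -> %d characters\n",
    length $line, length strip_cr($line);
printf "strip_caret_m: still has a carriage return? %s\n",
    strip_caret_m($line) =~ /\cM/ ? "yes" : "no";
```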
Re: Re: Cleaning Up Text Files by cored (Scribe) on Oct 25, 2002 at 12:44 UTC
Well, that code works fine for me, but you can try this one too and see what happens... `perl -i.bak -pne 's/^M//g' file` I only added the -n option.
Re: Re: Re: Cleaning Up Text Files by thelenm (Vicar) on Oct 25, 2002 at 15:46 UTC
But if you're already using -p, there's no need for -n as well. perlrun says, "A -p overrides an -n switch." So adding -n actually does nothing at all. -- Mike `-- just,my${.02}`
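As a rough illustration of why: perlrun documents -n as wrapping the code in a `while (<>) { ... }` loop, and -p as the same loop with a `continue { print }` block. The sketch below imitates both using in-memory filehandles, collecting the output in a string instead of printing it. `run_like_n` and `run_like_p` are hypothetical names, not part of Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Imitates: perl -ne 's/\cM//g'
# Each line is read and modified, but nothing is ever printed.
sub run_like_n {
    my ($input) = @_;
    my $output = '';
    open my $in, '<', \$input or die "cannot open in-memory handle: $!";
    while (<$in>) {
        s/\cM//g;
    }
    close $in;
    return $output;
}

# Imitates: perl -pe 's/\cM//g'
# Same loop, plus a continue block that emits each (modified) line.
sub run_like_p {
    my ($input) = @_;
    my $output = '';
    open my $in, '<', \$input or die "cannot open in-memory handle: $!";
    while (<$in>) {
        s/\cM//g;
    } continue {
        $output .= $_;    # stands in for the implicit print of -p
    }
    close $in;
    return $output;
}

my $text = "first\cM\nsecond\cM\n";
print "like -n produced ", length run_like_n($text), " characters\n";
print "like -p produced ", length run_like_p($text), " characters\n";
```

Since the -p loop already contains everything the -n loop does, asking for both gains nothing.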