stew has asked for the wisdom of the Perl Monks concerning the following question:

I have a large repository of documents that were authored using M***%$$t Word and converted to XML.

My problem is that there are a lot of characters that won't display properly: `back-ticks` look like â~@~X and â~@~Y, and GB pound (£) signs look like Â£ when viewed in vi, for instance.

I can get rid of most of my woes by setting the content type to UTF-8; however, in some cases these unwanted characters just display as a single ?

I can get rid of ^M no problem using

perl -pi.bak -e 's/^M//g' *.xml

but any ideas how I can crunch thru the files and get rid of the rest of the crap????

Re: Cleaning Up Text Files
by thinker (Parson) on Oct 25, 2002 at 11:23 UTC
    Hi stew,

    You might be interested in looking at the Demoroniser.


    The author claims

    This page describes, in Unix manual page style, a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications. The demoroniser keeps you from looking dumber than a bag of dirt when your Web page is viewed by a user on a non-Microsoft platform.

    Perhaps this will help you.

    cheers

    thinker
      That solves a few problems, cheers, but there is still a lot of weirdness going on.
Re: Cleaning Up Text Files
by kryberg (Pilgrim) on Oct 25, 2002 at 12:34 UTC
    I too have some legacy Microsoft files that need to be cleaned up and I just haven't taken the time to sit down and do it yet. I was happy to see your simple code
    perl -pi.bak -e 's/^M//g' *.xml
    and I tried doing it as
    perl -pi.bak -e 's/^M//g' testfile.html
    It created the backup file, but did not remove the ^M's. Are there any other tricks to this?

    Thanks.
      I think I have used \c instead of ^ in a regex in the past to do a similar thing. In your example:
      perl -pi.bak -e 's/\cM//g' testfile.html
        Cool! It worked. That little script made my day.

        I even used perl -pi.bak -e 's/\cM//g' *.html and my files were cleaned up in a couple seconds!

        Thanks for the help.
      Well, that code works fine for me, but you can try this one too and see what happens...
      perl -i.bak -pne 's/^M//g' file
      I only added the -n option
        But if you're already using -p, there's no need for -n as well. perlrun says, "A -p overrides an -n switch." So adding -n actually does nothing at all.

        -- Mike

        --
        just,my${.02}