the_sheriff has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of html files with a specific non-printing character I want to get rid of. For example, in the string "Children's" ... the apostrophe appears to be a multibyte character as shown below:
% grep "Children" index.html | od -c
0003060   C   h   i   l   d   r   e   n 342 200 231   s
For that file I can run the following command to fix it:
% tr -s '\342\200\231' \' < file.html
but since I have a ton of files scattered everywhere, I'd like to be able to do something like the following:
% perl -i -pe "s/\342\200\231/'/g" `find /home -name "*.html"`
...but no luck. I've read the part of perlfaq6 addressing this, and it seems to say you can do this by searching for the character as if it were separate bytes, like what I have above, but there's a good chance I'm misunderstanding. Has anybody done something similar, or does anyone have any suggestions?

Replies are listed 'Best First'.
Re: multibyte match works with tr but not in perl??
by belg4mit (Prior) on Jan 24, 2003 at 07:00 UTC
    That kinda looks like a Unicode character. If you're using perl 5.6 or greater, you must get away from thinking of files as a sequence of bytes and instead think of them as a sequence of characters. See perlunicode for more.

    --
    I'm not belgian but I play one on TV.
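
As a quick check of the "Unicode character" reading, the three octal bytes from the od output can be decoded as UTF-8. This is only a sketch: the variable names are made up for illustration, and it assumes the file really is UTF-8 encoded, which the thread does not confirm.

```perl
use strict;
use warnings;

# The three bytes od printed, written as octal escapes (a byte string).
my $bytes = "\342\200\231";

# Decode a copy in place as UTF-8; utf8::decode is built into perl.
my $char = $bytes;
utf8::decode($char) or die "not well-formed UTF-8\n";

printf "%d byte(s) -> %d character(s), U+%04X\n",
    length($bytes), length($char), ord($char);
# prints: 3 byte(s) -> 1 character(s), U+2019
```

U+2019 is RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe, so the character is printable; it is simply not ASCII.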

Re: multibyte match works with tr but not in perl??
by Enlil (Parson) on Jan 24, 2003 at 05:36 UTC
    The main problem, I think, is that tr/// and s/// do different things. They are two separate operators within Perl.

    Update: You can find more info here: perlop

    -enlil
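
To make the difference concrete, here is a small sketch comparing Perl's tr/// with s/// on the string from the question. The sample strings and variable names are illustrative assumptions. Note also that the question used the shell's tr program rather than Perl's tr///; the two are separate tools, though both translate byte by byte.

```perl
use strict;
use warnings;

my $in = "Children\342\200\231s";

# tr/// maps each listed byte individually (the short replacement list is
# padded with its last character); /s squeezes the resulting run into one.
(my $via_tr = $in) =~ tr/\342\200\231/'/s;

# s/// matches the three bytes only when they appear as a sequence.
(my $via_s = $in) =~ s/\342\200\231/'/g;

print "tr///: $via_tr\n";   # prints: tr///: Children's
print "s///:  $via_s\n";    # prints: s///:  Children's

# A lone \342 byte (for example a Latin-1 "a circumflex") shows the difference.
my $lone = "caf\342";
(my $lone_tr = $lone) =~ tr/\342\200\231/'/s;
(my $lone_s  = $lone) =~ s/\342\200\231/'/g;

print "tr/// changed the lone byte: ", ($lone_tr ne $lone ? "yes" : "no"), "\n";  # yes
print "s/// changed the lone byte:  ", ($lone_s  ne $lone ? "yes" : "no"), "\n";  # no
```

Both give the same result on this input, but tr/// would also rewrite any stray occurrence of one of the three bytes, while s/// only touches the complete sequence.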

Re: multibyte match works with tr but not in perl??
by skx (Parson) on Jan 24, 2003 at 14:40 UTC
    Could you not use the POSIX printable class?
    % perl -pi.bak -e "s/[[:^print:]]//g" `find /home -name "*.html"`
    Steve
    ---
    steve.org.uk
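
Before running the POSIX-class substitution across many files, it may help to see what it does to one sample line. The sketch below assumes the data is handled as plain bytes under perl's default (non-locale, non-Unicode) matching rules; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "Children\342\200\231s\n";

# [[:^print:]] matches every byte that is not printable ASCII here,
# which includes the three apostrophe bytes and also the newline.
(my $stripped = $line) =~ s/[[:^print:]]//g;

# A variant that spares the newline by listing it in a negated class.
(my $kept_nl = $line) =~ s/[^[:print:]\n]//g;

print "stripped ends with newline: ", ($stripped =~ /\n\z/ ? "yes" : "no"), "\n";  # no
print "variant ends with newline:  ", ($kept_nl  =~ /\n\z/ ? "yes" : "no"), "\n";  # yes
print $kept_nl;   # prints: Childrens
```

In both cases the apostrophe is deleted rather than replaced with ', and with -p the first form would also join every line of the file into one, so keeping the .bak copies from -pi.bak seems wise.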