Rodster001 has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

Sorry if this has been asked before. I looked through the Q&A and the web a little, but the problem is I don't really know what I am looking for (I'm not sure what these types of characters are called).

I have some files with some funkiness like this:

Is there a way to match these types of characters? I tried a few things. I thought they might be control characters, but neither \cK nor a \000-\037 range worked.

Can someone point me in the right direction? Thanks!

UPDATE: The solution I came up with was to convert the string to hex and then check whether the values fell outside a certain range (commonly used characters have lower values). Checking whether the hex value was > 150 helped me identify the problem areas and replace those characters.
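A minimal sketch of that kind of check, for anyone who wants to locate the problem spots first. It assumes the data is handled as raw bytes and uses 0x7F (the top of ASCII) as the cutoff rather than the 150 mentioned above; `non_ascii_bytes` is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, hex value] for every byte above 0x7F.
sub non_ascii_bytes {
    my ($bytes) = @_;
    my @hits;
    while ($bytes =~ /([\x80-\xFF])/g) {
        # pos() points just past the match, so step back one byte.
        push @hits, [ pos($bytes) - 1, sprintf('%02X', ord $1) ];
    }
    return @hits;
}

# "Windows" followed by the bytes 0xC2 0xAE (the UTF-8 form of U+00AE).
for my $hit (non_ascii_bytes("Windows\xC2\xAE Version 5.2")) {
    printf "offset %d: 0x%s\n", @$hit;
}
```

This only reports where the high bytes are; it does not say what they mean, which depends on the encoding of the file.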

Re: Matching  & € type characters with a regex
by ikegami (Patriarch) on Feb 12, 2009 at 18:30 UTC

    It seems you are displaying UTF-8 as iso-latin-1 or similar. "Â" is not a character here; "Â®" is the encoding of one character. Decode your encoded strings on input, and appropriately encode your decoded strings on output.
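A minimal sketch of the decode-on-input, encode-on-output idea using the core Encode module. It assumes the input really is UTF-8; `bytes_to_chars` and `chars_to_bytes` are made-up wrapper names, not anything from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw UTF-8 bytes into a Perl character string.
sub bytes_to_chars {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Hypothetical helper: turn a character string back into UTF-8 bytes.
sub chars_to_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

# "Windows" followed by 0xC2 0xAE, which is how UTF-8 stores U+00AE.
my $raw  = "Windows\xC2\xAE";
my $text = bytes_to_chars($raw);

printf "%d bytes in, %d characters after decoding\n",
    length($raw), length($text);
```

When reading from a file, the same decoding can be done by opening the handle with an I/O layer such as `'<:encoding(UTF-8)'`, and `binmode(STDOUT, ':encoding(UTF-8)')` takes care of encoding on the way out.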

      Ok. Sooooo... what do I do? I can go through and find all the characters and do s/Â//gsi for each character. Or is there an easier way to match these types of characters?
        First you have to understand what character encodings are, and how they are handled in Perl.

        I've written this article to explain that, and there's also a lot of other useful information: perluniintro, Encode, perlunicode.

        If you decode the input as I suggested, you won't have any "Â" or even "Â®", just the single character those bytes represent. There isn't anything to search and replace.
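To make that concrete, here is a small sketch that takes the registered sign (U+00AE), encodes it as UTF-8, and then reads the resulting bytes two ways. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+00AE REGISTERED SIGN as a single Perl character.
my $registered = "\x{AE}";

# Its UTF-8 encoding is the two bytes 0xC2 0xAE.
my $utf8_bytes = encode('UTF-8', $registered);

# Misreading those bytes as ISO-8859-1 gives two characters:
# U+00C2 (A with circumflex) followed by U+00AE.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Reading them as UTF-8 gives back the one intended character.
my $correct = decode('UTF-8', $utf8_bytes);

printf "bytes:      %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
printf "as Latin-1: %d characters (%s)\n", length($misread),
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
printf "as UTF-8:   %d character (%s)\n", length($correct),
    join ' ', map { sprintf 'U+%04X', ord } split //, $correct;
```

The stray "Â" is just the first byte (0xC2) of a two-byte sequence being shown on its own, which is why deleting it with s/// treats the symptom rather than the cause.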
Re: Matching  & € type characters with a regex
by graff (Chancellor) on Feb 13, 2009 at 05:20 UTC
    Just for grins, download the script I posted here a while ago: tlu -- TransLiterate Unicode. Run your "funky" data through that script and see what comes out.

    If you see stuff that looks like \x{02BC} or \x{2019} then what you have is utf8 text data with some "wide" characters in it, and your initial problem, as explained by ikegami, is that you aren't looking at it the right way or using the right tools to view it. The "tlu" script converts wide characters into their "literal" hex-numeric code-point form, using perl syntax by default.
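For readers without the script handy, a rough sketch of the same idea follows. This is not the real tlu, only an approximation of its default output; `show_wide_chars` is a made-up name and the input is assumed to be UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not tlu itself): decode UTF-8 bytes, then replace
# every non-ASCII character with its code point in Perl's \x{....} notation.
sub show_wide_chars {
    my ($bytes) = @_;
    # With Encode's default settings, malformed input is replaced by U+FFFD.
    my $text = decode('UTF-8', $bytes);
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%04x}', ord $1)/ge;
    return $text;
}

# 0xE2 0x80 0x99 is the UTF-8 encoding of U+2019 (right single quote).
print show_wide_chars("Operator\xE2\x80\x99s Manual"), "\n";
print show_wide_chars("Windows\xC2\xAE Version 5.2"), "\n";
```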

    Some of your wide characters will have ascii and (single-byte) Latin-1 equivalents (e.g. the apostrophe or right-single-quote mark or the copyright symbol), but some might not. By reading the data as utf8 (the way it's supposed to be read), there are lots of ways in perl to easily fix or remove them as you see fit.

      So, running those three lines through your script I got exactly what you expected:
      Course\x{fffd} Syllabus\x{fffd} Operator\x{2019}s Manual Windows\x{00ae} Version 5.2
      Now, I can think of several ways to go from here. My converting to hex (in my update above) worked but wasn't really ideal (the (R) mark would have been tossed).

      If your goal was to convert those three lines into this:

      Course Syllabus Operator's Manual Windows® Version 5.2
      What would you do next?

      Thanks for the help!

        There's a discussion on normalising accents in The Björk Situation.

        I use thundergnat's approach but note the limitations pointed out by other monks.

        Perhaps this could be adapted to suit your needs?

        The "fffd" characters are the Unicode "replacement character". You get it when something converts non-Unicode data into Unicode, assuming some particular character set, and comes across a byte value or byte sequence that shouldn't exist in that character set (and so cannot be mapped to a meaningful Unicode character).

        In this particular case, the data going into tlu was utf8 (based on the correct rendering of the "right-single-quote" and the symbol following "Windows"), but the characters after "Course" and "Syllabus" were already messed up before going into tlu.

        For things that aren't messed up, you either s/widechar/asciichar/g; (e.g. s/\x{2019}/'/g) or you tr/widechar//d (i.e. get rid of them). For fffd, it's probably best to just get rid of it, but maybe figure out what put it there in the first place.
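Putting those two operations together on the sample from this thread, as a sketch only. It assumes the text has already been decoded into characters; `tidy` is a made-up helper name, and keeping the registered sign as-is is a choice, not a requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: clean up an already-decoded character string.
sub tidy {
    my ($text) = @_;
    $text =~ s/\x{2019}/'/g;   # right single quote -> ASCII apostrophe
    $text =~ tr/\x{FFFD}//d;   # drop Unicode replacement characters
    return $text;
}

my $in  = "Course\x{FFFD} Syllabus\x{FFFD} Operator\x{2019}s Manual "
        . "Windows\x{AE} Version 5.2";
my $out = tidy($in);

# Encode on output so the registered sign is written as valid UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$out\n";
```

Other wide characters can be handled the same way, one s/// or tr/// per character you care about.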