in reply to Re: Remove all non alphanumeric characters excluding space, underscore and minus sign
in thread Remove all non alphanumeric characters excluding space, underscore and minus sign

Text::Unidecode can remove the accents while leaving the base character in place. First run the unidecode function on your string, then apply the regex.
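A minimal sketch of that two-step approach, assuming the input is already a Perl character string. The helper name clean_string and the sample text are made up for illustration; only unidecode comes from Text::Unidecode.

```perl
use strict;
use warnings;
use utf8;                 # the sample literal below contains non-ASCII characters
use Text::Unidecode;      # CPAN module, exports unidecode()

# Hypothetical helper, not from the thread.
sub clean_string {
    my ($text) = @_;
    my $ascii = unidecode($text);        # e.g. "é" becomes "e"
    $ascii =~ s/[^A-Za-z0-9 _-]//g;      # keep letters, digits, space, underscore, minus
    return $ascii;
}

print clean_string("Café déjà-vu, naïve_idea!"), "\n";   # prints "Cafe deja-vu naive_idea"
```

Running unidecode first matters: applied directly to the accented string, the regex would delete "é" outright instead of reducing it to "e".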

Update: fixed a typo. Thanks BrowserUK.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign
by BrowserUk (Patriarch) on Feb 13, 2012 at 13:49 UTC

    Will it work on extended ANSI code pages, or only on Unicode input?

    Plus, it might be better to tell the OP rather than me since he's the one looking for it.

    (ps. Is undidecoding, extracting that which makes Diddy men what they are, from their DNA? :)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      The docs of Text::Unidecode say: "It often happens that you have non-Roman text data in Unicode" and "If you get really implausible nonsense out of unidecode(...), make sure that the input data really is a utf8 string."

      So out of caution I add "use utf8;" to my scripts using Text::Unidecode.

      CountZero


        So out of caution I add "use utf8;" to my scripts

        That doesn't help if the text in question is coming from the keyboard or an ANSI-format file, does it?

        (Hence why I suggested that the OP might need to clarify the source of the data.)
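To illustrate the point about the data source: "use utf8;" only tells Perl that the script's own source text is UTF-8, so input arriving as bytes has to be decoded separately. The sketch below assumes the "ANSI" data is Windows-1252; the helper name ansi_to_ascii and the sample bytes are hypothetical, and the right encoding depends on where the OP's text actually comes from.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Hypothetical helper; the default encoding is an assumption about the data source.
sub ansi_to_ascii {
    my ($bytes, $encoding) = @_;
    $encoding //= 'cp1252';
    my $chars = decode($encoding, $bytes);   # raw bytes -> Perl character string
    my $ascii = unidecode($chars);           # transliterate to plain ASCII
    $ascii =~ s/[^A-Za-z0-9 _-]//g;          # keep letters, digits, space, underscore, minus
    return $ascii;
}

# "\x8A" is S-with-caron and "\xE8" is e-with-grave in Windows-1252.
print ansi_to_ascii("\x8Akoda cr\xE8me"), "\n";   # prints "Skoda creme"
```

When reading from a file, the same decoding can be done with an I/O layer instead, for example open my $fh, '<:encoding(cp1252)', $path.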



Re^3: Remove all non alphanumeric characters excluding space, underscore and minus sign
by ikkeniet (Acolyte) on Feb 13, 2012 at 13:49 UTC
    CountZero, thanks a lot! You made my day :-)