Removing certain non-word characters

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Removing certain non-word characters by davido (Cardinal) on Apr 26, 2004 at 07:01 UTC
The approaches that enumerate a-z and A-Z don't play nice with locales. For example, the Portuguese alphabet includes vowels carrying a tilde, acute, grave, or circumflex accent, and they don't fall within the range a-z. \w is locale-smart, but has the unfortunate disadvantage of also containing '_' (underscore). So if you were to use \w, you would have to figure out some way of using s/// to eliminate all \W characters except hyphen, space, and tick, plus eliminate underscore. That can get a little convoluted. The easiest solution might be to use a couple of regexes instead of just one. Another solution might be to match what you want and leave out the rest. A solution that I considered (and Zaxo also mentioned in the CB) is to use the oft-neglected POSIX character classes: `$string =~ s/[^[:alnum:]\s'-]//g;` Which says, "Substitute anything that is not alphanumeric, whitespace, tick, or hyphen, with nothing (i.e., just get rid of it)." Note that \s matches tabs and newlines as well as plain spaces. POSIX classes get along with locales, so if your code ever ended up getting run in an environment where `use locale;` is in effect, it shouldn't break. Dave
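For anyone who wants to try this, here is a minimal sketch of both ideas above: the single POSIX-class substitution and the "couple of regexes" variant built on \w. The sub names `strip_posix` and `strip_two_pass` and the sample string are made up for illustration; only the regexes come from the discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass with a POSIX class.
# Keeps alphanumerics, whitespace, tick, and hyphen; deletes the rest.
sub strip_posix {
    my ($string) = @_;
    $string =~ s/[^[:alnum:]\s'-]//g;
    return $string;
}

# Hypothetical helper: two passes with \w.
# Pass 1 removes everything that is not a word character, whitespace,
# tick, or hyphen. Pass 2 removes the underscore that \w let through.
sub strip_two_pass {
    my ($string) = @_;
    $string =~ s/[^\w\s'-]//g;
    $string =~ tr/_//d;
    return $string;
}

my $sample = q{O'Brien-Smith_42 (test)!};
print strip_posix($sample),    "\n";   # O'Brien-Smith42 test
print strip_two_pass($sample), "\n";   # O'Brien-Smith42 test
```

Both subs give the same result on this plain ASCII sample. How they treat accented characters depends on the locale and on whether the string is decoded as characters, so test with your own data.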
Re: Removing certain non-word characters by tkil (Monk) on Apr 26, 2004 at 06:38 UTC
Ask for what you want matched. In this case, you most likely want everything up to the next double-quote, yes? If so, something like this should work: `my ( $code ) = ( $line =~ /code="([^"]+)"/ );` The use (and abuse) of regexes to match HTML content has been beaten to death. If you want stronger results, consider using a module designed to parse HTML. This is also covered in the fantastic book Mastering Regular Expressions. Some cases to watch out for: `<!-- watch out for greedy matching --> <tag code="blah" attr="nothing"> <!-- and for less-than characters in attribute values (which is likely illegal, but HTML in the wild is notoriously nasty this way) --> <tag code="<bang!>"> <!-- finally, make sure you can handle multiple-line tags --> <tag foo="bar" code="nothing">`
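To see why the negated character class matters, here is a small sketch that runs a greedy pattern and the pattern above against the three cases. The sub names `greedy_code` and `quoted_code` and the sample strings are illustrative assumptions; the third sample assumes the whole multi-line tag has already been read into one string.

```perl
use strict;
use warnings;

# Hypothetical sample input, modeled on the cases listed above.
my @lines = (
    q{<tag code="blah" attr="nothing">},
    q{<tag code="<bang!>">},
    qq{<tag foo="bar"\n     code="nothing">},
);

# Greedy .+ runs on to the LAST double quote it can reach on the line.
sub greedy_code {
    my ($s) = @_;
    my ($c) = $s =~ /code="(.+)"/;
    return $c;
}

# The negated class stops at the FIRST closing double quote.
sub quoted_code {
    my ($s) = @_;
    my ($c) = $s =~ /code="([^"]+)"/;
    return $c;
}

for my $line (@lines) {
    printf "greedy: %-22s negated class: %s\n",
        greedy_code($line), quoted_code($line);
}
```

On the first sample the greedy version captures `blah" attr="nothing`, while the negated class captures just `blah`. Neither pattern copes with single-quoted or unquoted attribute values, which is one more reason to reach for a real HTML parser.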
Re: Removing certain non-word characters by Zed_Lopez (Chaplain) on Apr 26, 2004 at 05:53 UTC
For the first: `s/[^a-zA-Z0-9'\x20]//g;` For the second, please give an example. I suspect you're running into problems with the match being greedy.
Re: Re: Removing certain non-word characters by Anonymous Monk on Apr 26, 2004 at 06:20 UTC
I'm collecting meta tag information. I want to get the keywords section out, but just the keywords, not the meta tag itself. `<meta name="keywords" content="one,two,three,four">` And I want to match anything inside of content="" but nothing else. I tried your s/// but that doesn't work; it actually made the regex a little worse. If I can get it to capture the information I want, then the rest of the problems should go away. Thanks.
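One way to approach this is to capture the content attribute rather than substitute around it. The following is only a sketch under some loud assumptions: `name` comes before `content`, both values are double-quoted, and the tag looks like the example above. The sub name `meta_keywords` is made up for illustration. Real-world HTML varies enough that a proper HTML parsing module is the safer route.

```perl
use strict;
use warnings;

# Hypothetical helper: return the keywords from a meta keywords tag.
# Assumes name="keywords" precedes content="..." and both use double quotes.
sub meta_keywords {
    my ($html) = @_;

    # Capture everything between the quotes of content="...".
    # /i ignores case; \s+ also allows a line break between attributes.
    my ($content) = $html =~ /<meta\s+name="keywords"\s+content="([^"]*)"/i;
    return unless defined $content;

    # Trim the ends, then split on commas with optional surrounding spaces.
    $content =~ s/^\s+|\s+$//g;
    return grep { length } split /\s*,\s*/, $content;
}

my @keywords = meta_keywords('<meta name="keywords" content="one,two,three,four">');
print join("\n", @keywords), "\n";
```

For the example tag this prints one, two, three, and four on separate lines, without any of the surrounding markup.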
Re: Removing certain non-word characters by Anomynous Monk (Scribe) on Apr 26, 2004 at 06:03 UTC
`$var =~ tr/ 'A-Za-z0-9//cd;`