The approaches that enumerate a-z and A-Z don't play nice with locales. For example, in the Portuguese character set, you will have vowels with ~, ', `, and ^ over them as part of the alphabet, and they don't fall within the range of a-z.
\w is locale-smart, but has the unfortunate disadvantage of also containing '_' (underscore). So if you were to use \w, you would have to figure out some way of using s/// to eliminate all \W characters except hyphen, space, and tick, plus eliminate underscore. That can get a little convoluted.
The easiest solution might be to use a couple of regexes instead of just one. Another solution might be to match what you want and leave out the rest. A solution that I considered (and Zaxo also mentioned in the CB) is to use the oft-neglected POSIX character classes:
$string =~ s/[^[:alnum:]\s'-]//g;
Which says, "Substitute anything that is not alphanumeric, whitespace, tick, or hyphen with nothing (i.e., just get rid of it)." Note that \s covers tabs and newlines as well as the plain space.
POSIX classes get along with locales, so if your code ever ends up running in an environment where use locale; is in effect, it shouldn't break.
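Here is a small sketch of that substitution in action. The sub name strip_unwanted and the sample string are made up for illustration; only the regex comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the POSIX-class substitution shown above.
sub strip_unwanted {
    my ($string) = @_;
    # Remove everything that is not alphanumeric, whitespace, tick, or hyphen.
    # The hyphen sits last in the class so it is taken literally.
    $string =~ s/[^[:alnum:]\s'-]//g;
    return $string;
}

print strip_unwanted("Hello, World_1! It's re-usable."), "\n";
# prints: Hello World1 It's re-usable
```

Note that the underscore goes away here, which is the thing \w would not have done for you.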
Ask for what you want matched. In this case, you most
likely want everything up to the next double-quote, yes?
If so, something like this should work:
my ( $code ) = ( $line =~ /code="([^"]+)"/ );
The use (and abuse) of regexes to match HTML
content has been beaten to death. If you want stronger
results, consider using a module designed to parse
HTML. This is also covered in the fantastic book
Mastering Regular Expressions.
Some cases to watch out for:
<!-- watch out for greedy matching -->
<tag code="blah" attr="nothing">
<!-- and for less-than characters in attribute values
(which is likely illegal, but HTML in the wild
is notoriously nasty this way) -->
<tag code="<bang!>">
<!-- finally, make sure you can handle multiple-line tags -->
<tag foo="bar"
code="nothing">
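To see how the negated character class behaves on those cases, something like the following rough sketch can be run. The sub names extract_code and extract_code_greedy are invented for this example, and the greedy version is there only to show the failure mode.

```perl
use strict;
use warnings;

# Hypothetical helper: value of the first code="..." attribute, or undef.
sub extract_code {
    my ($text) = @_;
    my ($code) = $text =~ /code="([^"]+)"/;
    return $code;
}

# Greedy variant, included only to illustrate what goes wrong.
sub extract_code_greedy {
    my ($text) = @_;
    my ($code) = $text =~ /code="(.+)"/;
    return $code;
}

my @cases = (
    '<tag code="blah" attr="nothing">',   # a second attribute follows
    '<tag code="<bang!>">',               # angle brackets inside the value
    qq{<tag foo="bar"\ncode="nothing">},  # tag split across two lines
);

for my $tag (@cases) {
    my $code = extract_code($tag);
    print defined $code ? "code: $code\n" : "no match\n";
}

# The greedy pattern runs on to the last double-quote in the string.
print "greedy: ", extract_code_greedy($cases[0]), "\n";
```

The three cases come back as blah, <bang!>, and nothing, while the greedy pattern returns blah" attr="nothing for the first one. The multi-line case assumes the whole tag is in a single string; it says nothing about attribute values that themselves contain newlines or escaped quotes, which is where a real HTML parser earns its keep.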
s/[^a-zA-Z0-9'\x20]//g;
With the second, please give an example. I suspect you're running into problems with the match being greedy.
I'm collecting meta tag information. I want to get the keywords section out, but just the keywords, not the meta tag itself.
<meta name="keywords" content="one,two,three,four">
And I want to match anything inside of content="" but nothing else. I tried your s/// but that doesn't work; it actually made the regex a little worse. If I can get it to return the information I want, then the rest of the problems should go away. Thanks.
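One possible way to pull out just the keywords is to capture the attribute value instead of substituting things away. This is a sketch, not a general HTML parser: the sub name meta_keywords is invented, and it assumes the attributes are double-quoted and appear in the order name, then content, as in the sample tag.

```perl
use strict;
use warnings;

# Hypothetical helper: list of keywords from a keywords meta tag.
# Assumes name="keywords" comes before content="..." and both use double quotes.
sub meta_keywords {
    my ($html) = @_;
    my ($content) = $html =~ /<meta\s+name="keywords"\s+content="([^"]*)"/i;
    return () unless defined $content;

    # Trim the ends, split on commas, and drop any empty entries.
    $content =~ s/^\s+|\s+$//g;
    return grep { length } split /\s*,\s*/, $content;
}

my @keywords = meta_keywords('<meta name="keywords" content="one,two,three,four">');
print join('|', @keywords), "\n";
# prints: one|two|three|four
```

If the pages you are collecting from put content before name, use single quotes, or split the tag in odd places, a module designed to parse HTML is likely to be the safer route.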
$var =~ tr/ 'A-Za-z0-9//cd;
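For anyone unfamiliar with those flags: /c complements the search list and /d deletes whatever matched, so the tr above deletes every character that is not a space, a tick, or an ASCII letter or digit. A quick check against the equivalent s/// (the variable names and sample string are just for illustration):

```perl
use strict;
use warnings;

my $input = "Hello, World! It's 2 late_";

# tr with /c (complement) and /d (delete): remove everything NOT in the list.
(my $via_tr = $input) =~ tr/ 'A-Za-z0-9//cd;

# The same filter written as a substitution with a negated character class.
(my $via_s = $input) =~ s/[^ 'A-Za-z0-9]//g;

print "$via_tr\n";
# prints: Hello World It's 2 late
print $via_tr eq $via_s ? "same result\n" : "different result\n";
```

Like the a-z approaches discussed earlier in the thread, this one is limited to ASCII letters, so accented characters get deleted too.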