Something strange in the world or Regexes

mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Something strange in the world or Regexes by Corion (Patriarch) on Sep 30, 2009 at 09:52 UTC
Maybe something went wrong when pasting your code, because the following "hex dumper" shows me a `\x20` from a line pasted: `perl -lne "printf qq([%s] => %02x\n), $_, ord $_ for split //"` Pasting `mmu-miR-150 ` as input prints: `[m] => 6d [m] => 6d [u] => 75 [-] => 2d [m] => 6d [i] => 69 [R] => 52 [-] => 2d [1] => 31 [5] => 35 [0] => 30 [ ] => 20`
Re^2: Something strange in the world or Regexes by mrguy123 (Hermit) on Sep 30, 2009 at 09:56 UTC
It makes sense that by copying and pasting the code I got rid of the "non core" character. Can you explain what you did with the "hex dumper"? Thanks, mrguy
Re^3: Something strange in the world or Regexes by Corion (Patriarch) on Sep 30, 2009 at 10:04 UTC
Something I find quite helpful with one-liners is B::Deparse: `>perl -MO=Deparse -lne "printf qq([%s] => %02x\n), $_, ord $_ for split //"` prints `BEGIN { $/ = "\n"; $\ = "\n"; } LINE: while (defined($_ = <ARGV>)) { chomp $_; printf "[%s] => %02x\n", $_, ord $_ foreach (split(//, $_, 0)); }` So that program goes through the input line by line, splits each line between every byte, and then feeds each byte to printf twice: once as a character and once as its numeric code.
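The one-liner above can also be written as a small script, which may be easier to read. This is only a sketch of the same idea; the sub name `hex_dump_chars` is made up for illustration and is not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return one "[char] => hex" string for each
# character of the line, mirroring the printf in the one-liner.
sub hex_dump_chars {
    my ($line) = @_;
    return map { sprintf '[%s] => %02x', $_, ord $_ } split //, $line;
}

# The trailing space in the sample shows up as 20.
print "$_\n" for hex_dump_chars("mmu-miR-150 ");
```

Fed a line that has not been decoded, `split //` yields one element per byte, so a UTF-8 encoded non-breaking space would appear as two entries (c2 and a0) rather than one.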
Re: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 10:52 UTC
You need to decode the bytes into characters before treating them as characters: `$ perl -e'print "\xc2\xa0"' | perl -le' binmode STDIN, ":encoding(UTF-8)"; chomp( $_ = <STDIN> ); print /\s/ ? "space" : "no space"; '` prints `space`.
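The same decoding layer can be tried without a shell pipe by opening an in-memory filehandle on the raw bytes. A minimal sketch, assuming the input really is UTF-8; the sub name `first_line_has_space` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line from a string of raw octets
# through a UTF-8 decoding layer and report whether it contains
# anything that \s matches.
sub first_line_has_space {
    my ($octets) = @_;
    open my $fh, '<:encoding(UTF-8)', \$octets
        or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    return 0 unless defined $line;
    chomp $line;
    return $line =~ /\s/ ? 1 : 0;
}

print first_line_has_space("mmu-miR-704\xC2\xA0\n") ? "space\n" : "no space\n";
```

Because the line comes back decoded, the two bytes `C2 A0` arrive as the single character U+00A0, which is what lets `\s` see it.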
Re^2: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 11:27 UTC
Interesting observation (to me at least): The single line of `binmode STDIN, ":encoding(X)"` does extend `\s` for both X = LATIN1 and UTF8 to include the non-breakable space. Without it, regardless of the LANG and LC_* variables set in the shell, you have the old semantics for `\s`. Not (yet?) mentioned in perldoc -f binmode, but OTOH it mentions a nice way to flush STDIN. Looks like binmode and PerlIO got way more interesting in the meantime :).
Re^3: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 17:30 UTC
Good catch! You caught me giving the brief answer. Let's start by demonstrating what you describe: `$ perl -MEncode -le' $_ = decode("UTF-8", encode("UTF-8", "\xA0")); print /\s/ ? "space" : "no space"; $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0")); print /\s/ ? "space" : "no space"; $_ = "\xA0"; print /\s/ ? "space" : "no space"; $_ = "\xA0\x{2660}"; print /^\s/ ? "space" : "no space"; '` prints `space`, `space`, `no space`, `space`.

Regex matching follows one of two sets of rules: "byte semantics" and "unicode semantics".[1] The set of rules used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[2] By default, strings are internally encoded as iso-latin-1 if possible.[3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.

Under byte semantics, `\s` matches whitespace in the ASCII range only. Under unicode semantics, `\s` matches anything Unicode considers whitespace[4], which includes NBSP (U+00A0). The internal encoding of a string can be manipulated using `utf8::upgrade` and `utf8::downgrade`.

[1] This post doesn't discuss the effects of `use locale`, if any.
[2] Expect (backwards compatible) changes in this area in 5.12.
[3] This post doesn't discuss the effects of (broken) `use encoding`, if any.
[4] There are bugs in many properties, but I don't think `\s` has any errors. These are being fixed for 5.12.
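The effect of the internal encoding can be seen in isolation by upgrading and downgrading the same one-character string. This is a sketch under two assumptions: the `unicode_strings` feature is not enabled and the pattern carries no `/u` or `/a` flag, since either would override the default rules described above. The sub name `nbsp_matches_space` is hypothetical.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no /u or /a flag:
# the sketch relies on the default matching rules.
# Hypothetical helper: store U+00A0 in the requested internal encoding
# and report whether \s matches it.
sub nbsp_matches_space {
    my ($upgrade) = @_;
    my $s = "\xA0";
    if ($upgrade) {
        utf8::upgrade($s);      # internal utf8: unicode semantics
    }
    else {
        utf8::downgrade($s);    # internal iso-latin-1: byte semantics
    }
    return $s =~ /\s/ ? 1 : 0;
}

printf "downgraded: %s\n", nbsp_matches_space(0) ? "space" : "no space";
printf "upgraded:   %s\n", nbsp_matches_space(1) ? "space" : "no space";
```

Both strings hold the same character and compare equal with `eq`; only the internal storage differs, which is why the different match results tend to surprise people.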
Re^4: LC_*: Something horrible in the world of Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 17:54 UTC
Re^5: Something horrible in the world of Regexes - attack of the posix zombies by ikegami (Patriarch) on Sep 30, 2009 at 18:14 UTC
Re: Something strange in the world or Regexes by SilasTheMonk (Chaplain) on Sep 30, 2009 at 09:49 UTC
Given that this has ultimately come from a web page, I suspect that it might be some sort of non-core character. What I would suggest is that you work out which characters are allowable, write a regular expression like `/^([\w\-]+)/`, and use $1 to extract the string of interest. Ideally your regular expression should match as closely as possible what is allowable. I doubt that anything malicious is going on here, but in general suspicious characters entered via web pages are a common form of attack on the internet. Taint mode (-T) is the usual defence in the Perl world, and you may want to read up on that even if you decide it is overkill in your case.
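The extraction suggested above might look like the following sketch. It swaps `\w` for an explicit ASCII class so the result does not depend on which matching semantics are in effect; that substitution and the sub name `extract_name` are assumptions, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the leading run of allowed characters
# (ASCII letters, digits and hyphen); return undef if there is none.
sub extract_name {
    my ($raw) = @_;
    return $raw =~ /^([A-Za-z0-9\-]+)/ ? $1 : undef;
}

my $name = extract_name("mmu-miR-704\xA0\r\n");
print defined $name ? "$name\n" : "no usable name\n";
```

Anything after the first disallowed character is dropped, so a trailing non-breaking space, carriage return or newline never reaches the rest of the program.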
Re^2: Something strange in the world or Regexes by mrguy123 (Hermit) on Sep 30, 2009 at 10:11 UTC
That's pretty much what I did. However, I would like to know what's going on with this non-core character.
Re: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 09:52 UTC
Try to post a link to the real input. Downloading it, I see no space at the line ends at all: `00000000 6d 6d 75 2d 6d 69 52 2d 37 30 34 a0 0d 0a 6d 6d |mmu-miR-704...mm|` Which is slightly funny alright (A0+CRLF; thx to ikegami below), but possibly an artefact of the site markup + pasting. And probably NOT what you're using. (If I just paste from the node, I also see a trailing 0x20 space, like Corion.) Also to check: how do you get the input? Does $temp contain any LF or CR line endings? Update: also to check: what is the character encoding of the stuff you get? `cat -vet`, `hd` or `od -x` might help in figuring things out (the last two being examples of those "hex dumpers" on Unix or in Cygwin; `xxd` is probably also widely available and is used for vim's pseudo hex mode).
Re^2: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 11:00 UTC
A0+CRLF, not 0A+CRLF
Re^2: Something strange in the world or Regexes by mrguy123 (Hermit) on Sep 30, 2009 at 10:04 UTC
Good idea! The link is here. As you can now see, there is a weird character at the end of each line. It seems we now know what the problem is. The only question is, how did it get into the input and how can I regex it away?
Re^3: Something strange in the world or Regexes by almut (Canon) on Sep 30, 2009 at 10:22 UTC
It's the UTF-8 encoding (`0xC2 0xA0`) of the non-breaking space (which is not included in the "whitespace" set of chars¹, thus your regex didn't match). ¹ Update: at least not the iso-latin-1 encoding of the character, i.e. `0xA0` (for backwards compatibility, Perl assumes iso-latin-1 by default): `print "\xa0" =~ /\s/ ? "space" : "no space"; # no space` But see below. Apparently, the `0xc2` part ("Â") somehow got lost in your case; simply (incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters.
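As for "how can I regex it away": one option is to decode the line first and then strip trailing whitespace, naming U+00A0 explicitly so the substitution works regardless of which matching rules apply. A minimal sketch, assuming the input is UTF-8; the sub name `clean_line` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 octets into characters, then remove
# trailing whitespace, including NBSP and CR/LF line endings.
sub clean_line {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);
    $text =~ s/[\s\x{A0}]+\z//;
    return $text;
}

print clean_line("mmu-miR-704\xC2\xA0\r\n"), "\n";
```

Listing `\x{A0}` in the character class is redundant once the string has been decoded, but it keeps the substitution correct if the same code is later applied to an undecoded iso-latin-1 string.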
Re^4: Something strange in the world or Regexes by JavaFan (Canon) on Sep 30, 2009 at 11:38 UTC
Re^5: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 18:48 UTC
Re^5: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 11:47 UTC
Re^3: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 10:32 UTC
`00000000 6d 6d 75 2d 6d 69 52 2d 37 30 34 c2 a0 0d 0a 6d |mmu-miR-704....m|` 0xa0 is a non-breaking space in e.g. latin1. c2 would be LATIN CAPITAL LETTER A WITH CIRCUMFLEX, assuming latin1. Some PC charsets use chars in that region for e.g. DOS-style line drawing. Badly done pasting might have added these chars? Update: just checked UTF-8: almut is correct. It looks like you have submissions in UTF-8 which accidentally use the wrong space char. Probably the submitter is preparing the file in Word or something similarly unsuitable. One sane approach is whitelisting as already suggested by Silas, e.g. just stripping everything that is not alphanumeric or a minus with e.g. `s![^a-z0-9\-]!!gio`. Note that this will also eat up spaces and line ends in $_. This works because we stick to the common subset of ASCII, which is also valid for submissions in UTF-8 and latin1. If you also see other charsets, things like GNU recode might help if enlightening submitters fails.
Re: Something strange in the world or Regexes by Your Mother (Archbishop) on Sep 30, 2009 at 17:10 UTC
I would further recommend a general adjustment of how you approach it. You are doing blacklist filtering, i.e., "I don't want `\s`." Whitelisting is safer and makes it harder to violate expectations. The following is a stab at it from your sample data. `$temp =~ /\b(mmu-miR-\d\d[ab\d])\b/; print "$1 is all I want\n";` Like ikegami said, encoding issues might need to be handled first.