Re: Something strange in the world or Regexes

Try to post a link to the real input. Downloading it I see no space at the line ends at all:

00000000 6d 6d 75 2d 6d 69 52 2d 37 30 34 a0 0d 0a 6d 6d |mmu-miR-704...mm|

Which is slightly funny alright (A0+CRLF; thx to ikegami below), but possibly an artefact of the site markup + pasting. And probably NOT what you're using. (If I just paste from the node, I also see a trailing 0x20 space, like Corion.)

Also to check: how do you get the input ($temp)? Does it contain any LF or CR line endings?

Update: also check the character encoding of the stuff you get. `cat -vet`, `hd` or `od -x` might help in figuring things out (the last two are examples of those "hex dumpers" on Unix or in Cygwin; `xxd` is probably also widely available and is used for vim's pseudo hex mode).
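If no hex dumper is at hand, the same check can be done from within Perl. This is only a minimal sketch: `hex_bytes` is a made-up helper name, and the sample string is rebuilt by hand from the dump above rather than read from the real input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): show every byte as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Sample line rebuilt from the dump above: "mmu-miR-704", 0xA0, CR, LF.
my $sample = "mmu-miR-704\xa0\x0d\x0a";
print hex_bytes($sample), "\n";
# prints: 6d 6d 75 2d 6d 69 52 2d 37 30 34 a0 0d 0a
```

When dumping lines read from a file, opening it with the `:raw` layer keeps Perl from translating line ends or decoding anything, so you see the bytes as they are stored.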


Re^2: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 11:00 UTC
A0+CRLF, not 0A+CRLF
Re^2: Something strange in the world or Regexes by mrguy123 (Hermit) on Sep 30, 2009 at 10:04 UTC
Good idea! The link is here. As you can now see, there is a weird character at the end of each line. It seems we now know what the problem is. The only question is: how did it get into the input, and how can I regex it away?
Re^3: Something strange in the world or Regexes by almut (Canon) on Sep 30, 2009 at 10:22 UTC
It's the UTF-8 encoding (`0xC2 0xA0`) of the non-breaking space (which is not included in the "whitespace" set of chars¹, thus your regex didn't match). ___ ¹ Update: at least not the iso-latin-1 encoding of the character, i.e. `0xA0` (for backwards compatibility, Perl assumes iso-latin-1 by default): `print "\xa0" =~ /\s/ ? "space" : "no space"; # no space` But see below. Apparently, the `0xc2` part ("Â") somehow got lost in your case; simply (incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters.
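To make the one-versus-two-characters point concrete, here is a small sketch using the core Encode module. `code_points` is a hypothetical helper name, and the two input bytes are typed in by hand rather than taken from the real file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes as $encoding and list the resulting
# code points in hex, one entry per character.
sub code_points {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);
    return map { sprintf 'U+%04X', ord($_) } split //, $text;
}

my $raw = "\xc2\xa0";    # the two bytes in question

# Decoded as UTF-8, the pair is a single NO-BREAK SPACE.
print join(' ', code_points('UTF-8', $raw)), "\n";         # U+00A0

# Misread as iso-latin-1, it is two characters: "Â" plus the NBSP.
print join(' ', code_points('ISO-8859-1', $raw)), "\n";    # U+00C2 U+00A0
```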
Re^4: Something strange in the world or Regexes by JavaFan (Canon) on Sep 30, 2009 at 11:38 UTC
If one has 5.10 or later, one can use `/\h/`, which will match a non-breaking space regardless of whether the string is encoded in UTF-8 or not.
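A rough sketch of how `\h` could be used to clean up the line ends. `strip_line_end` is a made-up name, and decoding the input as UTF-8 first is an assumption about how the data arrives, not something confirmed in this thread.

```perl
use strict;
use warnings;
use 5.010;    # \h needs Perl 5.10 or later
use Encode qw(decode);

# Hypothetical helper: drop trailing horizontal whitespace (including the
# non-breaking space) together with any CR/LF at the end of the line.
sub strip_line_end {
    my ($line) = @_;
    $line =~ s/[\h\r\n]+\z//;
    return $line;
}

# A latin1-style line ending in the single byte 0xA0.
print strip_line_end("mmu-miR-704\xa0\r\n"), "\n";

# UTF-8 input: decode the bytes to characters first, then strip.
print strip_line_end(decode('UTF-8', "mmu-miR-704\xc2\xa0\r\n")), "\n";
```

Note that on raw, undecoded UTF-8 bytes the substitution only removes the `0xA0` byte and leaves the `0xC2` behind, so the decoding step matters.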
Re^5: Something strange in the world or Regexes by ikegami (Patriarch) on Sep 30, 2009 at 18:48 UTC
Re^5: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 11:47 UTC
Re^6: Something strange in the world or Regexes by JavaFan (Canon) on Sep 30, 2009 at 14:08 UTC
Re^3: Something strange in the world or Regexes by jakobi (Pilgrim) on Sep 30, 2009 at 10:32 UTC
`00000000 6d 6d 75 2d 6d 69 52 2d 37 30 34 c2 a0 0d 0a 6d |mmu-miR-704....m|` 0xa0 is a non-breaking space in e.g. latin1. c2 would be LATIN CAPITAL LETTER A WITH CIRCUMFLEX assuming latin1. Some PC charsets use chars in that region for e.g. DOS-style line drawing. Badly done pasting might have added these chars? Update: just checked UTF-8, and almut is correct: it looks like you have submissions in UTF-8 which accidentally use the wrong space char. Probably the submitter is preparing the file in Word or something similarly unsuitable. One sane approach is whitelisting, as already suggested by Silas, e.g. just stripping everything that is neither alphanumeric nor a minus with e.g. `s![^a-z0-9\-]!!gio`. Note that this will also eat up spaces and line ends in $_. This works because we stick to the common subset of ASCII, which is also valid for submissions in UTF-8 and latin1. If you also see other charsets, things like GNU recode might help if enlightening the submitters fails.
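The whitelisting approach above, wrapped in a small sub as a sketch. `clean_id` is a hypothetical name, the sample lines are made up from the dumps in this thread, and the `/o` flag is left out here since the pattern interpolates no variables.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only ASCII letters, digits and minus signs.
# Everything else goes, including spaces, NBSP bytes and line ends.
sub clean_id {
    my ($line) = @_;
    $line =~ s/[^a-z0-9\-]//gi;
    return $line;
}

# The same identifier with a UTF-8 NBSP, a latin1 NBSP and a plain space.
for my $line ("mmu-miR-704\xc2\xa0\r\n", "mmu-miR-704\xa0\r\n", "mmu-miR-704 \n") {
    print clean_id($line), "\n";    # prints mmu-miR-704 each time
}
```

Since the line end is removed as well, apply this to one identifier at a time rather than to a buffer holding several lines, or the identifiers will run together.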