in reply to Something strange in the world or Regexes

Try to post a link to the real input. Downloading it I see no space at the line ends at all:

00000000  6d 6d 75 2d 6d 69 52 2d  37 30 34 a0 0d 0a 6d 6d  |mmu-miR-704...mm|

Which is slightly funny alright (A0+CRLF; thx to ikegami below), but possibly an artefact of the site markup + pasting. And probably NOT what you're using. (If I just paste from the node, I also see a trailing 0x20 space, like Corion.)

Also to check: how do you get the input ($temp)? Does it contain any LF or CR line endings?

Update: also check the character encoding of what you get. cat -vet, hd or od -x might help in figuring things out (the last two are examples of "hex dumpers" on Unix or in Cygwin; xxd is probably also widely available and is used for vim's pseudo hex mode).
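If none of those tools is at hand, a few lines of Perl can make the odd bytes visible as well. This is only a minimal sketch: the helper name show_odd_bytes is made up for illustration, and the sample string simply repeats the bytes from the dump above.

```perl
use strict;
use warnings;

# Hypothetical helper: make every byte outside printable ASCII visible
# as a \xHH escape, so stray bytes at the line end stand out.
sub show_odd_bytes {
    my ($line) = @_;
    (my $visible = $line) =~ s/([^\x20-\x7e])/sprintf '\\x%02X', ord $1/ge;
    return $visible;
}

# Sample line with the bytes from the dump above: text, 0xA0, CR, LF.
my $sample = "mmu-miR-704\xa0\x0d\x0a";
print show_odd_bytes($sample), "\n";   # prints mmu-miR-704\xA0\x0D\x0A
```

When reading the real file, open it with the '<:raw' layer so that Perl hands over the bytes untouched.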

Re^2: Something strange in the world or Regexes
by ikegami (Patriarch) on Sep 30, 2009 at 11:00 UTC

    A0+CRLF, not 0A+CRLF

Re^2: Something strange in the world or Regexes
by mrguy123 (Hermit) on Sep 30, 2009 at 10:04 UTC
    Good idea!
    The link is here

    As you can now see, there is a weird character at the end of each line. It seems we now know what the problem is.
    Only question is, how did it get into the input and how can I regex it away?

      It's the UTF-8 encoding (0xC2 0xA0) of the non-breaking space (which is not included in the "whitespace" set of chars [1], so your regex didn't match).

      ___

      [1] update: at least not the iso-latin-1 encoding of the character, i.e. 0xA0 (for backwards compatibility, Perl assumes iso-latin-1 by default):

      print "\xa0" =~ /\s/ ? "space" : "no space"; # no space

      But see below. Apparently the 0xc2 part ("Â") somehow got lost in your case. Simply (and incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters.
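To illustrate the "two characters" point, here is a small sketch using the core Encode module. The sample string and variable names are assumptions for illustration only, not taken from the original input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would sit in a UTF-8 file: text followed by 0xC2 0xA0.
my $raw = "mmu-miR-704\xc2\xa0";

my $as_latin1 = decode('iso-8859-1', $raw);  # wrong guess: 0xC2 and 0xA0 stay two chars
my $as_utf8   = decode('UTF-8',      $raw);  # right guess: a single U+00A0

printf "latin-1: %d chars, UTF-8: %d chars\n",
    length($as_latin1), length($as_utf8);    # prints latin-1: 13 chars, UTF-8: 12 chars
```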

        If one has Perl 5.10 or later, one can use /\h/, which will match a non-breaking space regardless of whether the string is encoded in UTF8 or not.
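A hedged sketch of how that could look in practice: decode the raw line first, then strip trailing horizontal whitespace together with the line ending. The function name clean_line is hypothetical, and the sketch assumes the input really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw UTF-8 bytes, then remove trailing
# horizontal whitespace (including U+00A0) and an optional CR/LF ending.
# \h and \R both need Perl 5.10 or later.
sub clean_line {
    my ($raw) = @_;
    my $line = decode('UTF-8', $raw);
    $line =~ s/\h*\R?\z//;
    return $line;
}

print clean_line("mmu-miR-704\xc2\xa0\x0d\x0a"), "\n";   # prints mmu-miR-704

# \h also matches a lone latin-1 0xA0 that was never decoded.
print "\xa0" =~ /\h/ ? "horizontal space" : "no match", "\n";
```

Note that /\h/ matches the character U+00A0, not the undecoded two-byte sequence 0xC2 0xA0 as a unit, which is why the sketch decodes first.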
      00000000  6d 6d 75 2d 6d 69 52 2d  37 30 34 c2 a0 0d 0a 6d  |mmu-miR-704....m|

      0xa0 is a non-breaking space in e.g. latin1; c2 would be LATIN CAPITAL LETTER A WITH CIRCUMFLEX, assuming latin1. Some PC charsets use chars in that region for e.g. DOS-style line-drawing. Badly done pasting might have added these chars?

      Update: just checked UTF-8. Almut's correct: it looks like you've got submissions in UTF-8 which accidentally use the wrong space char. Probably the submitter is preparing the file in Word or something similarly unsuitable.

      One sane approach is whitelisting, as already suggested by Silas, e.g. just stripping everything that is not alphanumeric or a minus with s![^a-z0-9\-]!!gio. Note that this will also eat up spaces and line ends in $_. This works because we stick to the common subset of ASCII, which is also valid for submissions in UTF-8 and latin1. If you also see other charsets, tools like GNU recode might help if enlightening submitters fails.
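The whitelisting idea could be sketched roughly as follows. The function name whitelist_id is made up, and the character class is spelled out as A-Za-z instead of using /i, which is a small variation on the substitution quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only ASCII letters, digits and the minus sign.
# Everything else (spaces, the 0xC2 0xA0 bytes, CR, LF) is dropped, so this
# works on raw bytes whether the file was UTF-8 or latin1.
sub whitelist_id {
    my ($raw) = @_;
    (my $id = $raw) =~ s/[^A-Za-z0-9-]//g;
    return $id;
}

print whitelist_id("mmu-miR-704\xc2\xa0\x0d\x0a"), "\n";   # prints mmu-miR-704
```

Since the line ends are removed as well, this should be applied to one line at a time, not to the whole file contents at once.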