mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I am working on a bioinformatics web program connected with micro RNAs. As input, the user gives me a list of micro RNAs (miRs), and I recognize miR targets (not relevant to my question, but a bit of background is always nice).
The user interface is written in PHP, and the actual work is done in Perl. The input miRs are saved in an input file, which the Perl program parses and processes.
For the past few days, we have been experiencing a strange bug on some of the input. On closer inspection, it seems that this happened when the input has trailing spaces (see input file below).
mmu-miR-704  mmu-miR-219  mmu-miR-145  mmu-miR-29b  mmu-miR-701  mmu-miR-34a  mmu-miR-150  mmu-miR-362 
However, the strange thing is that the space at the end is not a space. Every regex I try for deleting it fails. What's more,
if ($temp =~ /\s/){ print "Found Space!!\n"; }
doesn't find anything!
I can use different regexes to try and work around this problem, but I would really like to know what this mysterious character is. Any suggestions?

Thanks, Mr Guy (mrguy123)

The Road goes ever on and on Down from the door where it began.
----Bilbo Baggins

Update: Here is a link to the input. As you can see, a non-core character is indeed at the end of each line. The question is why, and how to get rid of it.

Replies are listed 'Best First'.
Re: Something strange in the world or Regexes
by Corion (Patriarch) on Sep 30, 2009 at 09:52 UTC

    Maybe something went wrong when pasting your code, because the following "hex dumper" shows me a \x20 from a line pasted:

    perl -lne "printf qq([%s] => %02x\n), $_, ord $_ for split //"
    mmu-miR-150 
    [m] => 6d
    [m] => 6d
    [u] => 75
    [-] => 2d
    [m] => 6d
    [i] => 69
    [R] => 52
    [-] => 2d
    [1] => 31
    [5] => 35
    [0] => 30
    [ ] => 20
      It makes sense that by copying and pasting the code I got rid of the "non core" character. Can you explain what you did with the "hex dumper"?
      Thanks, mrguy

        Something I find quite helpful with one liners is B::Deparse:

        >perl -MO=Deparse -lne "printf qq([%s] => %02x\n), $_, ord $_ for split //"
        BEGIN { $/ = "\n"; $\ = "\n"; }
        LINE: while (defined($_ = <ARGV>)) {
            chomp $_;
            printf "[%s] => %02x\n", $_, ord $_ foreach (split(//, $_, 0));
        }

        So that program goes through the input line by line, splits each line between bytes, and then feeds each byte to printf twice: once as a character and once as its number.
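The same idea as a small script, in case the one-liner quoting gets in the way. This is only a sketch: `dump_chars` is a name made up for this example, and the sample string (a miR name followed by a raw 0xA0 byte) is an assumption about what the real input looks like.

```perl
use strict;
use warnings;

# Return one "[char] => hex" line per character of $text,
# using the same printf format as the one-liner above.
sub dump_chars {
    my ($text) = @_;
    return map { sprintf("[%s] => %02x", $_, ord($_)) } split //, $text;
}

# Hypothetical sample: a miR name followed by a raw 0xA0 byte.
print "$_\n" for dump_chars("miR-150\xA0");
```

A trailing `a0` (or `c2` followed by `a0`) in the last lines of the dump, where a `20` was expected, would point at a non-breaking space rather than an ordinary one.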

Re: Something strange in the world or Regexes
by ikegami (Patriarch) on Sep 30, 2009 at 10:52 UTC

    You need to decode the bytes into characters before treating them as characters:

    $ perl -e'print "\xc2\xa0"' | perl -le'
        binmode STDIN, ":encoding(UTF-8)";
        chomp( $_ = <STDIN> );
        print /\s/ ? "space" : "no space";
    '
    space
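For the "how do I regex it away" part, something along these lines might do, assuming the input file really is UTF-8 (if it is iso-latin-1 with a bare 0xA0, the decode step would need a different encoding name). `clean_line` is a hypothetical helper, not anything from the original program.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw UTF-8 bytes into characters, then strip trailing whitespace.
# U+00A0 is listed explicitly so the result does not depend on which
# matching rules \s happens to be using.
sub clean_line {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/[\s\x{A0}]+\z//;
    return $text;
}

# Sample bytes modelled on the hex dump posted in this thread.
my $raw = "mmu-miR-704\xC2\xA0\r\n";
print clean_line($raw), "\n";
```

Decoding first means the NBSP is a single character by the time the substitution runs, so there is no stray 0xC2 byte left behind.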

      Interesting observation (to me at least):

      The single line of binmode stdin :encoding(X) does extend \s for both X = LATIN1 and UTF8 to include the non-breakable space. Without it, regardless of LANG and LC_* variables set in the shell, you've the old semantics for \s. Not (yet?) mentioned in perldoc -f binmode, but OTOH it mentions a nice way to flush STDIN.

      Looks like binmode and PerlIO got way more interesting in the meantime :).

        Good catch! You caught me giving the brief answer. Let's start demonstrating what you describe:
        $ perl -MEncode -le'
            $_ = decode("UTF-8", encode("UTF-8", "\xA0"));
            print /\s/ ? "space" : "no space";

            $_ = decode("iso-latin-1", encode("iso-latin-1", "\xA0"));
            print /\s/ ? "space" : "no space";

            $_ = "\xA0";
            print /\s/ ? "space" : "no space";

            $_ = "\xA0\x{2660}";
            print /^\s/ ? "space" : "no space";
        '
        space
        space
        no space
        space

        Regex matching follows one of two sets of rules: "byte semantics" or "unicode semantics".[*1] The set of rules used is determined by the internal encoding of the string used to build the pattern and/or the internal encoding of the string against which the pattern is being matched.[*2]

        By default, strings are internally encoded as iso-latin-1 if possible.[*3] On the other hand, the decoding facilities of Encode, utf8 and PerlIO::encoding return strings internally encoded as utf8. This enables unicode semantics on matching.

        Under byte semantics, \s matches whitespace in the ASCII range only. Under unicode semantics, \s matches anything Unicode considers whitespace[*4], which includes NBSP (U+00A0).

        The internal encoding of a string can be manipulated using utf8::upgrade and utf8::downgrade.

        *1 — This post doesn't discuss the effects of use locale, if any.

        *2 — Expect (backwards compatible) changes in this area in 5.12.

        *3 — This post doesn't discuss the effects of (broken) use encoding, if any.

        *4 — There are bugs in many properties, but I don't think \s has any errors. These are being fixed for 5.12.
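To see the internal encoding flip the result of \s on the very same character, a minimal sketch follows. It assumes a Perl where the `unicode_strings` feature is not switched on and no /u or /a modifier is used, so the default matching rules described above apply. `has_space` is a helper name invented for the example.

```perl
use strict;
use warnings;

# Report whether \s matches anywhere in the given string.
sub has_space { return $_[0] =~ /\s/ ? "space" : "no space" }

my $nbsp = "\xA0";             # one character, stored internally as one byte
print has_space($nbsp), "\n";  # byte semantics apply

utf8::upgrade($nbsp);          # same character, internal storage is now utf8
print has_space($nbsp), "\n";  # unicode semantics apply

utf8::downgrade($nbsp);        # back to the single-byte form
print has_space($nbsp), "\n";  # byte semantics again
```

The string compares equal to "\xA0" at every step; only its internal storage changes, which is what makes this behaviour so easy to trip over.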

Re: Something strange in the world or Regexes
by SilasTheMonk (Chaplain) on Sep 30, 2009 at 09:49 UTC
    Given that this has ultimately come from a web page, I suspect that it might be some sort of non-core character. What I would suggest is that you work out which characters are allowable, then use a regular expression like /^([\w\-]+)/ and take $1 to extract the string of interest. Actually, you should make your regular expression match as closely as possible what is allowable. I doubt that anything malicious is going on here, but in general, suspicious characters entered via web pages are a common form of attack on the internet. Taint mode (-T) is the usual defence in the Perl world, and you may want to read up on that even if you decide it is overkill in your case.
      That's pretty much what I did.
      However, I would like to know what's going on with this non-core character.
Re: Something strange in the world or Regexes
by jakobi (Pilgrim) on Sep 30, 2009 at 09:52 UTC

    Try to post a link to the real input. Downloading it I see no space at the line ends at all:

    00000000  6d 6d 75 2d 6d 69 52 2d  37 30 34 a0 0d 0a 6d 6d  |mmu-miR-704...mm|

    Which is slightly funny alright (A0+CRLF; thx to ikegami below), but possibly an artefact of the site markup + pasting. And probably NOT what you're using. (If I just paste from the node, I also see a trailing 0x20 space, like Corion.)

    also to check: how do you get the input: $temp - does it contain any LF or CR line endings?

    update: also to check: char encoding of the stuff you get? cat -vet/hd/od -x might help in figuring out things (the last two being examples of those "hex dumpers" on unix or in cygwin, xxd is probably also widely available and used for vim's pseudo hex mode).

      A0+CRLF, not 0A+CRLF

      Good idea!
      The link is here

      As you can now see, there is a weird character at the end of each line. It seems we now know what the problem is.
      Only question is, how did it get into the input and how can I regex it away?

        It's the UTF-8 encoding (0xC2 0xA0) of the non-breaking space (which is not included in the "whitespace" set of chars[1], so your regex didn't match).

        ___

        [1] update: at least not the iso-latin-1 encoding of the character, i.e. 0xA0  (for backwards compatibility, Perl assumes iso-latin-1 by default):

        print "\xa0" =~ /\s/ ? "space" : "no space"; # no space

        But see below.  Apparently, the 0xc2 part ("Â") somehow got lost in your case... — simply (incorrectly) treating the UTF-8 sequence as iso-latin-1 should have left you with two characters.
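The "two characters versus one" point can be checked directly. A small sketch, using only Encode's decode; `codepoints` is a helper name made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render a character string as a space-separated list of code points.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf("U+%04X", ord($_)) } split //, $text;
}

my $bytes = "\xC2\xA0";    # the two bytes in question

# Read as iso-latin-1, every byte is its own character.
print "latin1: ", codepoints(decode('iso-8859-1', $bytes)), "\n";

# Read as UTF-8, the pair collapses into one character.
print "utf8:   ", codepoints(decode('UTF-8', $bytes)), "\n";
```

The first line lists U+00C2 and U+00A0, the second only U+00A0, so an input with a lone 0xA0 and no 0xC2 has lost a byte somewhere along the way (or was never UTF-8 to begin with).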

        00000000  6d 6d 75 2d 6d 69 52 2d  37 30 34 c2 a0 0d 0a 6d  |mmu-miR-704....m|

        0xa0 is an unbreakable space in e.g. latin1. c2 would be LATIN CAPITAL LETTER A WITH CIRCUMFLEX assuming latin1. Some pc charsets use chars in that region for e.g. dos-style line-drawing. Badly done pasting might have added these chars?

        Update: just checked UTF-8: Almut's correct. It looks like you have submissions in UTF-8 which accidentally use the wrong space char. Probably the submitter is preparing the file in Word or something similarly unsuitable.

        One sane approach is whitelisting as already suggested by Silas, e.g. just stripping non-alphanumerics-non-minus with e.g. s![^a-z0-9\-]!!gio. Note that this will also eat up space and line ends in $_. Which works, as we stick to the common subset of ASCII, which is also valid for submissions in UTF-8 and latin1. If you also see other charsets, things like GNU recode might help if enlightening submitters fails.

Re: Something strange in the world or Regexes
by Your Mother (Archbishop) on Sep 30, 2009 at 17:10 UTC

    I would further recommend a general adjustment of how you approach it. You are doing black list filtering, i.e., I don't want "\s." White listing is safer and makes it harder to violate expectations. The following is a stab at it from your sample data.

    if ($temp =~ /\b(mmu-miR-\d\d[ab\d])\b/) {
        print "$1 is all I want\n";
    }

    Like ikegami said, encoding issues might need to be handled first.
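To pull every name out of a line rather than just the first, the white list can be applied with /g in list context. This is a rough sketch only: `extract_mirs` is a hypothetical name, and the pattern is a guess generalised from the sample data, so it would need adjusting for names of other shapes. The lookarounds use explicit ASCII classes instead of \b so that the result does not depend on how the string happens to be encoded.

```perl
use strict;
use warnings;

# Return every token that looks like a miR name. Spaces, NBSP bytes and
# line endings are never captured, so they need no separate cleanup.
sub extract_mirs {
    my ($line) = @_;
    return $line =~ /(?<![A-Za-z0-9-])([a-z]{3}-miR-\d+[a-z]?)(?![A-Za-z0-9-])/g;
}

# Sample line modelled on the posted input, with UTF-8 NBSP bytes in it.
my $line = "mmu-miR-704\xC2\xA0 mmu-miR-29b\xC2\xA0 mmu-miR-34a\xC2\xA0\r\n";
print "$_\n" for extract_mirs($line);
```

Anything that does not fit the pattern is silently dropped, so it may be worth logging lines where the number of names found differs from what was expected.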