Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to filter out leading and trailing spaces from user input (web form).

I do this with: $data =~ s/^\s+|\s+$/;

I'm noticing some users entering shifted-spaces (0xA0) on occasion. \s doesn't match a shifted-space.

Since this sort of regex filtering is performed in many areas of our code base, adding 0xA0 to the definition of \s seems like a good approach.

Is it easy? Can it be done on a per-script basis? Can it be done without hacking the perl source?

Re: Can I change the definition of '\s'?
by davorg (Chancellor) on Sep 11, 2002 at 18:48 UTC

    It can't be done (as far as I know) without hacking the Perl source.

    You can, however, create a character class that contains both \s and your "extra" character: [\s\xA0]. You can even create a variable that contains your character class and use that in your regexes.

    my $ws = '[\s\xA0]'; # white space
    $data =~ s/^$ws+|$ws+$//g;
    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Can I change the definition of '\s'?
by John M. Dlugosz (Monsignor) on Sep 11, 2002 at 18:54 UTC
    It might not seem any different than davorg's solution, but using qr instead of quotes is better. It makes the syntax less problematic if the sub-expression has any special characters in it!

    my $space= qr/[\s\xA0]/;

    Hmm, your code sample doesn't parse (you wrote s// rather than s///), and it finds leading OR trailing space. Assuming you just forgot a slash, it will strip the leading space if there is any and leave the trailing space, or strip the trailing space if there was no leading space.
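
To illustrate the "leading OR trailing" point, here is a minimal sketch. The sample string and the names $once and $both are made up for illustration; only $space comes from the snippet above. It assumes the input is a plain byte string, which is the case where \s does not match 0xA0.

```perl
use strict;
use warnings;

# Ordinary whitespace plus the non-breaking space (0xA0).
my $space = qr/[\s\xA0]/;

# Hypothetical form input with padding on both sides.
my $data = "\xA0  hello world \xA0";

# Without /g the substitution stops after the first match,
# so only the leading run is removed.
(my $once = $data) =~ s/^$space+|$space+$//;

# With /g the alternation is tried again and the trailing run goes too.
(my $both = $data) =~ s/^$space+|$space+$//g;

printf "once: [%s]\n", $once;
printf "both: [%s]\n", $both;
```

Because $space is a compiled qr// object, it can be interpolated into larger patterns without worrying about quoting.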

Re: Can I change the definition of '\s'?
by the_Don (Scribe) on Sep 11, 2002 at 19:06 UTC

    I do not know if you can change the '\s' but you can define your own character property. An example from Programming Perl is

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

    so you should be able to do this:

    sub IsMySpace {
        # 00A0 should be the hexadecimal code point for your space (0xA0)
        return <<'END';
    +utf8::IsSpace
    00A0
    END
    }

    The name of a user-defined property must begin with "In" or "Is". This sub must also be defined in the package that needs the property. It can be imported from another module.

    Then your regexes could use /\p{IsMySpace}/ to check, and changing the definition can be done in one location.
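
A runnable sketch of that idea follows. The property name IsMySpace and the sample input are assumptions made for illustration; user-defined property names have to start with "In" or "Is", and the heredoc terminator must sit at the start of its line.

```perl
use strict;
use warnings;

# User-defined character property (hypothetical name).
# Each line of the returned text adds to the set: a built-in
# property prefixed with "+", or a hexadecimal code point.
sub IsMySpace {
    return <<'END';
+utf8::IsSpace
00A0
END
}

# Hypothetical form input padded with spaces and 0xA0.
my $data = "\xA0 some input \xA0";

# Trim each end with the custom property.
$data =~ s/^\p{IsMySpace}+//;
$data =~ s/\p{IsMySpace}+$//;

print "[$data]\n";
```

The property is looked up in the package where the regex is compiled, so this sketch keeps everything in main.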

    For further information, see Programming Perl, 3rd Edition, page 173. More information about character classes can be found in the online version of perlretut.

    I hope this helps.

    the_Don
    ...making offers others can't refuse.

    UPDATE: Changed 4th edition to 3rd edition. Too many books at my desk, I'm sorry.

      For further information Programming Perl, 4th Edition, page 173

      Hey! No fair! You're referencing books that the rest of us won't get to see for probably another two years.

      --
      <http://www.dave.org.uk>

      "The first rule of Perl club is you do not talk about Perl club."
      -- Chip Salzenberg

Re: Can I change the definition of '\s'?
by mephit (Scribe) on Sep 11, 2002 at 20:38 UTC
    As perlfaq4 says, it would be better to remove leading and trailing whitespace in two steps, not one:

    for ($string) { s/^\s+//; s/\s+$//; }

    So, I'd suggest doing it with two separate substitutions and the custom character class that davorg and John M. Dlugosz suggested. HTH

    --

    There are 10 kinds of people -- those that understand binary, and those that don't.

Re: Can I change the definition of '\s'?
by Juerd (Abbot) on Sep 11, 2002 at 22:38 UTC

    It can be done using a source filter.

    package Filter::BackslashS;
    use Filter::Simple;
    use strict;
    FILTER_ONLY regex => { s/\\s/[\\s\xA0]/g };

    use Filter::BackslashS;
    print "\xA0" =~ /\s/ ? "Yay!\n" : "Hmmm\n";

    Please note that this only changes the meaning of \s in literal regexes.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      I think you missed a sub keyword in there...
      FILTER_ONLY regex => sub { s/\\s/[\\s\xA0]/g };
      Also, this won't always work, since perl cannot nest character classes. What about these (albeit contrived) tests?

      print "\xA0" =~ /[A-Z\s]/ ? "Yay!\n" : "Hmmm\n";
      print '[]' !~ /[A-Z\s]/ ? "Yay!\n" : "Hmmm\n";
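
The nesting problem can be seen without installing a filter. The sketch below only simulates the filter's textual rewrite on a pattern held in a string; the variable names are made up for illustration, and it is not how Filter::Simple works internally.

```perl
use strict;
use warnings;

# A pattern that already uses \s inside a bracketed class.
my $pattern = '[A-Z\s]';

# Apply the same textual rewrite the filter would apply.
(my $rewritten = $pattern) =~ s/\\s/[\\s\\xA0]/g;
print "rewritten: $rewritten\n";

# The inner "[" is now just a member of the class, the first "]"
# closes the class, and the final "]" is a literal that must follow.
my $re = qr/$rewritten/;

print "0xA0 alone: ", ("\xA0" =~ $re ? "matches" : "no match"), "\n";
print "'[]':       ", ('[]'   =~ $re ? "matches" : "no match"), "\n";
```

So the rewritten pattern no longer matches a lone 0xA0, yet it does match the string '[]', which is the opposite of what both tests expect.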

      -Blake

Re: Can I change the definition of '\s'?
by blakem (Monsignor) on Sep 12, 2002 at 03:36 UTC
    $data =~ s/^\s+|\s+$/;
    It might just be a typo, but as written that code will only trim trailing whitespace if there is no leading whitespace... In other words, if a string has leading whitespace it won't ever check for trailing whitespace. You'll need the /g modifier to get your example code to match your spec.
    $data =~ s/^\s+|\s+$//g;

    -Blake

Re: Can I change the definition of '\s'?
by Anonymous Monk on Sep 12, 2002 at 15:40 UTC
    Many thanks to all the Monks who responded to my query.

    Apologies for the typo. Not having posted to this site before, I didn't expect to have to post HTML text, and the trailing '/g;' on my regexp got changed to ';' during editing. :^(

    Of the solutions offered, defining my own character class is the most attractive. I can define it once in my application framework and make use of it in all user interface modules (about 40 so far).
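
For what it's worth, defining the class once might look something like the sketch below. The package name My::Whitespace, the variable $WS and the trim function are all hypothetical; a real framework would probably put the package in its own file and export the names.

```perl
use strict;
use warnings;

# Hypothetical framework package holding the single definition.
package My::Whitespace;

# Ordinary whitespace plus the non-breaking space (0xA0).
our $WS = qr/[\s\xA0]/;

# Trim in two separate substitutions, as perlfaq4 suggests.
sub trim {
    my ($text) = @_;
    $text =~ s/^$WS+//;
    $text =~ s/$WS+$//;
    return $text;
}

# A user interface module would then just call the shared function.
package main;

my $clean = My::Whitespace::trim("\xA0 Jane Doe \xA0\n");
print "[$clean]\n";
```

If another odd space character turns up later, only the definition of $WS has to change.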