Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
...
my $name = "John Doe Joe"; # should match
my $name = "John D"; # should match
my $name = "John "; # should not match
my $name = "John"; # should not match

# Match only if after a space it finds a letter
if($name=~/[\s+.]+/) {
    @values = split /[\s+.]+/, $name;
    print "\n@values\n";
}else{
    print "\n No spaces: $name\n";
}
...
Replies are listed 'Best First'.
Re: Check for Spaces in a String
by toolic (Bishop) on Jun 15, 2015 at 18:25 UTC
Outputs:
by Anonymous Monk on Jun 15, 2015 at 18:29 UTC
$name=~/\s+\w+/
Re: Check for Spaces in a String
by aaron_baugher (Curate) on Jun 15, 2015 at 20:52 UTC
Checking for a space followed by a word character is simple, though a few similar patterns might serve your needs better:
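The patterns from this reply were not preserved here. As a rough illustration of the kind of variations meant, this sketch compares four patterns against the question's sample names; the labels and the extra `John 3rd` sample are my own assumptions, not the author's.

```perl
use strict;
use warnings;

# Hypothetical labels paired with candidate patterns.
my @patterns = (
    [ 'space, then word char'      => qr/ \w/           ],
    [ 'whitespace, then word char' => qr/\s\w/          ],
    [ 'whitespace, then a letter'  => qr/\s[[:alpha:]]/ ],
    [ 'whitespace, then non-space' => qr/\s\S/          ],
);

# The four names from the question, plus one made-up name with a digit.
for my $name ('John Doe Joe', 'John D', 'John ', 'John', 'John 3rd') {
    for my $p (@patterns) {
        my ($label, $re) = @$p;
        printf "%-15s %-28s %s\n",
            "'$name'", $label, $name =~ $re ? 'yes' : 'no';
    }
}
```

Note that `\w` also accepts digits and underscores, so if "a letter" is meant strictly, the `[[:alpha:]]` variant is probably the closer fit.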
However, since you're applying a regex here, it might be just as efficient to go ahead and do the split and then see whether it split anything. That would take a bit more time on the lines that are a single word, but less time on the ones with multiple words:
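The "split first, then check" idea could look something like this. It is a minimal sketch: the helper name `split_if_multiword` is hypothetical, and the whitespace pattern is an assumption rather than the one from the original reply.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the fields if the string splits into
# two or more of them, or an empty list otherwise.
sub split_if_multiword {
    my ($name) = @_;
    my @v = split /\s+/, $name;
    # split drops trailing empty fields, so "John " yields one element.
    return defined $v[1] ? @v : ();
}

for my $name ('John Doe Joe', 'John D', 'John ', 'John') {
    my @v = split_if_multiword($name);
    if (@v) { print "Split: @v\n" }
    else    { print "No split: '$name'\n" }
}
```

One caveat with this pattern: a string with leading whitespace, such as `" John"`, produces an empty first field, so it would count as split.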
Update: I thought I'd benchmark it (code below), and found that if 50% of the values needed to be split as in the example above, the two methods were equally fast:
But when I made it so 75% of the values needed to be split, the "split everything and then check for a second element" method was the clear winner:
So it looks like if less than half your lines will need to be split, check first, then split the ones that matched. If more than half will end up being split, just split them all and check for a second element in the resulting array, and go from there. (Incidentally, checking for the second element ($v[1]) was also a gain over checking the number of elements (@v>1) as I originally did.) Here's the benchmarking code:
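The original benchmarking code is not preserved here. A comparison along these lines can be set up with the core `Benchmark` module; in this sketch the sample data, the sub names and the patterns are all assumptions, and the timings will differ by machine and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data: half of these values contain a second word.
my @data = ('John Doe Joe', 'John D', 'John ', 'John') x 250;

# Method 1: test with a regex first, split only on a match.
sub check_then_split {
    my $count = 0;
    for my $name (@data) {
        if ($name =~ /\s\w/) {
            my @v = split /\s+/, $name;
            $count++;
        }
    }
    return $count;
}

# Method 2: split everything, then look for a second element.
sub split_then_check {
    my $count = 0;
    for my $name (@data) {
        my @v = split /\s+/, $name;
        $count++ if defined $v[1];
    }
    return $count;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    check_then_split => \&check_then_split,
    split_then_check => \&split_then_check,
});
```

Changing the mix of values in `@data` is the way to try other ratios, such as the 75% case described above.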
Aaron B.
Re: Check for Spaces in a String
by kcott (Archbishop) on Jun 16, 2015 at 13:20 UTC
You may be better off doing the initial check without using the regex engine; only using it with split where necessary. As you can see from ++aaron_baugher's analysis, your results will depend on your real data. Furthermore, if your volume of data is small, your choice of solution may make little difference (in terms of runtime). Here's a solution using substr, rindex and length for the initial check. As a proof-of-concept to show that these functions work on characters (as opposed to bytes), I've included single-byte and multi-byte characters in the data.
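The code from this reply is not preserved here, so what follows is not the script that produced the output quoted in the reply. It is only a sketch of how a regex-free initial check might be written with `substr`, `rindex` and `length`; the helper name `needs_split` and the sample strings are my own assumptions.

```perl
use strict;
use warnings;
use utf8;                               # string literals here contain UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: true if the string contains a literal space and
# does not end with one. Only plain spaces are considered.
sub needs_split {
    my ($str) = @_;
    return 0 unless length $str;            # empty string
    return 0 if substr($str, -1) eq ' ';    # last character is a space
    return rindex($str, ' ') >= 0 ? 1 : 0;  # a space occurs before the end
}

# Made-up samples mixing single-byte and multi-byte characters.
for my $str ('A B C D E', 'A ', 'A', '☿ ♀ ♁ ♂ ♃', '☿ ', '☿') {
    if (needs_split($str)) {
        my @parts = split / /, $str;
        print scalar(@parts), " parts: @parts\n";
    }
    else {
        print "not split: '$str'\n";
    }
}
```

This check is deliberately crude: a value such as `"John Doe "` (two words plus a trailing space) is rejected, so it would need refining if such data can occur.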
Output: A B C D E ☿ ♀ ♁ ♂ ♃
[The Unicode range of characters labelled "Astrological symbols" is 0x263d to 0x2647. There is no charname for "VENUS" or "MARS"; the charnames "FEMALE SIGN" and "MALE SIGN" are defined for these symbols, respectively.]
Here's a benchmark test. This uses my sample data. If you choose this route, you should benchmark with representative samples of your data.
Here are three sample runs:
With my sample data, doing the initial check without a regex appears faster. Again, I'll stress, you'll need to check with your data.
-- Ken
Re: Check for Spaces in a String
by talexb (Chancellor) on Jun 16, 2015 at 18:36 UTC
In my opinion, you should only be using a regular expression when the simpler solutions can't handle the problem. In this case, you can manage by just using split. Here's how:
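The code from this reply is not preserved here. A split-only approach might look roughly like this; the helper name `field_count` and the choice of the special `' '` pattern are assumptions on my part.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace and report how many fields
# came back.
sub field_count {
    my ($name) = @_;
    # The special pattern ' ' splits on runs of whitespace and skips
    # leading whitespace; trailing empty fields are dropped as usual.
    my @fields = split ' ', $name;
    return scalar @fields;
}

for my $name ('John Doe Joe', 'John D', 'John ', 'John') {
    my $n = field_count($name);
    printf "%-16s %d field%s\n", "'$name'", $n, $n == 1 ? '' : 's';
}
```

A count greater than one then plays the role of the "should match" cases from the question, with no separate regex test needed.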
Rather than only using split after you've gone through a regexp, I'd just use split and look at the result you get. Running this gives me the following useful output: