PerlMonks  

Regex match for Any number of words?

by sbrothy (Acolyte)
on Jan 16, 2022 at 10:48 UTC ( #11140499=perlquestion )

sbrothy has asked for the wisdom of the Perl Monks concerning the following question:

I'm using C++ Boost but have it set to Perl syntax. How can I match (and return) any number of words? That is, an arbitrary number of them, e.g.:
" THIS is a variable number of words, spaces and punctuation. "
I'd like to end up with:
THIS is a variable number of words spaces and punctuation
I've been fighting with this. You can think about this till I get home and find some of my sorry attempts.... :) /Regards, Søren.

Replies are listed 'Best First'.
Re: Regex match for Any number of words?
by kcott (Archbishop) on Jan 16, 2022 at 11:40 UTC

    G'day sbrothy,

    It sounds like you just need "split /\W+/, $string"; although, I don't know what the "C++ Boost" connection is.

    $ perl -E 'say for split /\W+/, " THIS is a variable number of words, spaces and punctuation. "'

    THIS
    is
    a
    variable
    number
    of
    words
    spaces
    and
    punctuation

    Is the string you showed representative of your data? Do you have words with hyphens or apostrophes? Are sentences with leading whitespace normal?

    The simplicity of my solution may be invalid. Perhaps you need something closer to:

    $ perl -E 'say for split /[ ,.]+/, " THIS is a variable number of words, spaces and punctuation. " =~ s/^[ ,.]*//r'
    THIS
    is
    a
    variable
    number
    of
    words
    spaces
    and
    punctuation

    — Ken

Re: Regex match for Any number of words?
by davido (Cardinal) on Jan 16, 2022 at 19:48 UTC

    C++ has had regular expressions in the standard library (with varying degrees of support) since C++11, which is now ten or eleven years old. Unless you have to use Boost for this, you may just want to fall back to the standard implementation. While trying to remember how to retrieve matches, I found std::regex_iterator. The example in the online C++ reference is close enough to what you're looking for: regex_iterator

    #include <regex>
    #include <iterator>
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string s = "Quick brown fox.";

        std::regex words_regex("[^\\s]+");
        auto words_begin = std::sregex_iterator(s.begin(), s.end(), words_regex);
        auto words_end = std::sregex_iterator();

        std::cout << "Found " << std::distance(words_begin, words_end) << " words:\n";

        for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
            std::smatch match = *i;
            std::string match_str = match.str();
            std::cout << match_str << '\n';
        }
    }

    As you can see in the example, they're counting as a word anything that doesn't contain whitespace. You probably want to also exclude punctuation. So you would want to enumerate that in the character class. It gets harder when you want to deal with apostrophes, allowing them in words, while excluding single quoted constructs. That gets complicated fast.


    Dave
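    The point about excluding punctuation can be sketched roughly as follows. This is only a minimal illustration, assuming std::regex with its default ECMAScript grammar and treating a "word" as a run of \w characters (letters, digits, underscore). The function name extract_words is hypothetical, and apostrophes and hyphens are deliberately not handled.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every run of "word" characters
// (letters, digits, underscore) found in text.
// Assumption: \w is an acceptable definition of a word here, so
// "don't" would be split into "don" and "t".
std::vector<std::string> extract_words(const std::string& text)
{
    static const std::regex word_regex("\\w+");
    std::vector<std::string> words;

    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), word_regex), end;
         it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

    Calling extract_words(" THIS is a variable number of words, spaces and punctuation. ") should return the ten words without the comma or the full stop. Because this matches words instead of splitting on separators, leading whitespace produces no empty entry.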

Re: Regex match for Any number of words?
by tybalt89 (Prior) on Jan 16, 2022 at 15:35 UTC

    Why be negative?

    perl -le 'print for " THIS is a variable number of words, spaces and punctuation. " =~ /\w+/g'
    THIS
    is
    a
    variable
    number
    of
    words
    spaces
    and
    punctuation
Re: Regex match for Any number of words?
by haj (Priest) on Jan 16, 2022 at 11:43 UTC
    There are several ways to do this. A rather short one is:
    my $text = " THIS is a variable number of words, spaces and punctuation. ";
    my @list = split /\W+/,$text;

    So I'm splitting the text wherever I find one or more non-word characters.

    I don't know whether eliminating empty strings at the beginning and the end of the list is relevant, so I'm leaving this as an exercise to the reader (hint: grep $_,@list might not do what you want if you consider numbers as words).

Re: Regex match for Any number of words?
by Anonymous Monk on Jan 16, 2022 at 12:38 UTC
      Wow, thank you all. I'd expected you'd wait and see if I did any work of my own before answering but I completely forgot about this until now. These answers were all very helpful. Again, thank you! :)

        This is of course C++ and not Perl. Still, for completeness' sake:

        I ended up using boost::regex_token_iterator.

        So in all its glorious simplicity it ended up being just:

        boost::regex re("\\s+");

        Regards.
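        The post does not show how the iterator was driven, so the following is only a guess at the general shape, not the poster's actual code. It assumes std::sregex_token_iterator as a standard-library stand-in for boost::regex_token_iterator, so that it builds without Boost. The function name split_on is hypothetical. Submatch index -1 asks for the text between separator matches.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split text on a separator pattern.
// Assumption: std::sregex_token_iterator is used here in place of
// boost::regex_token_iterator.
std::vector<std::string> split_on(const std::string& text,
                                  const std::string& separator_pattern)
{
    const std::regex separator(separator_pattern);
    std::vector<std::string> tokens;

    // Index -1 selects the pieces between matches, not the matches themselves.
    for (std::sregex_token_iterator it(text.begin(), text.end(), separator, -1), end;
         it != end; ++it) {
        // A separator at the very start of the input yields an empty first
        // token, so empty pieces are skipped.
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

        With the "\\s+" pattern from the post, split_on keeps punctuation attached to the neighbouring word ("words," and "punctuation."). Passing "\\W+" instead, as in the Perl split answers above, drops it.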
