Strip user-defined words with regexp

Marcello has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Strip user-defined words with regexp by Limbic~Region (Chancellor) on Mar 04, 2004 at 14:38 UTC
Marcello, you do not say what marks the end of a sentence. In English there are many ways to do this (period, question mark, exclamation mark, etc.). Also, your regex does not look like it should work according to your specification. Note that the code below will also break on the sentence "How are you today Dr. Smith?", because it splits on every period.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $msg = "One bright day in the middle of the night,\n";
    $msg .= "two dead men got up to fight.\n";
    $msg .= "Back to Back they faced each other,\n";
    $msg .= "drew their swords and shot each other.\n";
    $msg .= "A deaf police man heard this noise,\n";
    $msg .= "came and killed those two dead boys.\n";
    $msg .= "If you don't believe this lie is true,\n";
    $msg .= "ask the blind man - he saw it too!\n";

    # Join the lines so a sentence can span what were several lines
    $msg =~ tr/\n//d;

    for my $sentence ( split /[.!?]/, $msg ) {
        if ( $sentence =~ /^\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)\s+/ ) {
            print "$1 $2\n";
        }
    }
    __END__
    One bright
    Back to
    A deaf
    If you

Cheers - L~R
Re: Strip user-defined words with regexp by rnahi (Curate) on Mar 04, 2004 at 14:37 UTC
I would do it this way. I don't know if it deserves a high grade, but it gets the job done :).

    my $count = 0;
    while ( $message =~ /([A-Za-z0-9]+)/g ) {
        last if $count++ > 1;
        print "$1\n";
    }
Re: Re: Strip user-defined words with regexp by Limbic~Region (Chancellor) on Mar 04, 2004 at 14:54 UTC
rnahi, "but it gets the job done :)." Sorry to nitpick, but actually it doesn't. The OP wanted the first two words of each sentence. L~R

Update: Clarified the wording above (the updated text was in italics). I am still wrong, because I misinterpreted the OP's requirements.
Re: Re: Re: Strip user-defined words with regexp by rnahi (Curate) on Mar 04, 2004 at 16:09 UTC
Did you try it? It prints the first 2 words as defined by the OP.
Re: Re: Re: Re: Strip user-defined words with regexp by Limbic~Region (Chancellor) on Mar 04, 2004 at 16:22 UTC
Re: Re: Re: Strip user-defined words with regexp by Marcello (Hermit) on Mar 04, 2004 at 15:03 UTC
I think my original post was unclear. I am looking for at most the first two words, so the result may be zero, one or two words. I've modified his code to:

    my $firstWord  = undef;
    my $secondWord = undef;
    my $i = 0;
    while ( $message =~ /([A-Za-z0-9]+)/g ) {
        if ( $i == 0 ) {
            $firstWord = $1;
        }
        elsif ( $i == 1 ) {
            $secondWord = $1;
            last;
        }
        $i++;
    }

which does exactly what I was looking for. I was only trying to do it in a single regexp. Thanks, Marcel
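For what it's worth, the "single regexp" goal looks reachable as well. The following is only a sketch, assuming a word is a run of `[A-Za-z0-9]` as described in this thread; the sub names `first_two_words` and `first_two_words_g` are made up for illustration and do not come from any post here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to two "words" (runs of ASCII letters/digits).
# The second capture sits in an optional group, so one-word input still matches.
sub first_two_words {
    my ($message) = @_;
    my ( $first, $second ) =
        $message =~ /([A-Za-z0-9]+)(?:[^A-Za-z0-9]+([A-Za-z0-9]+))?/;
    return ( $first, $second );    # either value may be undef
}

# Hypothetical alternative: in list context, /g returns every match,
# and the list assignment keeps only the first two.
sub first_two_words_g {
    my ($message) = @_;
    my ( $first, $second ) = $message =~ /[A-Za-z0-9]+/g;
    return ( $first, $second );
}

# Single-quoted inputs so that "$" is never interpolated
for my $message ( 'test one', 'test', '_$ TEST..1..', '...' ) {
    my ( $first, $second ) = first_two_words($message);
    printf "%s | %s\n", $first // '(none)', $second // '(none)';
}
```

Both subs leave the missing words as `undef`, so the caller can tell zero, one and two words apart with `defined`.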
Re: Strip user-defined words with regexp by BrowserUk (Patriarch) on Mar 04, 2004 at 16:01 UTC
This might meet the requirements: `my( $word1, $word2 ) = ( grep $_, split /[^A-Za-z0-9]+/, $message )[0, 1];` Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
Re: Re: Strip user-defined words with regexp by Marcello (Hermit) on Mar 04, 2004 at 19:12 UTC
Out of curiosity: why not just `my ($word1, $word2) = split(/[^A-Za-z0-9]+/, $message);`? Marcel
Re: Re: Re: Strip user-defined words with regexp by BrowserUk (Patriarch) on Mar 04, 2004 at 19:56 UTC
Given the OP's example input:

With the grep:

    $message = "\n +_ ABC1_\n2 3 4";
    print join '|', ( grep $_, split /[^A-Za-z0-9]+/, $message )[0,1];

    ABC1|2

Without:

    $message = "\n +_ ABC1_\n2 3 4";
    print join '|', split /[^A-Za-z0-9]+/, $message;

    |ABC1|2|3|4

You'll notice the null leading element. The list slice is pretty redundant, but it does make it obvious that you only want the first two.
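One caveat, offered as an aside rather than a correction: `grep $_` filters on truthiness, so a word that is exactly `0` would be dropped along with the empty leading field. Below is a small sketch of the difference, filtering on length instead. The sub names `words_truthy` and `words_nonempty` are hypothetical, and the input `"0 abc def"` is an invented test case, not one from the OP.

```perl
use strict;
use warnings;

# Hypothetical helper: drops every false field, so "" and "0" both disappear.
sub words_truthy {
    my ($message) = @_;
    my ( $first, $second ) = grep { $_ } split /[^A-Za-z0-9]+/, $message;
    return ( $first, $second );
}

# Hypothetical helper: drops only empty fields, so a word "0" survives.
sub words_nonempty {
    my ($message) = @_;
    my ( $first, $second ) = grep { length $_ } split /[^A-Za-z0-9]+/, $message;
    return ( $first, $second );
}

my $message = "0 abc def";    # invented input whose first word is "0"
print join( '|', words_truthy($message) ),   "\n";
print join( '|', words_nonempty($message) ), "\n";
```

Whether this matters depends on whether `0` can ever be a meaningful first word in the application.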
Re: Strip user-defined words with regexp by Happy-the-monk (Canon) on Mar 04, 2004 at 14:32 UTC
What makes up or defines a sentence? The newline? The dot if it's not followed by a word character?
Re: Re: Strip user-defined words with regexp by Marcello (Hermit) on Mar 04, 2004 at 14:46 UTC
Sorry, the term "phrase" is probably better. I am looking for the first two words in a phrase; the phrase can end with anything and can contain newlines, etc.
Re: Strip user-defined words with regexp by halley (Prior) on Mar 04, 2004 at 14:32 UTC
This kinda sounds like homework. You might try typing the following command at your command prompt: `perldoc perlre`. Check out the `\w` match symbol, and ask yourself why you're using all these `*` in a regex when the problem as stated says `+`. -- `[ e d @ h a l l e y . c c ]`
Re: Re: Strip user-defined words with regexp by Marcello (Hermit) on Mar 04, 2004 at 14:41 UTC
I knew somebody was going to say this... It's not. I have an application which has to determine what to do from the first two words of a phrase. This phrase can be anything; it might even be only one word. Examples:

    my $message = "test one";
    my $message = "test";
    my $message = "_\$ test...";
    my $message = "_\$ TEST..1..";
    my $message = "_\$\nTEST1.\n.1.2.3";

BTW: `\w` is not helping me here, since I do not want the underscore character.
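The underscore point is easy to check. Here is a minimal sketch showing that `\w` accepts `_` while the explicit class does not. `[^\W_]` is included as one alternative spelling, with the caveat that `\w` can also match non-ASCII letters depending on the string and regex flags, so the explicit ASCII class stays closest to the stated requirement. The sub name `words_by` and the sample text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every match of $pattern in $text (list context).
sub words_by {
    my ( $pattern, $text ) = @_;
    return $text =~ /$pattern/g;
}

my $text = 'foo_bar baz_1';    # invented sample containing underscores

my @with_w   = words_by( qr/\w+/,          $text );    # underscore counts as a word char
my @explicit = words_by( qr/[A-Za-z0-9]+/, $text );    # underscore splits the words
my @no_under = words_by( qr/[^\W_]+/,      $text );    # \w minus the underscore

print join( ',', @with_w ),   "\n";
print join( ',', @explicit ), "\n";
print join( ',', @no_under ), "\n";
```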
Re: Re: Re: Strip user-defined words with regexp by Happy-the-monk (Canon) on Mar 04, 2004 at 14:54 UTC
Limbic~Region has almost got it then:

    if ( $message =~ /([a-zA-Z0-9]+)[^a-zA-Z0-9]+([a-zA-Z0-9]+)/ ) {
        print "$1 $2\n";
    }