Match first word in line fails

yacoubean has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Match first word in line fails by mr_mischief (Monsignor) on Jan 30, 2005 at 07:53 UTC
Maybe I'm just daft tonight, as may be the case at after 1 AM my time. Perl has powerful features in its regex engine. I suggest using them and being done with it. Is there some reason I'm missing that the following doesn't work? `#!/usr/bin/perl -w use strict; my %bad_words = ( 'crud' => 'dirt', 'crap' => 'dung', 'poo' => 'dung', 'wanker' => 'fool' ); my $bad_re = join '\|', keys %bad_words; while ( <DATA> ) { s/\b($bad_re)\b/$bad_words{lc($1)}/ieg; print; } __DATA__ "Wanker!" said I. "Crap," says Travis. This is horse poo. Poo, I tell you! "Poo upon all the crud and crap the wanker could see."` [download] When I tested this, it did what I think is being sought. Other than the building of the wordlist hash and the displaying (or writing to file, or assigning to a variable) of the changed text, it's just one line outside the loop to get ready for the regex, and one inside the loop to do the work. Here's my output: `"fool!" said I. "dung," says Travis. This is horse dung. dung, I tell you! "dung upon all the dirt and dung the fool could see."` [download] There's no need, as far as I can tell, for anything besides search and replace when what's wanted is search and replace. Update: Looking back at this, the /e on the s/// isn't necessary. Christopher E. Stith	[reply] [d/l] [select]
Re^2: Match first word in line fails by ysth (Canon) on Jan 30, 2005 at 10:36 UTC
I would make it slightly more complicated, by calling replaceword("$1") in the substitution: `sub replaceword { my $w = $_[0]; my $repl = $bad_words{lc($w)}; if (lc($w) eq $w) { lc($repl); } elsif (ucfirst($w) eq $w) { ucfirst($repl); } else { uc($repl); } }` [download]	[reply] [d/l]
Re^3: Match first word in line fails by mr_mischief (Monsignor) on Jan 31, 2005 at 06:33 UTC
If getting the right case is important, some extra steps may be needed. I think this version of replaceword() would be sufficient, as either lowercase or ucfirst() would be the normal casing. All-uppercase as a default disallows strange mixed case renderings, so while it would preserve some effect, it's still not a complete solution. `sub replaceword { my $w = $_[0]; my $repl = $bad_words{ lc( $w ) }; $repl = ucfirst( $repl ) if ( $w =~ /^[A-Z]/ ); return $repl; }` [download] Of course, this would require that the eval flag stay on the substitution in my previous post, contrary to the update which applies to the code as appears in that node. Christopher E. Stith	[reply] [d/l]
Re^4: Match first word in line fails by ysth (Canon) on Feb 02, 2005 at 06:52 UTC
Re: Match first word in line fails by holli (Abbot) on Jan 30, 2005 at 04:57 UTC
I´d rewrite your code as follows: `#!/usr/bin/perl use strict; use warnings; my $badwords = join("\|", @badwords); while (<$book>) { s/[\"'.,!?:;()[\]{}\|\\\/-]/ /g; #replace all punctuation with a sp +ace s/\b($badwords)\b/replaceword($1)/eig; #replace all bad words, ign +ore case }` [download] Update: It will be much easier for us to tell, if you give us more sample data. holli, regexed monk	[reply] [d/l]
Re^2: Match first word in line fails by lidden (Curate) on Jan 30, 2005 at 05:10 UTC
Yes, but he did not want to remove punctuation from the results. So I guess the first line in the loop shoud be dropped too.	[reply]
Re^2: Match first word in line fails by yacoubean (Scribe) on Jan 30, 2005 at 05:29 UTC
I guess that's better, it finds more words than before, but it still isn't finding all of the words. But alas, I'm too tired to work on it more at this hour. I'll hit it more tomorrow and let you guys know. -------------- My home on the web: http://www.techfeed.net/	[reply]
Re: Match first word in line fails by CountZero (Bishop) on Jan 30, 2005 at 14:51 UTC
There is nothing wrong with your code. You are extracting all words and matching all badwords. I slightly adapted your code and added some print-statements to see what happens: use strict; use warnings; my @badwords = qw/crud crap poo wanker/; while (<DATA>) { my $line = $_; my $plainline = $line; $plainline =~ s/["'.,!?:;\-()[\]{}\|\\\/]/ /g; #replace all punctua +tion with a space my @sentence = split(/ /,$plainline); print join '\|', @sentence,"\n"; foreach my $word (@sentence) { chomp($word); print "\tChecking $word\n"; my $whichword = 0; # to track which badword was found foreach my $badword (@badwords) { if (lc($word) eq $badword) { print "\t\tFound $badword\n"; #my $newword = replaceword($whichword); #get a cleaner + word to replace the naughty word #$line =~ s/($word)/$newword/i; } $whichword++; } } #$cleanbook .= $line; } __DATA__ "Wanker!" said I. "Crap," says Travis. This is horse poo. Poo, I tell you! "Poo upon all the crud and crap the wanker could see." [download] And the output of this is: \|Wanker\|\|\|said\|I\| \| Checking ** Checking Wanker Found wanker Checking Checking Checking said Checking I Checking \|Crap\|\|\|says\|Travis\| \| Checking Checking Crap Found crap Checking Checking Checking says Checking Travis Checking ** This\|is\|horse\|poo\|\|Poo\|\|I\|tell\|you\| \| Checking This Checking is Checking horse Checking poo Found poo Checking ** Checking Poo Found poo Checking ** Checking I Checking tell Checking you Checking \|Poo\|upon\|all\|the\|crud\|and\|crap\|the\|wanker\|could\|see\|\| \| Checking Checking Poo Found poo Checking upon Checking all Checking the Checking crud Found crud Checking and Checking crap Found crap Checking the Checking wanker Found wanker Checking could Checking see Checking Checking [download] Maybe the error is in your `replaceword` sub? As you see, you have lots of empty items in your wordslist since you replace punctuation by spaces and then split on spaces, effectively dropping all spaces but making a lot of empty "words". CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply] [d/l] [select]
Re^2: Match first word in line fails by Qiang (Friar) on Jan 31, 2005 at 02:58 UTC
guess we should point him out to brian's Guide to Solving Any Perl Problem	[reply]
Re: Match first word in line fails by Popcorn Dave (Abbot) on Jan 30, 2005 at 07:02 UTC
Here's another way to approach the problem. I'm not sure it's a better way than holli layed out for you but if you've got a bunch of words to replace I'd think a hash might be the way to go. Unfortunately this doesn't save your puncuation, and tonight I'm at a loss on how to split and save puncuation. With that said, here's my attempt: `#!/usr/bin/perl use strict; use warnings; my ($line, $cleanbook); my %badwords = (crud => "darn", shit => "shoot"); while ($line = <DATA>) { $line =~ s/["'.,!?:;\-()[\]{}\|\\\/]/ /g; #replace all punctuation +with a space my @sentence = split(/\W+/,$line); @sentence = map {$badwords{lc($_)}?$badwords{lc($_)}:$_} @sentence +; $cleanbook .= join(' ', @sentence)."\n"; } print $cleanbook; __DATA__ "Crud," said Travis "Shit, this crud is shit!"` [download] Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.	[reply] [d/l]
Re^2: Match first word in line fails by Anonymous Monk on Jan 31, 2005 at 17:47 UTC
Change this: `my @sentence = split(/\W+/,$line);` to: `my @sentence = split(/(\W+)/,$line);` Now @sentence will contain the punctuation as well.	[reply] [d/l] [select]
Re: Match first word in line fails by Anonymous Monk on Jan 31, 2005 at 00:33 UTC
This might help: `use Regexp::Common qw(profanity); my %bad2good = (...); while(<>) { s/($RE{profanity})/$bad2good{lc $1}/gi; print; }` [download] You'll have to come up with your own bad word replacement list. If the regexp doesn't work for you as is, you can either edit the module as you like, or submit a bug report.	[reply] [d/l]
Re: Match first word in line fails by frostman (Beadle) on Jan 31, 2005 at 01:34 UTC
Well yacoubean, I wish I could be more help but it's 2 in the morning and all I could think of is the following, which is food for thought at best. #!/usr/bin/perl # hmm, try bleeping... use strict; my @badwords = qw(crud poop darn); my $cleanbook; # i wonder if it will work... my $match = join("\|",@badwords); # lazy, not checking badwords. while (<DATA>) { # ahh, inefficiency! while ($_ =~ /$match/i) { my $bad = $&; (my $bloop = $bad) =~ s/(?<=\w)\w/\/sg; $_ =~ s/$bad/$bloop/sg; } $cleanbook .= $_; } print $cleanbook; __DATA__ "Crud," said Travis. But then, Travis was always something of a poopy-head. He could never keep to his darned self. Darn it Travis! [download] This gives me: `[frost@louddrunk]$ perl t.pl "C," said Travis. But then, Travis was always something of a py-head. He could never keep to his ded self. D* it Travis!` [download] ...which may or may not be useful in your case. I was just thinking that if I were in your shoes I'd probably rather bl**p the profanities than replace them with gentler versions. Erasing them requires too much skill with `Grammar::Parse`, and replacing them with other words is, well, putting words in people's mouths. On the other hand, the profane among us are quite used to the occasional obfuscation...	[reply] [d/l] [select]
Re^2: Match first word in line fails by frostman (Beadle) on Jan 31, 2005 at 01:43 UTC
Oh, and of course you can make it a little less bleepy like so: `(my $bloop = $bad) =~ s/(?<=\w{2})\w/\/sg;` [download] ...which I think makes it more legible: `"Cr," said Travis. But then, Travis was always something of a poy-head. He could never keep to his daed self. Da* it Travis!` [download]	[reply] [d/l] [select]
Re: Match first word in line fails by yacoubean (Scribe) on Jan 31, 2005 at 16:52 UTC
Wow! I didn't have time to mess with this problem yesterday, but I see that you guys went crazy and gave me some excellent tips! I'm back at work now, so I'll work on it more tonight, then let you know who won. ;) -------------- My home on the web: http://www.techfeed.net/	[reply]
Re^2: Match first word in line fails by yacoubean (Scribe) on Feb 01, 2005 at 03:29 UTC
Again, thanks a bunch guys. I ended up using mr_mischief's and ysth's suggestions. The other thing I didn't mention here is that I want to keep stats on the potty language that's replaced, so I added that in and now it all works the way I wanted! :) -------------- My home on the web: http://www.techfeed.net/	[reply]