using substitution and pattern matching

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: using substitution and pattern matching by steves (Curate) on Dec 18, 2004 at 15:14 UTC
You need to read about capturing matches into the $1, $2, etc. variables. Basically, you need to put what you want to capture in parens. Each group of parens fills one $N variable, which you can then use in the replacement: `use strict; my $test = "this is 'inquotes' o'leary"; $test =~ s/ \'(\S*)\' / \"$1\" /g; print "$test\n";` It should also be noted that your use of whitespace to find just the "words" will likely fall apart for some cases. There are better ways of doing that. Again, as part of your study, check out zero-width assertions, such as \b, which matches at word boundaries without actually matching a physical character. Also check out lookahead assertions. In some cases, it can also be as simple as replacing something like your \S non-whitespace match with a character class that excludes quotes.
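To make the last two suggestions concrete, here is a minimal sketch that combines a quote-excluding character class with lookaround assertions. The sub name `requote_words` is made up for illustration, and the pattern assumes the quoted item is a single word with no embedded whitespace or quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 'word' into "word" without touching apostrophes.
sub requote_words {
    my ($text) = @_;
    # (?<!\w)  the opening quote must not follow a word character,
    #          which skips the apostrophe in o'leary
    # [^'\s]+  one or more characters that are neither quotes nor whitespace
    # (?!\w)   the closing quote must not be followed by a word character
    $text =~ s/(?<!\w)'([^'\s]+)'(?!\w)/"$1"/g;
    return $text;
}

print requote_words("this is 'inquotes' o'leary"), "\n";
# prints: this is "inquotes" o'leary
```

Because the assertions are zero-width, this version needs no surrounding spaces, so a quoted word at the very start or end of the string is handled as well.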
Re^2: using substitution and pattern matching by Anonymous Monk on Dec 18, 2004 at 16:20 UTC
Great! I knew there was a simple answer. Thanks.
Re^3: using substitution and pattern matching by graff (Chancellor) on Dec 18, 2004 at 17:28 UTC
It's a simple start, at least. Non-alphabetic marks in text (in English, at least) will always pose some boundary cases that are really hard or basically impossible to treat with a straightforward, procedural algorithm (and on top of that, people who create text tend to make mistakes or ignore "rules" of style). For the current task, there's the problem of the possessive apostrophe without a following "s" (because the word ends in "s"), and sometimes punctuation will follow a close-quote (even though style manuals say it shouldn't). Here's a worst case for you: 'You've got to talk to Miles' brother', she said. Easy for humans, hard for programs. There is a regex that will treat this one correctly: `s/ '(.*)'(\W)/ "$1"$2/; # note the greedy use of ".*"` but it will screw up on some other case that would need a non-greedy match, like: When he said 'kiss the sky,' I heard 'kiss this guy.' You just have to guess which sort of mistake will happen less often (and hope your data isn't really this bad). One other hint: for stuff like this, where initial and final positions in the string might make things more complicated, it's okay to "cheat" a little: add a space or some other "safe" character at the beginning and end of the string before working on the quotes, so that the edge cases can be treated just like the non-edge cases. You can take the edge padding off when you're done.
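The greedy versus non-greedy trade-off and the padding trick can be tried side by side. This is only a sketch: the sub `requote` and its `$greedy` flag are invented for the illustration, and the two patterns are the greedy regex from the post plus a `.*?` variant with `/g`.

```perl
use strict;
use warnings;

# Hypothetical helper: pad the string, substitute, then strip the padding.
sub requote {
    my ($text, $greedy) = @_;
    my $padded = " $text ";    # edge quotes now look like interior quotes
    if ($greedy) {
        # One quoted span, stretched as far right as possible.
        $padded =~ s/ '(.*)'(\W)/ "$1"$2/;
    }
    else {
        # Several quoted spans, each as short as possible.
        $padded =~ s/ '(.*?)'(\W)/ "$1"$2/g;
    }
    return substr($padded, 1, -1);    # drop the two padding spaces
}

print requote("'You've got to talk to Miles' brother', she said.", 1), "\n";
# prints: "You've got to talk to Miles' brother", she said.

print requote("When he said 'kiss the sky,' I heard 'kiss this guy.'", 0), "\n";
# prints: When he said "kiss the sky," I heard "kiss this guy."
```

Swapping the flags shows the failure modes: the non-greedy form closes the first sentence's quote after "Miles", and the greedy form merges the two quotations of the second sentence into one.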
Re: using substitution and pattern matching by davido (Cardinal) on Dec 18, 2004 at 16:21 UTC
When you use a regular expression (or substitution regular expression) to deal with quoted material, you run into problems with balanced quotes. Of course you can hope that the embedded apostrophes have been properly escaped, and of course they should be if they're embedded within single-quoted material. But the point is that things just get a little too complicated when dealing with balanced quotes using simple regular expressions. It is for this reason that people often suggest the module Text::Balanced. It can find balanced quoted text within a string. You can use it as a tool to help you rebuild the string with your new style of quotes. If you do like using regular expressions though, you might benefit from `use Regexp::Common qw/balanced/;`. Dave
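As a rough sketch of the Text::Balanced route, the loop below pulls out one single-quoted span at a time with `extract_delimited` and rebuilds the string with double quotes. The sub name `requote_balanced` is hypothetical, and the code leans on the assumption stated above: apostrophes inside the quoted material are backslash-escaped, and no stray apostrophes occur outside it.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical helper: rebuild a string, swapping '...' for "...".
sub requote_balanced {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # Skip everything that is not a quote, then extract one '...' span.
        # In list context the call returns (extracted, remainder, prefix).
        my ($quoted, $rest, $skipped) = extract_delimited($text, "'", "[^']*");
        last unless defined $quoted && length $quoted;   # nothing left to extract
        my $inner = substr($quoted, 1, -1);   # strip the surrounding quotes
        $inner =~ s/\\'/'/g;                  # unescape embedded apostrophes
        $out .= $skipped . '"' . $inner . '"';
        $text = $rest;
    }
    return $out . $text;    # append whatever followed the last quoted span
}

print requote_balanced(q{He wrote 'it\'s done' and left}), "\n";
```

On text with unescaped apostrophes, such as a plain "O'Reilly", the first apostrophe would be taken as an opening quote, so this only helps when the input really follows the escaping convention.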
Re: using substitution and pattern matching by borisz (Canon) on Dec 18, 2004 at 15:11 UTC
One way is `s/'(\w+)'/"$1"/g;` Boris
Re^2: using substitution and pattern matching by sauoq (Abbot) on Dec 18, 2004 at 17:08 UTC
Update: Or, I could read the problem statement better. Somehow on first reading I didn't get that he was only concerned about words within single quotes. My reply in all its mistaken glory is preserved below for posterity. There are several problems with this answer. First, your `\w+` only matches word characters. It fails to match several important things that might be found within quotations... like spaces, for instance. And punctuation. Second, and less easy to fix, is the problem that a single quote used as an apostrophe might occur prior to the quoted text. For example, what if the text were the following? O'Reilly said, 'I did not do it!' Hold on though, it gets worse. What if the text were something like this... O'Reilly swore, 'I didn't! I swear I didn't! Honestly, I didn't swindle Miss O'Keefe!' Ugh. In a case like this, it's really important to figure out what assumptions you can rely on. For instance, if we could assume that a single-quote which needs to be replaced will always be either preceded or followed by whitespace and that apostrophes will always be nestled between non-whitespace characters then our task would become exceedingly simple. That might be a bad assumption though; something like "this ol' thing?" might be present. Maybe we can assume that an ending quote mark is always preceded by punctuation and followed by whitespace. Maybe we can't. Bottom line on this poster's problem is that it can't actually be done unless some assumptions can be made and there is a way to truly differentiate an apostrophe from an opening or closing single quote mark. -sauoq "My two cents aren't worth a dime.";
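For what it's worth, the first assumption above (quote marks touch whitespace or a string edge on one side, apostrophes sit between non-whitespace characters) can be sketched with two lookaround substitutions. The sub name `requote_by_whitespace` is made up for this example, and the code is only as good as the assumption itself.

```perl
use strict;
use warnings;

# Hypothetical helper: decide quote vs. apostrophe purely by neighbouring whitespace.
sub requote_by_whitespace {
    my ($text) = @_;
    # Opening quote: at the start or after whitespace, and before a non-space.
    $text =~ s/(?<!\S)'(?=\S)/"/g;
    # Closing quote: after a non-space, and before whitespace or the end.
    $text =~ s/(?<=\S)'(?!\S)/"/g;
    return $text;
}

print requote_by_whitespace(
    "O'Reilly swore, 'I didn't! I swear I didn't! Honestly, I didn't swindle Miss O'Keefe!'"
), "\n";
# prints: O'Reilly swore, "I didn't! I swear I didn't! Honestly, I didn't swindle Miss O'Keefe!"

print requote_by_whitespace("this ol' thing?"), "\n";
# prints: this ol" thing?
```

The second call shows exactly where the assumption gives out: a trailing apostrophe is indistinguishable from a closing quote by whitespace alone.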