match substitution

ShayShay has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I have a txt document I'm trying to reformat so that it can be used in print. I read somewhere that I can use UTF-8 by specifying use utf8; at the beginning of the document, and can refer to it in regular expression substitutions or translations as in this example: $s =~ s/-/\x{2014}/g;. This should turn a hyphen into an em dash, correct? My larger problem is that I have a string on which I would like to do a global substitution. The problem is, I only want to do the substitutions on the hyphens which are surrounded by 3 digits on both sides. There are a lot of other hyphens in the string, and I don't want to have to go through and make a whole bunch of substrings or split the string into an array at those particular hyphens. Is there a way to match a regex but only substitute the hyphen in it? Here's an example of non-working code:

my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457.";
$str =~ s/\d{3}-\d{3}/\d{3}\x{2014}\d{3}/g;

(Yes, I am aware that the result would be grammatically incorrect.) Unfortunately, this throws an error: "Unrecognized escape \d passed through"


Re: match substitution by ikegami (Patriarch) on Jan 27, 2010 at 00:37 UTC
> I read somewhere that I can use UTF-8 by specifying use utf8; at the beginning of the document

`use utf8;` only specifies that the source is UTF-8. If you're reading data from a file, for example, you'll still need to decode that.

`open(my $fh, '<:encoding(UTF-8)', $qfn) or die("Can't open file \"$qfn\": $!\n");`

Don't forget to encode your output.

> `s/-/\x{2014}/g;` This should turn a hyphen into an em dash correct?

Yes. `\x{2014}` works even without `use utf8;`. It refers to character U+2014, no matter which encoding was used for the source.

> The problem is, I only want to do the substitutions on the hyphens which are surrounded by 3 digits on both sides.

The approach you are taking requires captures:

`s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g`

But captures aren't needed here.

`s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g`

(`\d` matches some pretty funky stuff in addition to `0-9`.) The latter snippet has the advantage of properly handling `123-456-789`.
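To see why the lookaround version handles `123-456-789` and the capturing version does not, here is a minimal sketch. The sub names `with_captures` and `with_lookarounds` are made up for illustration; the two regexes are the ones from the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper names; the regexes are the two from the reply.
sub with_captures {
    my ($s) = @_;
    # Each match consumes the digits on both sides of the hyphen, so the
    # middle group of 123-456-789 is used up and cannot serve as the
    # left-hand side of a second match.
    $s =~ s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g;
    return $s;
}

sub with_lookarounds {
    my ($s) = @_;
    # Only the hyphen is consumed; the digits are asserted, not matched.
    $s =~ s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g;
    return $s;
}

# Encode output as UTF-8 so the em dash prints cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

print with_captures('123-456-789'), "\n";     # second hyphen is left alone
print with_lookarounds('123-456-789'), "\n";  # both hyphens are replaced
```

With the capturing form, only the first hyphen is replaced, because the second candidate match would need to start inside text that has already been consumed.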
Re: match substitution by umasuresh (Hermit) on Jan 27, 2010 at 00:41 UTC
I tried the following back reference:

`use utf8; my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457."; $str =~ s/(\d{3})-(\d{3})/\1\_\2/g; print "$str\n"`

which produced the following result:

`725_275 is an entry - and will be at 423_569 -but- not at 012_457.`

I am getting some strange characters if I use \x{2014}! I am not sure about the hyphen to dash substitution.

Update: I just saw ikegami's reply!
Re^2: match substitution by ikegami (Patriarch) on Jan 27, 2010 at 00:42 UTC
Actually, it produces

`\1 better written as $1 at a.pl line 3.`
`\2 better written as $2 at a.pl line 3.`
`725_275 is an entry - and will be at 423_569 -but- not at 012_457.`

if you don't disable warnings. `\1` and `\2` are not valid Perl variables.
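The warning arises because `\1` is a backreference meant for the pattern side of a substitution, while the replacement side is an ordinary interpolated string where `$1` is the variable to use. A minimal sketch showing both in their proper places; the sub name `collapse_doubled` and the sample input are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: replace each doubled digit with a single copy.
sub collapse_doubled {
    my ($s) = @_;
    # \1 inside the pattern means "the same text group 1 just matched";
    # $1 in the replacement is the captured text itself.
    $s =~ s/([0-9])\1/$1/g;
    return $s;
}

print collapse_doubled('1123344'), "\n";
```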
Re^3: match substitution by umasuresh (Hermit) on Jan 27, 2010 at 01:05 UTC
Thanks for pointing it out. You are right, I get the same when I turn the warnings on!

UPDATE:

`use strict; use warnings; #use utf8; my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457."; print "before: $str\n"; $str =~ s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g; print "after: $str\n"`

produces:

`before: 725-275 is an entry - and will be at 423-569 -but- not at 012-457.`
`after: 725—275 is an entry - and will be at 423—569 -but- not at 012—457.`
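U+2014 is above U+00FF, so printing it to a handle that has no encoding layer triggers a "Wide character in print" warning, and how the result looks then depends on what the terminal expects. A minimal sketch of encoding the output, as ikegami suggested earlier in the thread; the helper name `to_utf8_bytes` is hypothetical, and UTF-8 is assumed to be the encoding wanted for the output:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a string of characters into UTF-8 bytes.
sub to_utf8_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $str = "725-275 is an entry";
$str =~ s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g;

# Option 1: encode explicitly and print the resulting bytes.
print to_utf8_bytes($str), "\n";

# Option 2: add an encoding layer once, then print characters directly.
binmode(STDOUT, ':encoding(UTF-8)');
print $str, "\n";
```

Either option works on its own; mixing them on the same handle would encode the text twice, which is why the explicit `encode` call comes before the `binmode` here.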
Re^4: match substitution by ShayShay (Acolyte) on Jan 27, 2010 at 12:25 UTC