ShayShay has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I have a txt document I'm trying to reformat so that it can be used in print. I read somewhere that I can use UTF-8 by specifying it at the beginning of the document with use utf8; and can refer to it in regular expression substitutions or translations as in this example: $s =~ s/-/\x{2014}/g;. This should turn a hyphen into an em dash, correct?

My larger problem is that I have a string on which I would like to do a global substitution. The problem is, I only want to do the substitutions on the hyphens which are surrounded by 3 digits on both sides. There are a lot of other hyphens in the string, and I don't want to have to go through and make a whole bunch of substrings or split the string into an array at those particular hyphens. Is there a way to match a regex but only substitute the hyphen in it? Here's an example of non-working code:
my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457.";
$str =~ s/\d{3}-\d{3}/\d{3}\x{2014}\d{3}/g;
(Yes, I am aware that the result would be grammatically incorrect.) Unfortunately, this throws an error: "Unrecognized escape \d passed through"

Replies are listed 'Best First'.
Re: match substitution
by ikegami (Patriarch) on Jan 27, 2010 at 00:37 UTC

    I read somewhere that I can use UTF-8 by specifying it at the beginning of the document with use utf8;

    use utf8; only specifies that the source is UTF-8. If you're reading data from a file, for example, you'll still need to decode that.

    open(my $fh, '<:encoding(UTF-8)', $qfn) or die("Can't open file \"$qfn\": $!\n");

    Don't forget to encode your output.
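
    A minimal sketch of the decode-on-input, encode-on-output round trip described above. The temporary file and its one-line content are hypothetical stand-ins for the real text document; only the open() call comes from this reply.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical input: a temporary file standing in for the real document.
    my ($tmp, $qfn) = tempfile(UNLINK => 1);
    binmode($tmp, ':encoding(UTF-8)');
    print {$tmp} "725-275 is an entry\n";
    close($tmp) or die("Can't close file \"$qfn\": $!\n");

    # Decode on input: the layer turns UTF-8 bytes into Perl characters.
    open(my $fh, '<:encoding(UTF-8)', $qfn) or die("Can't open file \"$qfn\": $!\n");
    my $line = <$fh>;
    close($fh);

    # Replace hyphens that sit between two groups of three digits.
    $line =~ s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g;

    # Encode on output; without this layer, print warns "Wide character in print".
    binmode(STDOUT, ':encoding(UTF-8)');
    print $line;
    ```

    For output to a file rather than STDOUT, the same layer can be given in the open mode, e.g. '>:encoding(UTF-8)'.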

    s/-/\x{2014}/g; This should turn a hyphen into an em dash correct?

    Yes.

    \x{2014} works even without use utf8;. It refers to character U+2014, no matter which encoding was used for the source.
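
    A small check of that claim, as a sketch: the script below deliberately omits use utf8; and still ends up with a single character whose code point is U+2014. The sample string "a-b" is made up for illustration.

    ```perl
    use strict;
    use warnings;
    # Deliberately no "use utf8;" here.

    my $s = "a-b";
    $s =~ s/-/\x{2014}/g;

    # The replacement is one character, not a sequence of bytes.
    printf "length: %d\n", length($s);
    printf "code point: U+%04X\n", ord(substr($s, 1, 1));
    ```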

    The problem is, I only want to do the substitutions on the hyphens which are surrounded by 3 digits on both sides.

    The approach you are taking requires captures:

    s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g

    But captures aren't needed here.

    s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g

    (On decoded text, \d matches any Unicode decimal digit, such as Arabic-Indic or Devanagari digits, in addition to 0-9.)

    The latter snippet has the advantage of properly handling 123-456-789.
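
    To illustrate the difference, here is a sketch that runs both substitutions from this reply on 123-456-789. The variable names are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $with_captures    = "123-456-789";
    my $with_lookarounds = "123-456-789";

    # The capture version consumes "123-456", so the next match attempt
    # starts at "-789" and no longer has three digits in front of the hyphen.
    $with_captures =~ s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g;

    # The lookaround version consumes only the hyphen, so "456" can serve as
    # the right-hand context of one match and the left-hand context of the next.
    $with_lookarounds =~ s/(?<=[0-9]{3})-(?=[0-9]{3})/\x{2014}/g;

    binmode(STDOUT, ':encoding(UTF-8)');
    print "captures:    $with_captures\n";      # second hyphen is left alone
    print "lookarounds: $with_lookarounds\n";   # both hyphens are replaced
    ```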

Re: match substitution
by umasuresh (Hermit) on Jan 27, 2010 at 00:41 UTC
    I tried the following back reference:
    use utf8;
    my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457.";
    $str =~ s/(\d{3})-(\d{3})/\1\_\2/g;
    print "$str\n"
    which produced the following result:
    725_275 is an entry - and will be at 423_569 -but- not at 012_457.

    I am getting some strange characters if I use \x{2014}! I am not sure about the hyphen to dash substitution. Update: I just saw ikegami's reply!
      Actually, it produces
      \1 better written as $1 at a.pl line 3.
      \2 better written as $2 at a.pl line 3.
      725_275 is an entry - and will be at 423_569 -but- not at 012_457.
      if you don't disable warnings. \1 and \2 are not valid Perl variables.
        Thanks for pointing it out. You are right, I get the same when I turn the warnings on!
        UPDATE:
        use strict;
        use warnings;
        #use utf8;
        my $str = "725-275 is an entry - and will be at 423-569 -but- not at 012-457.";
        print "before: $str\n";
        $str =~ s/([0-9]{3})-([0-9]{3})/$1\x{2014}$2/g;
        print "after: $str\n"
        produces:
        before: 725-275 is an entry - and will be at 423-569 -but- not at 012-457.
        after: 725—275 is an entry - and will be at 423—569 -but- not at 012—457.