davebaker has asked for the wisdom of the Perl Monks concerning the following question:

I am stumped.

    use strict;
    use warnings;
    use v5.10;

    my @lines = <DATA>;

    for my $line ( @lines ) {
        chomp $line;
        say "\$line is *$line*";
    }

    __DATA__
    hello\n\ugoodbye
    good\nevil
    yes\nno

produces

    $line is *hello\n\ugoodbye*
    $line is *good\nevil*
    $line is *yes\nno*

rather than

    $line is *hello
    goodbye*
    $line is *good
    evil*
    $line is *yes
    no*

Even though a 2-letter character string looks like one of the "backslashed character escapes" described at p. 68 of Programming Perl (4th ed.) -- here, the newline -- it seems impossible to make those escapes actually "work", if read from a data file. Is there an interpolation rule or exception I'm missing?

Some of the other character escapes listed on that page are \r (carriage return) and \t (horizontal tab).

In the same way, the "translation escapes" described at p. 69, such as \u, don't seem to "work." If hello\n\ugoodbye in the data file (above) is changed to \uhello\n\ugoodbye, the printed result is *\uhello\n\ugoodbye*. I'm looking for

    *Hello
    Goodbye*

I get the same result if I rewrite the program to read from a text file rather than using a __DATA__ block.


Replies are listed 'Best First'.
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by AnomalousMonk (Archbishop) on Mar 15, 2021 at 17:05 UTC

    The __DATA__ (a.k.a. __END__) block doesn't do interpolation. (Update: Nor \u \U \l \L \Q \E \t \n \r \etc similarly.)

    Also remember that by default, you're reading the lines using \n (newline) as a line delimiter, so if you should succeed in embedding \n into a line that you write to a file, you'll end up with two lines!
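A minimal sketch of the point above, using an in-memory filehandle as a stand-in for a __DATA__ block or a file on disk (the variable names are illustrative, not from the original post). The backslash and the "n" arrive as two ordinary characters, whereas the same text typed as a double-quoted literal is turned into a newline when the script is compiled.

```perl
use strict;
use warnings;

# Stand-in for a data file: single quotes keep the backslash and the "n"
# as two separate characters, just as they would sit on disk.
my $file_content = 'hello\ngoodbye' . "\n";

open my $fh, '<', \$file_content or die "Cannot open in-memory file: $!";
chomp( my $from_file = <$fh> );
close $fh;

# The same text typed as a double-quoted literal in the source code.
my $from_literal = "hello\ngoodbye";

printf "from file:    %d characters\n", length $from_file;
printf "from literal: %d characters\n", length $from_literal;
```

The string read from the filehandle is one character longer, because nothing ever translated its backslash-n pair.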


    Give a man a fish:  <%-{-{-{-<

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by pryrt (Abbot) on Mar 15, 2021 at 17:23 UTC
    ++AnomalousMonk explained the "why".

    Here's something you can do to get the string in the format I think you want: the String::Interpolate module allows you to pass a string with character escapes and get back the string that I believe you desire.

        C:\usr\local\share>perl
        use strict;
        use warnings;
        use v5.10;
        use String::Interpolate qw/interpolate/;

        my @lines = <DATA>;

        for my $line ( @lines ) {
            chomp $line;
            say "data line is *$line*";
            my $real = interpolate($line);
            say "real line is *$real*";
        }

        __DATA__
        hello\ngoodbye
        good\nevil
        yes\nno
    yields
        data line is *hello\ngoodbye*
        real line is *hello
        goodbye*
        data line is *good\nevil*
        real line is *good
        evil*
        data line is *yes\nno*
        real line is *yes
        no*

    note: I've never used it before today; I found it via a quick search, and its description and above behavior seemed right.

    Also, I don't have Programming Perl at hand to check page numbers, but "translation escapes" sound like escapes that are only available to the regex substitution engine and not to normal strings, so those will likely not be available even in String::Interpolate (unchecked).

      String::Interpolate

      However, it'll also interpolate @{[ `rm -rf /` ]}... depending on how many characters OP wants interpolated, s/\\n/\n/g might be the safer approach.
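One way to generalise the s/\\n/\n/g idea without evaluating anything is a small lookup table. This is only a sketch: the %escape table and the unescape_basic name are hypothetical, and the set of supported escapes is an assumption about what the OP needs.

```perl
use strict;
use warnings;

# Hypothetical whitelist: only these two-character sequences are translated.
my %escape = (
    'n'  => "\n",
    't'  => "\t",
    'r'  => "\r",
    '\\' => "\\",
);

# Replace a backslash followed by a whitelisted character; anything else
# (including "$", "@" and backticks) is left exactly as it was read.
sub unescape_basic {
    my ($text) = @_;
    $text =~ s/\\([ntr\\])/$escape{$1}/g;
    return $text;
}

my $raw  = 'hello\ngoodbye @{[ `echo unsafe` ]}';
my $real = unescape_basic($raw);
print "$real\n";
```

Because the replacement is a plain hash lookup, the backtick expression in the sample string is printed as text and never executed. Adding another escape means adding one table entry and one character to the character class.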

        Like I said, I had never used the String::Interpolate module before. Because they had the safe versions of their functions, I thought that they protected against things like that... but it appears not.

        If the OP wants more than just \n available as escapes in his processing, however, the list of substitutions will creep up as the rules/desires change... especially with the translation escapes. I was hoping that String::Interpolate was a way to avoid re-inventing that wheel, but I guess not.

      ... "translation escapes" sound like escapes that are only available to the regex substitution engine and not to normal strings ...

      The \u \U \l \L \Q \E "translation escapes" are also available to the double-quote operator:

          Win8 Strawberry 5.8.9.5 (32) Mon 03/15/2021 12:41:31
          C:\@Work\Perl\monks
          >perl -Mstrict -Mwarnings
          my $foo = "%^&*(";
          my $bar = 'Bar';

          my $string = "\U$bar\E \l\U$bar\E \Q$foo\E \n";
          print $string;
          ^Z
          BAR bAR \%\^\&\*\(

      Update: In fact, I think the only reason translation escapes are available in regular expressions is that double-quotish interpolation is done very early in the compilation of a regex. IIUC, the first two steps are:

      1. Find the end of the expression;
      2. Do double-quotish interpolation.
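A small sketch of what those two steps imply (the variable names are made up for illustration): by the time the regex engine sees the pattern, \U...\E has already upper-cased the interpolated value.

```perl
use strict;
use warnings;

my $bar = 'Bar';

# \U...\E upper-cases the interpolated value before the pattern is
# compiled, so this pattern is effectively /\ABAR\z/.
my $matches_upper = ( 'BAR' =~ /\A\U$bar\E\z/ ) ? 1 : 0;
my $matches_mixed = ( 'Bar' =~ /\A\U$bar\E\z/ ) ? 1 : 0;

print "BAR matches: $matches_upper\n";
print "Bar matches: $matches_mixed\n";
```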


      Give a man a fish:  <%-{-{-{-<

        "translation escapes" are also available to the double-quote operator:

        Cool. Learned something new today. :-)

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by haukex (Archbishop) on Mar 15, 2021 at 17:48 UTC

    Note that the other suggested solutions of String::Interpolate and eval introduce security risks. For something less risky, see String::Unescape, which only works with a subset of Perl's escape sequences, and AFAICT from a quick look at the source none of them appear to be security-relevant.

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by GrandFather (Saint) on Mar 15, 2021 at 20:46 UTC

    By now you've got "why it happens" and "how to work around it". But I'm interested to know "why do you want that?" Maybe we can help you find a better way to achieve your end goal if we know what that is.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      Thanks. My end goal was the replacement of several different strings that appear in an unknown number of text files in a particular directory. Some of the strings include a line feed. All of the strings would be replaced by a particular piece of text.

      (Backstory: The text files are newsletters that are archived on the web, the text of which sometimes includes an address that's being picked up by spammers scraping the newsletters over the web. The goal is to scrub the email address, which appears in one of several different standard sentences, and replace it with generic contact information.)

      Reach Holly Smith for help by sending an email
      to hollysmith@nosuchdomain.com.
      

      and

      For more information, contact Holly Smith at
      hollysmith@nosuchdomain.com.
      

      and

      For more information, contact Holly Smith at (800) 555-1212 or
      via email to hollysmith@nosuchdomain.com
      

      In each case, I want to replace those sentences (which have line breaks where you see them, e.g. after "sending an email" in the first one) with "For more information, contact Holly Smith." (without the quotation marks).

      The reason I got balled up with the interpolation issue is that I first tried to use a DATA block in a small script, as in:

          __DATA__
          Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com.~For more information, contact Holly Smith.
          For more information, contact Holly Smith at\nhollysmith@nosuchdomain.com.~For more information, contact Holly Smith.
          For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith@nosuchdomain.com~For more information, contact Holly Smith.

      (There are only three lines in the DATA block, even though they almost certainly will wrap on this web page.)

      My script would read each of those lines in the DATA block, load variables $old_string and $new_string by splitting on the "~" character, and then do a

      if ($slurped_file =~ s/\Q$old_string\E/$new_string/g ) {

      kind of thing to make the replacement, ultimately resulting in updated text that would be used to replace the existing file.

      And now you see why it didn't work :-) I had used the \Q in order to escape the domain name's period and the set of parentheses in one targeted string, but of course \Q wants to do what \Q does, so it also escapes the backslash in "\n" in my targeted strings. Hence the "\n" in my DATA block records didn't seem to "work." The substitutions never took place because the files don't have strings that include a literal \ followed by n.
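The effect described above can be reproduced in a few lines. This is a sketch with a shortened, made-up sample text; quotemeta is the function form of \Q...\E.

```perl
use strict;
use warnings;

# As read from a data file: backslash + "n" are two ordinary characters.
my $old_string = 'an email\nto hollysmith@nosuchdomain.com.';
my $text       = "sending an email\nto hollysmith\@nosuchdomain.com.\n";

# quotemeta() does the same job as \Q...\E inside a pattern.
my $quoted = quotemeta $old_string;
print "quoted pattern: $quoted\n";

# With \Q the backslash itself is escaped, so the pattern looks for a
# literal backslash followed by "n" and does not match the real newline.
my $with_q = ( $text =~ /\Q$old_string\E/ ) ? 1 : 0;

# Without \Q the regex engine sees \n and matches the newline (but the
# dots are then wildcards).
my $without_q = ( $text =~ /$old_string/ ) ? 1 : 0;

print "match with \\Q:    $with_q\n";
print "match without \\Q: $without_q\n";
```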

      Before I realized the \Q issue, though, I was convinced that something about the use of a __DATA__ section for the data had to be preventing the \n from being interpolated, and I thought I needed interpolation so that I could put, on a single line in the DATA block, an expression that would match the line breaks that occur in all three target strings. So I created the little test script shown in my original post in order to make sure \n would be the proper way to represent such a line feed. And I couldn't get the \n to turn into a line feed. I seem to have stumbled onto something that turns out to be unrelated to solving my problem!

      Basically I failed to remember that merely putting a string into a variable doesn’t cause the string to be interpolated. Otherwise there would be trouble every time a graphic file is read from disk, if its content included the character sequence “\n”, for example. (I think.)

      So this code is doing what I need it to do, even with data in $old_string that’s coming from a DATA block and includes embedded “\n” strings:

      if ($slurped_file =~ s/$old_string/$new_string/g ) {

      The two-character \n string in $old_string is still the two-character \n string when the regular expression in the substitution function is built, resulting in something like "s/Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com/For more information, contact Holly Smith./" (without the quotation marks).

      I did need to revise my DATA block a bit, to escape a couple of parentheses that otherwise would be treated as grouping operators in the regular expression (and I escaped the periods in order to avoid the inefficiency of the substitution operator treating them like wild cards):

          __DATA__
          Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.~For more information, contact Holly Smith.
          For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.~For more information, contact Holly Smith.
          For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com~For more information, contact Holly Smith.
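Putting the pieces together, a minimal sketch of the split-and-substitute step might look like this. The sample newsletter text and the $record variable are stand-ins invented for the example; in the real script the record would come from a chomped line of the DATA block.

```perl
use strict;
use warnings;

# Stand-in for the contents of one newsletter file.
my $slurped_file = <<'END_TEXT';
Welcome to the newsletter.
For more information, contact Holly Smith at (800) 555-1212 or
via email to hollysmith@nosuchdomain.com
See you next month.
END_TEXT

# Stand-in for one chomped record of the escaped __DATA__ block.
my $record = 'For more information, contact Holly Smith at \(800\) 555-1212 or\n'
           . 'via email to hollysmith@nosuchdomain\.com~For more information, contact Holly Smith.';

# Split into at most two fields on the "~" separator.
my ( $old_string, $new_string ) = split /~/, $record, 2;

# $old_string is used as a regex, so its \n matches a real newline.
my $count = ( $slurped_file =~ s/$old_string/$new_string/g ) || 0;

print "replacements: $count\n";
print $slurped_file;
```

The limit of 2 on split is a precaution so that a "~" inside the replacement text would not be lost.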

        A further elaboration is to use \s+ in place of a literal space in your pattern strings (in either an array of strings or in __DATA__ records). So
            'Reach Holly Smith for help ...'
        might look like
            'Reach \s+ Holly \s+ Smith \s+ for \s+ help \s+ ...'
        Because the \s whitespace class includes \n, this has the advantage that a newline or any other combination of whitespace may appear anywhere in the target string and will be matched and replaced. E.g., the target string may be broken over any number of lines in the target text. Important Note: The s///x substitution must use the /x modifier to allow \s+ sub-patterns surrounded by spaces (for readability) to be sprinkled all over the place.
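A short sketch of that idea, with a made-up target text whose line break falls somewhere other than in the pattern as originally written:

```perl
use strict;
use warnings;

# Pattern written for /x: the plain spaces are ignored, each \s+ matches
# one or more whitespace characters, including newlines.
my $pattern = 'Reach \s+ Holly \s+ Smith \s+ for \s+ help';

# Hypothetical text in which the sentence is broken in a different place.
my $text = "Reach Holly\nSmith   for help by sending an email";

( my $edited = $text ) =~ s/$pattern/Contact Holly Smith for help/x;
print "$edited\n";
```

The /x modifier applies to the interpolated pattern as well, which is why the readable spacing inside $pattern does not have to match anything in the text.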


        Give a man a fish:  <%-{-{-{-<

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by haukex (Archbishop) on Mar 16, 2021 at 17:46 UTC

    Considering your responses further down in the thread, where you say you're actually looking to store sequences of search strings and replacements, I think perhaps it might be better to look at existing data serialization formats instead of inventing your own. For example, JSON is very common nowadays, or YAML is sometimes a bit more human-readable. In addition, since your search strings appear to be fairly similar, it's worth investigating whether a regex or two can match your search strings with more flexibility instead of listing all the search strings.

        use warnings;
        use strict;
        use JSON::PP;
        use YAML::PP;

        my $data = [
            { search=>"Reach Holly Smith for help by sending an email\nto hollysmith\@nosuchdomain.com.",
              replacement=>"For more information, contact Holly Smith." },
            { search=>"For more information, contact Holly Smith at\nhollysmith\@nosuchdomain.com.",
              replacement=>"For more information, contact Holly Smith." },
            { search=>"For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith\@nosuchdomain.com",
              replacement=>"For more information, contact Holly Smith." },
        ];

        my $json = JSON::PP->new->pretty;
        print $json->encode($data);

        my $yaml = YAML::PP->new;
        print $yaml->dump_string($data);

        __END__

        [
           {
              "search" : "Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com.",
              "replacement" : "For more information, contact Holly Smith."
           },
           {
              "replacement" : "For more information, contact Holly Smith.",
              "search" : "For more information, contact Holly Smith at\nhollysmith@nosuchdomain.com."
           },
           {
              "replacement" : "For more information, contact Holly Smith.",
              "search" : "For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith@nosuchdomain.com"
           }
        ]
        ---
        - replacement: For more information, contact Holly Smith.
          search: |-
            Reach Holly Smith for help by sending an email
            to hollysmith@nosuchdomain.com.
        - replacement: For more information, contact Holly Smith.
          search: |-
            For more information, contact Holly Smith at
            hollysmith@nosuchdomain.com.
        - replacement: For more information, contact Holly Smith.
          search: |-
            For more information, contact Holly Smith at (800) 555-1212 or
            via email to hollysmith@nosuchdomain.com

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by NetWallah (Canon) on Mar 15, 2021 at 17:36 UTC
    Alternative to pryrt's module-based solution:

    my $line2; eval "\$line2=\"$line\""; # Now you can print $line2

                    "Avoid strange women and temporary variables."

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by LanX (Saint) on Mar 16, 2021 at 04:28 UTC
    Interpolation happens only in double-quoted strings.

    This includes qq, here-docs and regex arguments for m//, s/// and split.

    There are plenty of alternatives to DATA, which is itself an embedded file with "raw" data.

    A \n there is just two characters. Better to use the appropriate ASCII code for a newline.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      So, to be able to search for this 2-line string in a document that uses Unix-type line endings:

      Reach Holly Smith for help by sending an email
      To hollysmith@nosuchdomain.com.
      

      .... the character-specific way to do it (rather than using \n) would have been to store the “old string”~“new string” line in a __DATA__ block, like this, using the Unix-type newline character (code point 10, or 0A in hexadecimal):

      __DATA__
      Reach Holly Smith for help by sending an email\x{0A}to hollysmith@nosuchdomain.com.~For more information, contact Holly Smith.

      ...where the preceding line is a single line even though it almost certainly will wrap on this web page; is that what you mean?

      I think I’m coming to see the point of being able to use \n in a regular expression — it avoids the need to specify what the characters are that represent a new line in the operating system being used to write and read the file, and it’s a memorable way to avoid having to look up the code point for those characters. Plus a manual entry of a new line at the keyboard while creating a line-by-line data file would gunk up the use of

      while ( <DATA> )

        Update: davebaker changed this post (without citation!) while I was composing this reply.


        ... a data block or file that contains “old string”~“new string” lines:

        __DATA__
        ... an email\x{0A}to ....~For more information, ....
        (Is that what you mean?)

        No. (Well, at least that's not the point I would make. :)

        The point I would make is that the string you get | read from a __DATA__ or __END__ block or from a regular file is essentially the same as a single-quoted string defined in a script, and such a string can be used directly as a regex search pattern:

            Win8 Strawberry 5.8.9.5 (32) Tue 03/16/2021 17:26:59
            C:\@Work\Perl\monks\davebaker
            >perl -Mstrict -Mwarnings
            my $s = 'foo
            bar';
            print "A: >>$s<< \n";

            my $search = 'foo\nbar';  # note single quotes!
            print "B: >>$search<< \n";  # \n is '\n'

            my $replace = "hoo-ray";  # can be single/double quotes

            $s =~ s/$search/$replace/;  # no /g - one replacement only
            print "C: >>$s<< \n";
            ^Z
            A: >>foo
            bar<<
            B: >>foo\nbar<<
            C: >>hoo-ray<<

        If the search string/pattern is held in a file, the process is similar, except you usually need to chomp the string before you use it:

            Win8 Strawberry 5.8.9.5 (32) Tue 03/16/2021 17:28:26
            C:\@Work\Perl\monks\davebaker
            >type search.dat
            foo\nbar

            >perl -Mstrict -Mwarnings
            my $s = 'foo
            bar';
            print "A: >>$s<< \n";

            open my $fh, '<', 'search.dat' or die "opening: $!";
            chomp(my $search = <$fh>);
            print "B: >>$search<< \n";  # \n is essentially '\n'

            my $replace = "hoo-ray";

            $s =~ s/$search/$replace/;
            print "C: >>$s<< \n";
            ^Z
            A: >>foo
            bar<<
            B: >>foo\nbar<<
            C: >>hoo-ray<<

        I think that if you use '\n' (or the equivalent from a file) in a regex search pattern and if you use default I/O for reading all your files, then you will be able to do automatic text editing in an OS-agnostic way, at least across the Windows/*nix iron curtain. The '\n' sequence in a regex is the universal representation of a default newline.

        (In general, I think use of qr// is definitely best practice for defining search regexes in a script, not single- or double-quoted strings, but if you're reading from a file, you're kinda stuck with what you've got.)


        Give a man a fish:  <%-{-{-{-<

        I'm not sure I understand your question.

        If you want OS sensitive line breaks in DATA you'll need to translate them.

        I'd suggest using plain "enter" in the input and

        chomp( my @lines = <DATA> );   # drop the line endings that were read
        $output = join "\n", @lines;

        In case of OS problems when reading DATA by line, just adjust $/ before. (Never happened to me *)

        > Is that what you mean?

        I suggested using here-docs instead of DATA. They are interpolated by default.
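A minimal sketch of the difference, using a tab escape so the result is easy to measure (the variable names are illustrative): a here-doc with a bare or double-quoted terminator is interpolated, while a single-quoted terminator leaves the body raw, much like __DATA__.

```perl
use strict;
use warnings;

# Bare or double-quoted terminator: the body is interpolated.
my $interpolated = <<"END";
hello\tgoodbye
END

# Single-quoted terminator: the body is taken literally, like __DATA__.
my $literal = <<'END';
hello\tgoodbye
END

chomp( $interpolated, $literal );
printf "interpolated: %d characters\n", length $interpolated;
printf "literal:      %d characters\n", length $literal;
```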

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        update

        *) Actually, this problem can't arise, because Perl reads DATA like its own code; it's the same filehandle. I.e., the script won't run if there were any problems with OS-specific line endings.

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file
by davebaker (Pilgrim) on Mar 16, 2021 at 22:36 UTC

    The Take-away for Me

    These comments have been so helpful. I've learned some important things about interpolation, and DATA blocks.

    At the end of the day, it seems the use of a parseable DATA block was probably too lazy and fancy at the same time, just to be able to write down and then use an easily editable list of regular expressions, which I would use a text editor to change in the future. The more straightforward way is to code the list of regular expressions into a Perl data structure in the script, in much the same way that one would code the value of a constant towards the top of the script, so that it can be spotted and edited in the future.

        my @reg_exes = (
            q|Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.|,
            q|For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.|,
            q|For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com|,
        );

        my $new_string = 'For more information, contact Holly Smith.';

        # Then, after creating a list of eligible files in a particular directory
        # and then opening each file to slurp its contents into $slurped_file
        # (code not shown here)...

        for my $reg_ex ( @reg_exes ) {
            if ( $slurped_file =~ s/$reg_ex/$new_string/g ) {
                print "Matched one or more occurrences of reg ex '$reg_ex' and substituted '$new_string' each time\n";
            }
        }

        # Then store the changed file, etc.

    Given that the substitution operation potentially will be applied to several thousand files, it's probably better to precompile the regular expressions:

        my @patterns = (
            q|Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.|,
            q|For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.|,
            q|For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com|,
        );

        # Compile each pattern once, up front.
        my @reg_exes;
        for my $pat ( @patterns ) {
            push @reg_exes, qr/$pat/;
        }

        my $new_string = 'For more information, contact Holly Smith.';

        # Then, after creating a list of eligible files in a particular directory
        # and then opening each file to slurp its contents into $slurped_file
        # (code not shown here)...

        for my $reg_ex ( @reg_exes ) {
            if ( $slurped_file =~ s/$reg_ex/$new_string/g ) {
                print "Matched one or more occurrences of reg ex '$reg_ex' and substituted '$new_string' each time\n";
            }
        }

        # Then store the changed file, etc.

      I'd recommend
          my @here_docs = ( <<'__RE__', <<'__RE__', <<'__RE__' );
          Reach Holly Smith for help by sending an email
          to hollysmith@nosuchdomain.com.
          __RE__
          For more information, contact Holly Smith at
          hollysmith@nosuchdomain.com.
          __RE__
          For more information, contact Holly Smith at (800) 555-1212 or
          via email to hollysmith@nosuchdomain.com
          __RE__

      in combination with quotemeta

      Not sure what the q|...\n...| in single quotes in your code is supposed to do. Do you plan to match \ and n literally?

      update

      ok I got the literal \n part
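The here-doc plus quotemeta suggestion could be sketched as follows. The chomp step, the use of qr// with \Q...\E, and the sample $slurped_file text are assumptions added for the example, not part of the recommendation itself.

```perl
use strict;
use warnings;

# Single-quoted here-docs hold the sentences exactly as they appear in the
# files, line breaks included.
my @here_docs = ( <<'__RE__', <<'__RE__' );
For more information, contact Holly Smith at
hollysmith@nosuchdomain.com.
__RE__
For more information, contact Holly Smith at (800) 555-1212 or
via email to hollysmith@nosuchdomain.com
__RE__

# Drop the final newline of each here-doc, then escape every
# metacharacter (\Q...\E is quotemeta) and precompile.
chomp @here_docs;
my @reg_exes = map { qr/\Q$_\E/ } @here_docs;

my $new_string   = 'For more information, contact Holly Smith.';
my $slurped_file = "News.\nFor more information, contact Holly Smith at\n"
                 . "hollysmith\@nosuchdomain.com.\nBye.\n";

$slurped_file =~ s/$_/$new_string/g for @reg_exes;
print $slurped_file;
```

With this approach the search strings need no hand-written escapes at all, at the cost of matching only the exact line breaks typed into the here-docs.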

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery