Can't get \n or other character/translation escapes to interpolate if originally read from a data file

davebaker has asked for the wisdom of the Perl Monks concerning the following question:

Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by AnomalousMonk (Archbishop) on Mar 15, 2021 at 17:05 UTC
The `__DATA__` (a.k.a. `__END__`) block doesn't do interpolation. (Update: Nor `\u \U \l \L \Q \E \t \n \r \etc` similarly.) Also remember that by default, you're reading the lines using `\n` (newline) as a line delimiter, so if you should succeed in embedding `\n` into a line that you write to a file, you'll end up with two lines! Give a man a fish: `<%-{-{-{-<`
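A minimal sketch of the point above. To stay self-contained it reads from an in-memory filehandle rather than a real `__DATA__` block (an assumption made for illustration; the variable names are hypothetical), but the behaviour is the same: what comes off a filehandle is raw characters, never double-quote interpolated.

```perl
use strict;
use warnings;

# Stand-in for a __DATA__ block or data file: single quotes keep the
# backslash and the "n" as two separate characters, just as a file would.
my $file_contents = 'hello\ngoodbye' . "\n";

# Open an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$file_contents or die "cannot open in-memory file: $!";
chomp( my $line = <$fh> );
close $fh;

# Double quotes in source code: here \n is a single newline character.
my $literal = "hello\ngoodbye";

printf "read from handle: %d characters\n", length $line;       # 14
printf "double-quoted:    %d characters\n", length $literal;    # 13
print "the line read from the handle still contains a backslash\n"
    if index( $line, '\\' ) >= 0;
```

The one-character difference in length is the backslash that was never turned into anything.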
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by pryrt (Abbot) on Mar 15, 2021 at 17:23 UTC
++AnomalousMonk explained the "why". Here's something you can do to get the string in the format I think you want: the String::Interpolate module allows you to pass a string with character escapes and get back the string that I believe you desire.

    C:\usr\local\share>perl
    use strict;
    use warnings;
    use v5.10;
    use String::Interpolate qw/interpolate/;
    my @lines = <DATA>;
    for my $line ( @lines ) {
        chomp $line;
        say "data line is $line";
        my $real = interpolate($line);
        say "real line is $real";
    }
    __DATA__
    hello\ngoodbye
    good\nevil
    yes\nno

yields

    data line is hello\ngoodbye
    real line is hello
    goodbye
    data line is good\nevil
    real line is good
    evil
    data line is yes\nno
    real line is yes
    no

Note: I've never used it before today; I found it via a quick search, and its description and the behavior above seemed right. Also, I don't have Programming Perl at hand to check page numbers, but "translation escapes" sound like escapes that are only available to the regex substitution engine and not to normal strings, so those will likely not be available even in String::Interpolate (unchecked).
Re^2: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by haukex (Archbishop) on Mar 15, 2021 at 17:35 UTC
Re String::Interpolate: however, it'll also interpolate ``@{[ `rm -rf /` ]}``... Depending on how many characters the OP wants interpolated, `s/\\n/\n/g` might be the safer approach.
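The `s/\\n/\n/g` idea can be stretched to a small whitelist of escapes. The sketch below is only an illustration: the hash `%escape_for` and the sub `unescape_whitelisted` are hypothetical names, and the set of escapes handled is an assumption. Anything not on the list, including ``@{[ `...` ]}``, is left untouched and never executed.

```perl
use strict;
use warnings;

# Hypothetical whitelist: only these escapes are translated; extend as needed.
my %escape_for = (
    n    => "\n",
    t    => "\t",
    r    => "\r",
    '\\' => '\\',
);

# Replace a backslash followed by a whitelisted character;
# every other character sequence passes through unchanged.
sub unescape_whitelisted {
    my ($text) = @_;
    $text =~ s/\\([ntr\\])/$escape_for{$1}/g;
    return $text;
}

# Single quotes: this is exactly what a line read from a file would hold.
my $raw  = 'hello\ngoodbye @{[ `echo oops` ]}';
my $real = unescape_whitelisted($raw);
print "$real\n";
```

Because this is a plain substitution with no `eval`, the backtick expression stays inert text in the result.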
Re^3: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by pryrt (Abbot) on Mar 15, 2021 at 19:06 UTC
Like I said, I had never used the String::Interpolate module before. Because they had the `safe` versions of their functions, I thought that they protected against things like that... but it appears not. If the OP wants more than just \n available as escapes in his processing, however, the list of substitutions will creep up as the rules/desires change... especially with the translation escapes. I was hoping that String::Interpolate was a way to avoid re-inventing that wheel, but I guess not.
Re^4: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by haukex (Archbishop) on Mar 15, 2021 at 19:12 UTC
Re^2: Can't get \n or other character/translation escapes to interpolate if originally read from a data file (updated) by AnomalousMonk (Archbishop) on Mar 15, 2021 at 17:53 UTC
... "translation escapes" sound like escapes that are only available to the regex substitution engine and not to normal strings ...

The `\u \U \l \L \Q \E` "translation escapes" are also available to the double-quote operator:

    Win8 Strawberry 5.8.9.5 (32) Mon 03/15/2021 12:41:31
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings
    my $foo = "%^&(";
    my $bar = 'Bar';
    my $string = "\U$bar\E \l\U$bar\E \Q$foo\E \n";
    print $string;
    ^Z
    BAR bAR \%\^\&\(

Update: In fact, I think the only reason translation escapes are available in regexes is that double-quotish interpolation is done very early in the compilation of a regex. IIUC, the first two steps are: find the end of the expression; do double-quotish interpolation.
Re^3: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by pryrt (Abbot) on Mar 15, 2021 at 19:02 UTC
"translation escapes" are also available to the double-quote operator: Cool. Learned something new today. :-)	[reply]
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by haukex (Archbishop) on Mar 15, 2021 at 17:48 UTC
Note that the other suggested solutions of String::Interpolate and eval introduce security risks. For something less risky, see String::Unescape, which only works with a subset of Perl's escape sequences, and AFAICT from a quick look at the source none of them appear to be security-relevant.
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by GrandFather (Saint) on Mar 15, 2021 at 20:46 UTC
By now you've got "why it happens" and "how to work around it". But I'm interested to know "why do you want that?" Maybe we can help you find a better way to achieve your end goal if we know what that is. Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re^2: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by davebaker (Pilgrim) on Mar 16, 2021 at 02:16 UTC
Thanks. My end goal was the replacement of several different strings that appear in an unknown number of text files in a particular directory. Some of the strings include a line feed. All of the strings would be replaced by a particular piece of text. (Backstory: The text files are newsletters that are archived on the web, the text of which sometimes includes an address that's being picked up by spammers scraping the newsletters over the web. The goal is to scrub the email address, which appears in one of several different standard sentences, and replace it with generic contact information.)

    Reach Holly Smith for help by sending an email
    to hollysmith@nosuchdomain.com.

and

    For more information, contact Holly Smith at
    hollysmith@nosuchdomain.com.

and

    For more information, contact Holly Smith at (800) 555-1212 or
    via email to hollysmith@nosuchdomain.com

In each case, I want to replace those sentences (which have line breaks where you see them, e.g. after "sending an email" in the first one) with "For more information, contact Holly Smith." (without the quotation marks). The reason I got balled up with the interpolation issue is that I first tried to use a DATA block in a small script, as in:

    __DATA__
    Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com.~For more information, contact Holly Smith.
    For more information, contact Holly Smith at\nhollysmith@nosuchdomain.com.~For more information, contact Holly Smith.
    For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith@nosuchdomain.com~For more information, contact Holly Smith.

(There are only three lines in the DATA block, even though they almost certainly will wrap on this web page.)
My script would read each of those lines in the DATA block, load variables $old_string and $new_string by splitting on the "~" character, and then do a `if ($slurped_file =~ s/\Q$old_string\E/$new_string/g ) {` kind of thing to make the replacement, ultimately resulting in updated text that would be used to replace the existing file. And now you see why it didn't work :-) I had used the \Q in order to escape the domain name's period and the set of parentheses in one targeted string, but of course \Q wants to do what \Q does, so it also escapes the backslash in "\n" in my targeted strings. Hence the "\n" in my DATA block records didn't seem to "work." The substitutions never took place because the files don't have strings that include a literal \ followed by n. Before I realized the \Q issue, though, I was convinced that something about the use of a __DATA__ section for the data had to be preventing the \n from being interpolated, and I thought I needed interpolation so that I could put, on a single line in the DATA block, an expression that would match the line breaks that occur in all three target strings. So I created the little test script shown in my original post in order to make sure \n would be the proper way to represent such a line feed. And I couldn't get the \n to turn into a line feed. I seem to have stumbled onto something that turns out to be unrelated to solving my problem! Basically I failed to remember that merely putting a string into a variable doesn't cause the string to be interpolated. Otherwise there would be trouble every time a graphic file is read from disk, if its content included the character sequence "\n", for example. (I think.)
So this code is doing what I need it to do, even with data in $old_string that's coming from a DATA block and includes embedded "\n" strings: `if ($slurped_file =~ s/$old_string/$new_string/g ) {` The two-character \n string in $old_string is still the two-character \n string when the regular expression in the substitution operator is built, resulting in something like "s/Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com/For more information, contact Holly Smith./" (without the quotation marks). The regex engine itself then treats that \n as a newline. I did need to revise my DATA block a bit, to escape a couple of parentheses that otherwise would be treated as grouping operators in the regular expression (and I escaped the periods so that the substitution operator doesn't treat them as wildcards):

    __DATA__
    Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.~For more information, contact Holly Smith.
    For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.~For more information, contact Holly Smith.
    For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com~For more information, contact Holly Smith.
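The `\Q` effect described above can be seen in a few lines. This is only a sketch with a shortened, hypothetical target string, not the script from the post:

```perl
use strict;
use warnings;

my $text       = "by sending an email\nto hollysmith\@nosuchdomain.com.";
my $old_string = 'email\nto';    # as read from a data file: backslash + n

# Interpolated as a pattern, the regex engine itself reads \n as a newline.
my $plain_match = $text =~ /$old_string/ ? 1 : 0;

# \Q backslash-escapes the backslash, so the pattern now demands a
# literal backslash followed by "n", which the text does not contain.
my $quoted_match = $text =~ /\Q$old_string\E/ ? 1 : 0;

print "without \\Q: $plain_match\n";     # 1
print "with \\Q:    $quoted_match\n";    # 0
print "quotemeta gives: ", quotemeta($old_string), "\n";    # email\\nto
```

So the choice is between hand-escaping the metacharacters in the data (keeping `\n` meaningful) or storing real newlines and letting `\Q` escape everything.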
Re^3: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by AnomalousMonk (Archbishop) on Mar 16, 2021 at 04:55 UTC
A further elaboration is to use `\s+` in place of a literal space in your pattern strings (in either an array of strings or in `__DATA__` records). So `'Reach Holly Smith for help ...'` might look like `'Reach \s+ Holly \s+ Smith \s+ for \s+ help \s+ ...'` Because the `\s` whitespace class includes `\n`, this has the advantage that a newline or any other combination of whitespace may appear anywhere in the target string and will be matched and replaced. E.g., the target string may be broken over any number of lines in the target text. Important note: the `s///` substitution must use the `/x` modifier so that `\s+` sub-patterns surrounded by spaces (for readability) can be sprinkled all over the place.
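A variation on the same idea, sketched under the assumption that the target sentences are kept as plain text rather than as hand-written patterns: build the `\s+`-joined regex programmatically. The sub name `flexible_pattern` is hypothetical, and this version does not need `/x` because the pattern contains no decorative spaces.

```perl
use strict;
use warnings;

# Turn a plain sentence into a pattern that tolerates any run of whitespace
# (spaces, tabs, newlines) between words. Each word goes through quotemeta
# so that characters such as "." "(" and "@" match literally.
sub flexible_pattern {
    my ($sentence) = @_;
    my @words  = split ' ', $sentence;
    my $joined = join '\s+', map { quotemeta($_) } @words;
    return qr/$joined/;
}

my $pattern = flexible_pattern(
    'For more information, contact Holly Smith at (800) 555-1212 or via email to hollysmith@nosuchdomain.com'
);

# Sample text whose line breaks fall in different places than expected.
my $newsletter = "News.\nFor more information, contact Holly Smith at (800)\n"
    . "555-1212 or via\n  email to hollysmith\@nosuchdomain.com\nBye.\n";

( my $scrubbed = $newsletter ) =~ s/$pattern/For more information, contact Holly Smith./g;
print $scrubbed;
```

The sentence matches even though the sample text breaks it after "(800)" and after "via", which a fixed `\n` in the pattern would not allow.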
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by haukex (Archbishop) on Mar 16, 2021 at 17:46 UTC
Considering your responses further down in the thread, where you say you're actually looking to store sequences of search strings and replacements, I think perhaps it might be better to look at existing data serialization formats instead of inventing your own. For example, JSON is very common nowadays, or YAML is sometimes a bit more human-readable. In addition, since your search strings appear to be fairly similar, it's worth investigating whether a regex or two can match your search strings with more flexibility instead of listing all the search strings.

    use warnings;
    use strict;
    use JSON::PP;
    use YAML::PP;

    my $data = [
        { search=>"Reach Holly Smith for help by sending an email\nto hollysmith\@nosuchdomain.com.", replacement=>"For more information, contact Holly Smith." },
        { search=>"For more information, contact Holly Smith at\nhollysmith\@nosuchdomain.com.", replacement=>"For more information, contact Holly Smith." },
        { search=>"For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith\@nosuchdomain.com", replacement=>"For more information, contact Holly Smith." },
    ];

    my $json = JSON::PP->new->pretty;
    print $json->encode($data);

    my $yaml = YAML::PP->new;
    print $yaml->dump_string($data);

    __END__

    [
       {
          "search" : "Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain.com.",
          "replacement" : "For more information, contact Holly Smith."
       },
       {
          "replacement" : "For more information, contact Holly Smith.",
          "search" : "For more information, contact Holly Smith at\nhollysmith@nosuchdomain.com."
       },
       {
          "replacement" : "For more information, contact Holly Smith.",
          "search" : "For more information, contact Holly Smith at (800) 555-1212 or\nvia email to hollysmith@nosuchdomain.com"
       }
    ]
    ---
    - replacement: For more information, contact Holly Smith.
      search: |-
        Reach Holly Smith for help by sending an email
        to hollysmith@nosuchdomain.com.
    - replacement: For more information, contact Holly Smith.
      search: |-
        For more information, contact Holly Smith at
        hollysmith@nosuchdomain.com.
    - replacement: For more information, contact Holly Smith.
      search: |-
        For more information, contact Holly Smith at (800) 555-1212 or
        via email to hollysmith@nosuchdomain.com
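For completeness, a hedged sketch of the reverse direction: reading such rules back with JSON::PP and applying them. The JSON document here is a shortened, hypothetical example held in a here-doc; in practice it would presumably come from a file. Because the JSON decoder turns `\n` into a real newline, `\Q...\E` can be used safely on the search string.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical rules document. The single-quoted here-doc keeps the
# backslash-n as two characters, which is what a JSON file would contain.
my $rules_json = <<'END_JSON';
[
  { "search": "Holly Smith at\nhollysmith@nosuchdomain.com.",
    "replacement": "Holly Smith." }
]
END_JSON

# decode() turns the JSON escape \n into a real newline character.
my $rules = JSON::PP->new->decode($rules_json);

my $text = "For more information, contact Holly Smith at\nhollysmith\@nosuchdomain.com.\n";

for my $rule (@$rules) {
    # The search string is plain text now, so quote every metacharacter.
    $text =~ s/\Q$rule->{search}\E/$rule->{replacement}/g;
}
print $text;
```

With this arrangement the data file holds plain sentences and nobody has to remember which characters need a backslash.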
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by NetWallah (Canon) on Mar 15, 2021 at 17:36 UTC
Alternative to pryrt's module-based solution: `my $line2; eval "\$line2=\"$line\""; # Now you can print $line2` "Avoid strange women and temporary variables."
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by LanX (Saint) on Mar 16, 2021 at 04:28 UTC
Interpolation happens only in double-quoted strings. This includes `qq`, here-docs, and the regex arguments of m//, s/// and split. There are plenty of alternatives to DATA, which is itself an embedded file with "raw" data. A `\n` there is just two characters. Better to use the appropriate ASCII codes for a newline. Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery
Re^2: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by davebaker (Pilgrim) on Mar 16, 2021 at 12:25 UTC
So, to be able to search for this 2-line string in a document that uses Unix-type line endings:

    Reach Holly Smith for help by sending an email
    To hollysmith@nosuchdomain.com.

.... the character-specific way to do it (rather than using \n) would have been to store the "old string"~"new string" line in a __DATA__ block, like this, using the Unix-type newline character (code point 10, or 0A in hexadecimal):

    __DATA__
    Reach Holly Smith for help by sending an email\x{0A}to hollysmith@nosuchdomain.com.~For more information, contact Holly Smith.

...where the preceding line is a single line even though it almost certainly will wrap on this web page; is that what you mean? I think I'm coming to see the point of being able to use \n in a regular expression: it avoids the need to specify what the characters are that represent a new line in the operating system being used to write and read the file, and it's a memorable way to avoid having to look up the code point for those characters. Plus a manual entry of a new line at the keyboard while creating a line-by-line data file would gunk up the use of while ( <DATA> )
Re^3: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by AnomalousMonk (Archbishop) on Mar 16, 2021 at 21:58 UTC
Update: davebaker changed this post (without citation!) while I was composing this reply.

... a data block or file that contains "old string"~"new string" lines: `__DATA__` `... an email\x{0A}to ....~For more information, ....` (Is that what you mean?)

No. (Well, at least that's not the point I would make. :) The point I would make is that the string you read from a `__DATA__` or `__END__` block or from a regular file is essentially the same as a single-quoted string defined in a script, and such a string can be used directly as a regex search pattern:

    Win8 Strawberry 5.8.9.5 (32) Tue 03/16/2021 17:26:59
    C:\@Work\Perl\monks\davebaker
    >perl -Mstrict -Mwarnings
    my $s = 'foo
    bar';
    print "A: >>$s<< \n";
    my $search = 'foo\nbar';  # note single quotes!
    print "B: >>$search<< \n";  # \n is '\n'
    my $replace = "hoo-ray";  # can be single/double quotes
    $s =~ s/$search/$replace/;  # no /g - one replacement only
    print "C: >>$s<< \n";
    ^Z
    A: >>foo
    bar<<
    B: >>foo\nbar<<
    C: >>hoo-ray<<

If the search string/pattern is held in a file, the process is similar, except you usually need to chomp the string before you use it:

    Win8 Strawberry 5.8.9.5 (32) Tue 03/16/2021 17:28:26
    C:\@Work\Perl\monks\davebaker
    >type search.dat
    foo\nbar

    >perl -Mstrict -Mwarnings
    my $s = 'foo
    bar';
    print "A: >>$s<< \n";
    open my $fh, '<', 'search.dat' or die "opening: $!";
    chomp(my $search = <$fh>);
    print "B: >>$search<< \n";  # \n is essentially '\n'
    my $replace = "hoo-ray";
    $s =~ s/$search/$replace/;
    print "C: >>$s<< \n";
    ^Z
    A: >>foo
    bar<<
    B: >>foo\nbar<<
    C: >>hoo-ray<<

I think that if you use `'\n'` (or the equivalent from a file) in a regex search pattern and if you use default I/O for reading all your files, then you will be able to do automatic text editing in an OS-agnostic way, at least across the Windows/nix iron curtain. The `'\n'` sequence in a regex* is the universal representation of a default newline.
(In general, I think use of `qr//` is definitely best practice for defining search regexes in a script, not single- or double-quoted strings, but if you're reading from a file, you're kinda stuck with what you've got.)
Re^3: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by LanX (Saint) on Mar 16, 2021 at 13:03 UTC
I'm not sure I understand your question. If you want OS-sensitive line breaks in `DATA` you'll need to translate them. I'd suggest using plain "enter" in the input and `$output = join "\n", <DATA>`. In case of OS problems when reading DATA by line, just adjust `$/` beforehand. (Never happened to me.)* "Is that what you mean?" I suggested using here-docs instead of DATA. They are interpolated by default. Update *) Actually, this problem can't arise, because Perl reads DATA like its own code; it's the same filehandle. I.e. the script won't run if there were any problems with OS-specific line endings.
Re: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by davebaker (Pilgrim) on Mar 16, 2021 at 22:36 UTC
The Take-away for Me

These comments have been so helpful. I've learned some important things about interpolation, and DATA blocks. At the end of the day, it seems the use of a parseable DATA block was probably too lazy and fancy at the same time, just to be able to write down and then use an easily editable list of regular expressions, which I would use a text editor to change in the future. The more straightforward way is to code the list of regular expressions into a Perl data structure in the script, in much the same way that one would code the value of a constant towards the top of the script, so that it can be spotted and edited in the future.

    my @reg_exes = (
        q|Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.|,
        q|For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.|,
        q|For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com|,
    );

    my $new_string = 'For more information, contact Holly Smith.';

    # Then, after creating a list of eligible files in a particular directory
    # and then opening each file to slurp its contents into $slurped_file
    # (code not shown here)...

    for my $reg_ex ( @reg_exes ) {
        if ( $slurped_file =~ s/$reg_ex/$new_string/g ) {
            print "Matched one or more occurrences of reg ex '$reg_ex' and substituted '$new_string' each time\n";
        }
    }

    # Then store the changed file, etc.
Given that the substitution operation potentially will be applied to several thousand files, it's probably better to precompile the regular expressions:

    my @patterns = (
        q|Reach Holly Smith for help by sending an email\nto hollysmith@nosuchdomain\.com\.|,
        q|For more information, contact Holly Smith at\nhollysmith@nosuchdomain\.com\.|,
        q|For more information, contact Holly Smith at \(800\) 555-1212 or\nvia email to hollysmith@nosuchdomain\.com|,
    );

    my @reg_exes;
    for my $pat ( @patterns ) {
        push @reg_exes, qr/$pat/;
    }

    my $new_string = 'For more information, contact Holly Smith.';

    # Then, after creating a list of eligible files in a particular directory
    # and then opening each file to slurp its contents into $slurped_file
    # (code not shown here)...

    for my $reg_ex ( @reg_exes ) {
        if ( $slurped_file =~ s/$reg_ex/$new_string/g ) {
            print "Matched one or more occurrences of reg ex '$reg_ex' and substituted '$new_string' each time\n";
        }
    }

    # Then store the changed file, etc.
Re^2: Can't get \n or other character/translation escapes to interpolate if originally read from a data file by LanX (Saint) on Mar 16, 2021 at 22:56 UTC
I'd recommend

    my @here_docs = ( <<'__RE__', <<'__RE__', <<'__RE__' );
    Reach Holly Smith for help by sending an email
    to hollysmith@nosuchdomain.com.
    __RE__
    For more information, contact Holly Smith at
    hollysmith@nosuchdomain.com.
    __RE__
    For more information, contact Holly Smith at (800) 555-1212 or
    via email to hollysmith@nosuchdomain.com
    __RE__

in combination with `quotemeta`. I'm not sure what the `q|...\n...|` in single quotes in your code is supposed to do; do you plan to match `\` and `n` literally? Update: OK, I got the literal `\n` part.
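A minimal runnable sketch of the here-doc plus `quotemeta` combination, with `qr//` precompilation added. The sample `$slurped_file` content and the variable names are assumptions made for illustration, and only two of the three sentences are shown to keep it short.

```perl
use strict;
use warnings;

# Two single-quoted here-docs: nothing is interpolated, and the line
# breaks are real newlines typed into the source.
my @targets = ( <<'__RE__', <<'__RE__' );
Reach Holly Smith for help by sending an email
to hollysmith@nosuchdomain.com.
__RE__
For more information, contact Holly Smith at (800) 555-1212 or
via email to hollysmith@nosuchdomain.com
__RE__

chomp @targets;    # drop the final newline each here-doc adds

# \Q...\E (quotemeta) makes ".", "(", ")" and "@" literal;
# qr// precompiles each pattern once.
my @regexes;
for my $target (@targets) {
    push @regexes, qr/\Q$target\E/;
}

my $new_string   = 'For more information, contact Holly Smith.';
my $slurped_file = "Intro.\nReach Holly Smith for help by sending an email\n"
    . "to hollysmith\@nosuchdomain.com.\nOutro.\n";

# s///g returns the number of substitutions made (false if none).
my $count = 0;
for my $regex (@regexes) {
    $count += ( $slurped_file =~ s/$regex/$new_string/g ) || 0;
}
print "replacements made: $count\n";
print $slurped_file;
```

Here the targets are written exactly as they appear in the newsletters, so no manual escaping of periods or parentheses is needed.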
