
ftherese has asked for the wisdom of the Perl Monks concerning the following question:

I've been using Perl (5.18.2) with PurePerl in a chroot to manipulate UTF-8 text files containing accents for a few years. I just recently decided to bring my code into a more modern setting (5.26.1) so that I could more easily invite collaborators. I'm not receiving any error messages, but now I'm getting blank lines instead of my previous output. How should I begin debugging what I think to be parsing or encoding problems? I checked the encoding and it appears to match in both the chroot and the new environment. The PurePerl mod installed in the new environment with no errors.

Re: Parsing Problems (updated)
by haukex (Archbishop) on May 13, 2019 at 16:08 UTC

    I'd suggest creating a Short, Self-Contained, Correct Example - i.e. reducing both the input file and the code down to the bare minimum needed to reproduce the problem. That should help you narrow down the problem, and it also gives you something to post here. To remove all ambiguity, also use hexdump or od to show the input files, e.g. hexdump -C input.txt or od -tx1c input.txt*, and use Devel::Peek to show the data once it is read into Perl. For example:

        $ hexdump -C test.txt
        00000000  48 e2 82 ac 6c 6c 6f 2c  20 57 c3 b6 72 6c 64 21  |H...llo, W..rld!|
        00000010  0a                                                |.|
        00000011
        $ cat test.pl
        use warnings;
        use strict;
        use open qw/:std :encoding(UTF-8)/;
        use Devel::Peek;
        while (<>) {
            chomp;
            Dump($_);
        }
        $ perl test.pl test.txt
        SV = PV(0x55b66e50c080) at 0x55b66e547398
          REFCNT = 1
          FLAGS = (POK,pPOK,UTF8)
          PV = 0x55b662e533d0 "H\342\202\254llo, W\303\266rld!"\0 [UTF8 "H\x{20ac}llo, W\x{f6}rld!"]
          CUR = 16
          LEN = 81

    Update: To look at the input file, you might also be interested in my script enctool.

    * Update 2: On Windows, one simple open-source hex dump tool (of many) can be downloaded here: https://github.com/hollasch/hex/releases/download/v1.1.0/hex.exe

      Thank you for the suggestion. I have now ruled out the thought that this is a parsing error. I now believe it has something to do with how a function is being called:

          $temp = eval qq~modes::~.$ARGV[2].qq~::flex(/$line);~;

      It's supposed to run a subroutine from a Perl module file in a directory (lib, opendir, readdir). Has something changed with package, lib, eval, or qq~? I'm trying to figure out how to distill my example.
        Has something changed with package, lib, eval, or qq~?

        I'm not certain, but there may have been small changes to how eval handles Unicode. However, evalbytes was introduced in 5.16, which is before the 5.18 that you're migrating away from. Other than that, based on this little piece of code, I don't see anything that might be different between the two versions. As for whether something has changed with Text::Unaccent::PurePerl, you'd have to check which version you have installed in both environments (perl -MText::Unaccent::PurePerl -le 'print $Text::Unaccent::PurePerl::VERSION').

        If you think you're having trouble with eval, then it's best to build its argument first, store it in a variable, use Data::Dumper or Data::Dump to show it, and check eval for errors using the pattern eval "...; 1" or die "eval failed: $@" (see Bug in eval in pre-5.14). (Update before posting: I see haj made a similar point.)
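
        As a rough sketch of that pattern: the values of $mode and $line below are hypothetical stand-ins, and no modes::example package is defined here, so the eval is expected to fail and the warning should say why.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show non-printable and non-ASCII characters as escapes in the dump.
$Data::Dumper::Useqq = 1;

# Hypothetical stand-ins for the real script's values.
my $mode = 'example';
my $line = "W\x{f6}rld";

# Build the code string first so it can be inspected before it runs.
# \$line keeps the variable name in the string instead of interpolating its value.
my $code = "modes::${mode}::flex(\$line)";
print Dumper($code);

# Appending "; 1" makes eval return true on success,
# so a false result always means the eval failed.
my $temp;
my $ok = eval "\$temp = $code; 1";
warn "eval failed: $@" unless $ok;
```

        With the real module loaded, the same lines should leave the result in $temp and print no warning.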

        However, I would strongly recommend against using eval in the first place: building Perl code from strings and trying to run it can be quite brittle, and in many cases even a major security risk. If you were to show more context (SSCCE), we could most likely suggest an alternative without eval (Update: Yep!).

        Two things stand out here:

        • $ARGV[2] might contain un-decoded UTF-8 characters in "modern" terminals. File systems with UTF-8 characters behave differently, depending on the platform (Windows/Unix).
        • (/$line) looks weird. That needs really special values for $line to produce valid Perl!
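
        A minimal sketch of handling the first point, assuming the terminal passes arguments as UTF-8 bytes. The helper names decode_arg and is_valid_mode are made up for illustration, and restricting mode names to plain identifiers is an assumption about what the script needs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw command-line bytes into Perl characters.
# Assumes UTF-8 input; FB_CROAK makes decode die on malformed bytes.
sub decode_arg {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, Encode::FB_CROAK());
}

# Accept only plain identifiers as mode names, so nothing unexpected
# ends up in a package name or in a string passed to eval.
sub is_valid_mode {
    my ($name) = @_;
    return defined $name && $name =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/;
}

# Only runs when a third argument was actually supplied.
if (@ARGV > 2) {
    my $mode = decode_arg($ARGV[2]);
    die "Invalid mode name\n" unless is_valid_mode($mode);
    print "Using mode: $mode\n";
}
```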

        One idea is to print the string you're evaling and, of course, to check whether eval produced an error in $@:

            my $code = qq~modes::~ . $ARGV[2] . qq~::flex(/$line);~;
            warn $code;
            eval $code;
            warn "Eval failed: '$@'" if $@;

        I would really, really suggest replacing the eval() with something like

            my $name_space = "modes::$ARGV[2]";
            my $code = $name_space->can( 'flex' )
                or die "$name_space does not implement flex()";
            $temp = $code->( $line );

        This assumes that, when typing up your example, you accidentally reversed a backslash, i.e. that the original code has \$line rather than /$line.

        When obscure code fails it is really hard to debug. I infer from your eval() example that what you are trying to do is this:

        1. Run a script whose third argument ($ARGV[2]) is the name of a processor for your input.
        2. Process each line of the file using a subroutine named flex() in a Perl module named "modes::$ARGV[2]"
        3. This module has already been loaded.

        If these are correct, the above code should implement it in an easier-to-read-and-debug manner. The three statements compute the name of the module, get the address of the subroutine in that module (failing if it can not be found), and call the subroutine, passing it the line from the file, and storing the result in $temp.
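
        To try the idea in isolation, here is a self-contained sketch. The modes::upper package and the find_flex helper are hypothetical; the package is defined inline only so that the example runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical mode module, defined inline so the sketch is self-contained.
# In the real script this would live in its own file and already be loaded.
package modes::upper {
    sub flex {
        my ($line) = @_;
        return uc $line;
    }
}

# Look up flex() in the package for the given mode and
# return a code reference, or die if the package lacks it.
sub find_flex {
    my ($mode) = @_;
    my $name_space = "modes::$mode";
    my $code = $name_space->can('flex')
        or die "$name_space does not implement flex()\n";
    return $code;
}

my $flex = find_flex('upper');
my $temp = $flex->('hello');
print "$temp\n";    # prints "HELLO"
```

        Because the lookup dies with a clear message, a mistyped or missing mode name should show up immediately rather than as silent blank output.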

        This is off-topic for your question, but all your Perl should have use strict; and use warnings; near the top of the file. If your script does not do this, all sorts of bugs could lurk. Of course, if you just slam these into a legacy script it may find all sorts of things -- more than you are able to fix in one sitting. But it's worth a try -- one of the warnings/errors may tell you what your problem is.

Re: Parsing Problems
by haj (Vicar) on May 13, 2019 at 16:07 UTC
    Please excuse my ignorance: What is "PurePerl"?
      Sorry: Text::Unaccent::PurePerl

        Thanks!

        In this case I'd stick with the suggestion provided by haukex: Dump your input and output files with a tool like hexdump. There is a chance that while changing your version of Perl you might also have changed "something else" in your environment (e.g. a console default encoding).
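
        One way to compare the two environments is to print the encoding-related settings in each and diff the results. This is only a sketch: the encoding_report helper is made up for illustration, the set of settings it collects is a guess at what might matter, and I18N::Langinfo may not be usable on every platform.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# Collect encoding-related settings so two environments can be compared.
sub encoding_report {
    my %report = (
        perl_version  => sprintf('%vd', $^V),
        codeset       => langinfo(CODESET),    # character encoding of the locale
        LANG          => $ENV{LANG}         // '(unset)',
        LC_ALL        => $ENV{LC_ALL}       // '(unset)',
        PERL_UNICODE  => $ENV{PERL_UNICODE} // '(unset)',
        stdout_layers => join(' ', PerlIO::get_layers(*STDOUT)),
    );
    return \%report;
}

# Print one setting per line, sorted by name, for easy diffing.
my $report = encoding_report();
printf "%-14s %s\n", $_, $report->{$_} for sort keys %$report;
```

        Run it once inside the chroot and once in the new environment; any line that differs is a candidate for the "something else" that changed.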