in reply to Difficulty with UTF-8 and file contents

DATA is already open, so you don't need to open it yourself. I don't know the exact page that documents __DATA__, but I think perldoc SelfLoader will be enough for you.
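A minimal sketch of what "already open" means in practice, assuming the script file itself is saved as UTF-8 and does not use the utf8 pragma. The sample text and variable names are made up for illustration. There is no open() call; binmode only adds a decoding layer to the handle perl has already opened.

```perl
use strict;
use warnings;

# perl opens DATA itself when the file contains a __DATA__ token,
# so there is no open() here; binmode only sets the encoding layer.
binmode(DATA,   ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

my @lines;
while (my $line = <DATA>) {
    chomp $line;
    push @lines, $line;
}

# Report how many lines were read and echo them back, decoded.
printf "%d lines read\n", scalar @lines;
print "$_\n" for @lines;

__DATA__
naïve café
second line
```

Without the binmode on DATA, the same bytes would be read one byte per character, as if they were Latin-1.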

P.S. You can use perldoc perltoc to navigate the documentation in the future.

Re^2: Difficulty with UTF-8 and file contents
by Athanasius (Archbishop) on Apr 13, 2020 at 06:59 UTC

      Thanks, everybody.

      It came down to some gradual, one-step-at-a-time debugging combined with your advice above.

      The wrong code which caused the problem:

      my $xhtml = HTML::TreeBuilder::XPath->new;
      $xhtml->implicit_tags(1);
      $xhtml->parse_file($file)
          or die("Could not parse '$file' : $!\n");

      The code which prevented the mutilation of the data:

      . . .
      use open qw/:std :utf8/;
      . . .
      my $xhtml = HTML::TreeBuilder::XPath->new;
      $xhtml->implicit_tags(1);
      my $filehandle;
      open ($filehandle, "<", $file)
          or die("Could not open file '$file' : error: $!\n");
      $xhtml->parse_file($filehandle)
          or die("Could not parse file handle for '$file' : $!\n");

      So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?

        So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?

        Yes, that's correct. Note the documentation of parse_file in HTML::TreeBuilder:

        ... When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing. One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. ...
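To make the point in that quote concrete, here is a small sketch that uses only core modules and does not involve HTML::TreeBuilder at all. It writes the UTF-8 bytes for "café" to a scratch file, then reads them back once through a raw (binary) layer and once through an :encoding(UTF-8) layer. The helper name first_line and the sample text are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "café" to a scratch file
# ("\xC3\xA9" is the two-byte UTF-8 encoding of é).
my ($tmp, $file) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "caf\xC3\xA9\n";
close $tmp or die "Could not close '$file': $!\n";

# Read the first line of a file through the given PerlIO layer.
sub first_line {
    my ($path, $layer) = @_;
    open(my $fh, "<$layer", $path)
        or die "Could not open '$path': $!\n";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

# Binary mode: every byte becomes one character, as if it were Latin-1.
my $as_bytes = first_line($file, ':raw');

# Decoding layer: the two bytes C3 A9 become the single character é.
my $as_text = first_line($file, ':encoding(UTF-8)');

printf "raw: %d characters, decoded: %d characters\n",
    length $as_bytes, length $as_text;
# prints "raw: 5 characters, decoded: 4 characters"
```

The raw read is what corrupts the text downstream: the parser sees two Latin-1 characters where the file meant one. Opening the handle yourself with the right layer, as in the working code above, avoids that.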
Re^2: Difficulty with UTF-8 and file contents
by mldvx4 (Hermit) on Apr 13, 2020 at 08:00 UTC

    Thanks. I should have posted that part separately; it's only secondary to the main question.

    I have seen perldoc SelfLoader and the perldata manual page too. They are rather abstract, and I don't quite get how to use the __DATA__ token in the context above. Is there a concrete example somewhere of how to use the main::DATA file handle with the three-argument open() function?