mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I have two questions.

The first is how to get \*DATA, <DATA>, __DATA__ or whatever into the open() function below. I think I need that in order to illustrate the real problem which might be with the open() function. Or it might not be.

The real question is that the final result from print $xhtml->as_XML_indented; in the script below is showing mangled text and not presenting it as UTF-8. Whether that is happening during reading or printing I do not know and ask your collective wisdom on how to get the script to produce UTF-8 in its final result. It should show "smart" quotes around the "Def" string.

#!/usr/bin/perl
use utf8;
use HTML::TreeBuilder::XPath;
use warnings;
use strict;

my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);

my $filehandle;
open ($filehandle, "< :encoding(UTF-8)", \*DATA)
    or die("Could not open file 'DATA' : error: $!\n");

# parse UTF-8
$xhtml->parse_file($filehandle)
    or die("Could not parse file handle for 'DATA' : $!\n");
close ($filehandle);

print $xhtml->as_XML_indented;
$xhtml->delete;
exit(0);

__DATA__
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Foo</title>
  </head>
  <body>
    <p><a href="https://example.com/3/">Abc â&#128;&#152;Defâ&#128;&#153; (GHI)</a></p>
  </body>
</html>

Replies are listed 'Best First'.
Re: Difficulty with UTF-8 and file contents
by haukex (Archbishop) on Apr 13, 2020 at 08:52 UTC

    I'm going to take a guess that, because unfortunately PerlMonks' <code> tags do HTML-escaping of Unicode characters, what you pasted in the HTML is actually â\x{80}\x{98}Defâ\x{80}\x{99}, which shows that you appear to already have encoding issues somewhere - the original string is probably ‘Def’.

    k-mx already pointed out that the DATA filehandle doesn't need to be opened, and the use utf8; already causes the source code, including the __DATA__ section, to be read as UTF-8 (see Special Literals in perldata). The following works fine for me, that is, the source file and the output are both UTF-8 (and I've made sure to account for PerlMonks' Unicode oddities):

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use utf8;
    use open qw/:std :utf8/;
    use HTML::TreeBuilder::XPath;
    
    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->no_space_compacting(1);
    $xhtml->parse_file(*DATA) or die $!;
    print $xhtml->as_XML_indented;
    
    __DATA__
    
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Foo</title>
      </head>
      <body>
        <p><a href="https://example.com/3/">Abc
          ‘Def’ (GHI)</a></p>
      </body>
    </html>
    

    Note that I added the use open qw/:std :utf8/; because the script is also printing Unicode strings. So basically, I think that you need to inspect your source file's encoding (my script enctool might be helpful).
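
One way to test the guess above is to check whether a decoded string looks like UTF-8 that was read as Latin-1 at some earlier point. The following is only a sketch under that assumption: undo_double_encoding is a hypothetical helper, not part of any module mentioned in this thread, and the sample string is the sequence from the original post written with escapes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: if $chars looks like UTF-8 bytes that were decoded as
# Latin-1, return the repaired string; otherwise return undef.
sub undo_double_encoding {
    my ($chars) = @_;
    return if $chars =~ /[^\x00-\xFF]/;       # contains characters Latin-1 cannot hold
    return unless $chars =~ /[\x80-\xFF]/;    # pure ASCII, nothing to repair
    my $bytes = encode('ISO-8859-1', $chars); # back to the bytes that were misread
    # Strict decoding: dies (and eval yields undef) if the bytes are not valid UTF-8.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
}

# The characters from the original post: a-circumflex, U+0080, U+0098, and so on.
my $mojibake = "\x{E2}\x{80}\x{98}Def\x{E2}\x{80}\x{99}";
my $fixed    = undo_double_encoding($mojibake);

printf "%vX\n", $fixed;    # 2018.44.65.66.2019
```

If the helper returns a string with U+2018 and U+2019 around Def, the data was very likely decoded with the wrong encoding before it reached the script's output.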

Re: Difficulty with UTF-8 and file contents
by k-mx (Scribe) on Apr 13, 2020 at 06:39 UTC

    DATA is already opened. I don't know the exact page with information about __DATA__, but I think perldoc SelfLoader will be enough for you.

    P.S. You can use perldoc perltoc to navigate through documentation in future.

        Thanks, everybody.

        It came down to some gradual, one-step-at-a-time debugging combined with your advice above.

        The wrong code which caused the problem:

        my $xhtml = HTML::TreeBuilder::XPath->new;
        $xhtml->implicit_tags(1);
        $xhtml->parse_file($file)
            or die("Could not parse '$file' : $!\n");

        The code which prevented the mutilation of the data:

        . . .
        use open qw/:std :utf8/;
        . . .
        my $xhtml = HTML::TreeBuilder::XPath->new;
        $xhtml->implicit_tags(1);
        my $filehandle;
        open ($filehandle, "<", $file)
            or die("Could not open file '$file' : error: $!\n");
        $xhtml->parse_file($filehandle)
            or die("Could not parse file handle for '$file' : $!\n");

        So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?
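
The effect asked about here can be reproduced without HTML::TreeBuilder. This is a minimal sketch, assuming no -C switch or PERL_UNICODE setting is in play; the scratch file and variable names are made up for illustration. It reads the same file twice, once outside and once inside the lexical scope of the open pragma. Because the pragma is lexically scoped, it would presumably not reach an open() that a module performs internally when it is handed a file name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for Def in curly quotes to a scratch file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xE2\x80\x98Def\xE2\x80\x99";
close($tmp) or die "close: $!";

my ($raw, $decoded);
{
    # No default layers in this block: the handle yields undecoded bytes.
    open(my $fh, '<', $path) or die "open: $!";
    $raw = do { local $/; <$fh> };
    close($fh);
}
{
    # In this block, open() calls that name no layer themselves get :utf8.
    use open qw/:utf8/;
    open(my $fh, '<', $path) or die "open: $!";
    $decoded = do { local $/; <$fh> };
    close($fh);
}

printf "raw: %d bytes, decoded: %d characters\n", length($raw), length($decoded);
```

The open() call is identical in both blocks; only the pragma in scope differs, which is consistent with the guess in the post above.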

      Thanks. I should have posted that part separately, it's only secondary to the main question.

      I have seen perldoc SelfLoader and the perldata manual page too. They are rather abstract and I don't quite get how to use the __DATA__ token in the context above. Is there a concrete example somewhere of how to use the main::DATA file handle with the 3-part open() function?

Re: Difficulty with UTF-8 and file contents
by choroba (Cardinal) on Apr 13, 2020 at 13:14 UTC
    There's no need to reopen the DATA handle, you can specify its encoding via binmode:
    binmode *DATA, ':encoding(UTF-8)';
    while (<DATA>) {
        # This populates $_ with decoded Unicode data, not bytes.
        ...
    }
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      That's useless. DATA, being the handle used by perl to read the source, is affected by the existing use utf8;.

      use if $ARGV[0], "utf8";
      printf "%vX\n", scalar(<DATA>);
      __DATA__
      é

      $ perl a.pl 0
      C3.A9.A
      $ perl a.pl 1
      E9.A

      The OP's problem appears to be a lack of encoding of the output, not a lack of decoding of the input.
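
The output side of that claim can be illustrated with in-memory handles. This is a sketch only, assuming no -C switch or PERL_UNICODE setting changes the default layers; the variable names are made up. It prints the same decoded character through a handle with and without an encoding layer.

```perl
use strict;
use warnings;

my $char = "\x{E9}";    # the single character e-acute, as a decoded string

# Handle without an encoding layer: the character goes out as one Latin-1 byte.
my $plain = '';
open(my $raw_out, '>', \$plain) or die "open: $!";
print {$raw_out} $char;
close($raw_out);

# Handle with an encoding layer: the character is encoded to UTF-8 on output.
my $utf8 = '';
open(my $enc_out, '>:encoding(UTF-8)', \$utf8) or die "open: $!";
print {$enc_out} $char;
close($enc_out);

printf "without layer: %vX\n", $plain;    # E9
printf "with layer:    %vX\n", $utf8;     # C3.A9
```

A lone E9 byte is not valid UTF-8, so a terminal or browser expecting UTF-8 shows it as mangled text. Adding :std to use open, or calling binmode on STDOUT, puts the same kind of layer on the standard handles.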

Re: Difficulty with UTF-8 and file contents
by Anonymous Monk on Apr 15, 2020 at 13:19 UTC
    haukex hit the nail on the head when he observed that the behavior is not in this user's program, but in the implementation of HTML::TreeBuilder when given a file-name string versus a handle. (And, perhaps this package's behavior is erroneous and should be changed ...)