Difficulty with UTF-8 and file contents

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I have two questions.

The first is how to get \*DATA, <DATA>, __DATA__ or whatever into the open() function below. I think I need that in order to illustrate the real problem which might be with the open() function. Or it might not be.

The real question is that the final result from print $xhtml->as_XML_indented; in the script below is showing mangled text and not presenting it as UTF-8. Whether that is happening during reading or printing I do not know and ask your collective wisdom on how to get the script to produce UTF-8 in its final result. It should show "smart" quotes around the "Def" string.

#!/usr/bin/perl

use utf8;
use HTML::TreeBuilder::XPath;

use warnings;
use strict;


my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);

my $filehandle;
open ($filehandle, "< :encoding(UTF-8)", \*DATA)
    or die("Could not open file 'DATA' : error: $!\n");

# parse UTF-8
$xhtml->parse_file($filehandle)
    or die("Could not parse file handle for 'DATA' : $!\n");

close ($filehandle);

print $xhtml->as_XML_indented;
$xhtml->delete;

exit(0);


__DATA__

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Foo</title>
  </head>
  <body>
    <p><a href="https://example.com/3/">Abc
      â&#128;&#152;Defâ&#128;&#153; (GHI)</a></p>
  </body>
</html>

Re: Difficulty with UTF-8 and file contents
by haukex (Archbishop) on Apr 13, 2020 at 08:52 UTC

I'm going to take a guess that, because unfortunately PerlMonks' <code> tags do HTML-escaping of Unicode characters, what you pasted in the HTML is actually â€˜Defâ€™, which shows that you appear to already have encoding issues somewhere - the original string is probably ‘Def’.

k-mx already pointed out that the DATA filehandle doesn't need to be opened, and use utf8; already causes the source code, including the __DATA__ section, to be read as UTF-8 (see Special Literals). The following works fine for me, that is, the source file and the output are both UTF-8 (and I've made sure to work around PerlMonks' Unicode oddities):

#!/usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw/:std :utf8/;
use HTML::TreeBuilder::XPath;

my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);
$xhtml->parse_file(*DATA) or die $!;
print $xhtml->as_XML_indented;

__DATA__

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Foo</title>
  </head>
  <body>
    <p><a href="https://example.com/3/">Abc
      ‘Def’ (GHI)</a></p>
  </body>
</html>

Note that I added the use open qw/:std :utf8/; because the script is also printing Unicode strings. So basically, I think that you need to inspect your source file's encoding (my script enctool might be helpful).
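
The guess above, that the pasted text is UTF-8 bytes misread under a single-byte encoding, can be checked with a small round trip. This is only a sketch: it assumes the misreading was ISO-8859-1 (which matches the &#128; and &#152; entities in the original post), and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The intended text: curly quotes U+2018 and U+2019 around "Def".
my $original = "\x{2018}Def\x{2019}";

# Encode to UTF-8 bytes, then misread those bytes as ISO-8859-1.
# Each byte becomes one character, which is the mangled form.
my $bytes   = encode('UTF-8', $original);
my $mangled = decode('ISO-8859-1', $bytes);
printf "mangled:  %vX\n", $mangled;   # E2.80.98.44.65.66.E2.80.99

# Reversing the two steps recovers the intended characters.
my $repaired = decode('UTF-8', encode('ISO-8859-1', $mangled));
printf "repaired: %vX\n", $repaired;  # 2018.44.65.66.2019
```

If the characters in your file look like the first line of output, the data has most likely been read as bytes somewhere and never decoded.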


Re: Difficulty with UTF-8 and file contents
by k-mx (Scribe) on Apr 13, 2020 at 06:39 UTC

DATA is already opened. I don't know the exact page with information about __DATA__, but I think perldoc SelfLoader will be enough for you.

P.S. You can use perldoc perltoc to navigate through documentation in future.


Re^2: Difficulty with UTF-8 and file contents

by Athanasius (Archbishop) on Apr 13, 2020 at 06:59 UTC

I don't know exact page with information about __DATA__

See perldata#Special-Literals. For whatever reason, this “Special Literals” documentation appears in the Scalar value constructors section of perldata.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^3: Difficulty with UTF-8 and file contents

by mldvx4 (Hermit) on Apr 13, 2020 at 10:47 UTC

Thanks, everybody.

It came down to some gradual, one-step-at-a-time debugging combined with your advice above.

The wrong code which caused the problem:

    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->parse_file($file)
        or die("Could not parse '$file' : $!\n");

The code which prevented the mutilation of the data:

    . . .
    use open qw/:std :utf8/;
    . . .

    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    my $filehandle;
    open ($filehandle, "<", $file)
        or die("Could not open file '$file' : error: $!\n");
    $xhtml->parse_file($filehandle)
        or die("Could not parse file handle for '$file' : $!\n");

So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?
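
The difference between a decoded and an undecoded read can be observed without HTML::TreeBuilder at all. The following is a minimal sketch, not a statement about the module's internals: the temporary file, its contents, and the helper first_line are made up for illustration, and the explicit :encoding(UTF-8) layer stands in for the default layer that use open qw/:std :utf8/; would apply to a plain "<" open.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for the curly-quoted string to a scratch file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "\xE2\x80\x98Def\xE2\x80\x99\n";
close $tmp_fh or die "Could not close '$path': $!";

# Read the first line of the file through the given open mode.
sub first_line {
    my ($mode) = @_;
    open my $fh, $mode, $path or die "Could not open '$path': $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = first_line('<:raw');              # bytes, no decoding
my $decoded = first_line('<:encoding(UTF-8)');  # characters

printf "raw:     %d chars (%vX)\n", length($raw),     $raw;
printf "decoded: %d chars (%vX)\n", length($decoded), $decoded;
```

With the raw read each quote arrives as three separate one-byte characters; with the decoding layer each quote is a single character, which is what a parser needs in order to emit correct UTF-8 later.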


Re^4: Difficulty with UTF-8 and file contents

by haukex (Archbishop) on Apr 13, 2020 at 17:26 UTC

Re^2: Difficulty with UTF-8 and file contents

by mldvx4 (Hermit) on Apr 13, 2020 at 08:00 UTC

Thanks. I should have posted that part separately, it's only secondary to the main question.

I have seen perldoc SelfLoader and the perldata manual page too. They are rather abstract and I don't quite get how to use the __DATA__ token in the context above. Is there a concrete example somewhere of how to use the main::DATA file handle with the 3-part open() function?


Re: Difficulty with UTF-8 and file contents
by choroba (Cardinal) on Apr 13, 2020 at 13:14 UTC

binmode

binmode *DATA, ':encoding(UTF-8)';
while (<DATA>) {  # This populates $_ with decoded Unicode data, not bytes.
    ...
}

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]


Re^2: Difficulty with UTF-8 and file contents

by ikegami (Patriarch) on Apr 14, 2020 at 01:53 UTC

That's useless. DATA, being the handle used by perl to read the source, is affected by the existing use utf8;.

use if $ARGV[0], "utf8";
printf "%vX\n", scalar(<DATA>);
__DATA__
é

$ perl a.pl 0
C3.A9.A

$ perl a.pl 1
E9.A

The OP's problem appears to be a lack of encoding of the output, not a lack of decoding of the input.
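
The output side can be illustrated the same way. The sketch below prints one decoded character to an in-memory handle, once without and once with an encoding layer; the helper bytes_written is hypothetical, and the in-memory handle merely stands in for STDOUT so the resulting bytes can be inspected.

```perl
use strict;
use warnings;

# Print $text to an in-memory handle opened with $mode and
# return the bytes that ended up in the buffer.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Could not open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Could not close in-memory handle: $!";
    return $buffer;
}

my $text = "\x{E9}";   # one character: e with acute accent

my $no_layer   = bytes_written('>:raw',             $text);
my $with_layer = bytes_written('>:encoding(UTF-8)', $text);

printf "no layer:   %vX\n", $no_layer;    # E9
printf "with layer: %vX\n", $with_layer;  # C3.A9
```

Only the second handle produces valid UTF-8. For STDOUT, use open qw/:std :utf8/; or binmode STDOUT, ':encoding(UTF-8)'; adds the corresponding layer.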


Re: Difficulty with UTF-8 and file contents
by Anonymous Monk on Apr 15, 2020 at 13:19 UTC

haukex hit the nail on the head when he observed that the behavior is not in this user's program, but in the implementation of HTML::TreeBuilder when given a file-name string versus a handle. (And, perhaps this package's behavior is erroneous and should be changed ...)
