PerlMonks  

Re^2: Difficulty with UTF-8 and file contents

by Athanasius (Archbishop)
on Apr 13, 2020 at 06:59 UTC


in reply to Re: Difficulty with UTF-8 and file contents
in thread Difficulty with UTF-8 and file contents

I don't know the exact page with information about __DATA__

See perldata#Special-Literals. For whatever reason, this “Special Literals” documentation appears in the Scalar value constructors section of perldata.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^3: Difficulty with UTF-8 and file contents
by mldvx4 (Friar) on Apr 13, 2020 at 10:47 UTC

    Thanks, everybody.

    It came down to some gradual, one-step-at-a-time debugging combined with your advice above.

    The wrong code which caused the problem:

        my $xhtml = HTML::TreeBuilder::XPath->new;
        $xhtml->implicit_tags(1);
        $xhtml->parse_file($file)
            or die("Could not parse '$file' : $!\n");

    The code which prevented the mutilation of the data:

        . . .
        use open qw/:std :utf8/;
        . . .
        my $xhtml = HTML::TreeBuilder::XPath->new;
        $xhtml->implicit_tags(1);
        my $filehandle;
        open ($filehandle, "<", $file)
            or die("Could not open file '$file' : error: $!\n");
        $xhtml->parse_file($filehandle)
            or die("Could not parse file handle for '$file' : $!\n");

    So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?

      So if I guess right, the use of a file handle which I have opened myself under the influence of the use open qw/:std :utf8/; pragma forced the data going into HTML::TreeBuilder::XPath to be read as UTF-8?

      Yes, that's correct. Note the documentation of parse_file in HTML::TreeBuilder:

      ... When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing. One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. ...
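
To make the difference concrete, here is a minimal sketch using only core Perl (no HTML::TreeBuilder involved). It writes a short UTF-8 sample to a temporary file, then reads it back once as raw bytes and once through an :encoding(UTF-8) layer. The sample text and the helper name slurp_with_layer are made up for illustration; they are not from the thread or from any module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 encoded sample file (hypothetical content).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "caf\x{e9} \x{263a}";    # "café ☺": 6 characters, 9 bytes as UTF-8
close($out) or die "Could not close '$path': $!\n";

# Hypothetical helper: slurp a file through the given PerlIO layer.
sub slurp_with_layer {
    my ($file, $layer) = @_;
    open(my $fh, "<$layer", $file)
        or die "Could not open file '$file': $!\n";
    local $/;                          # undef separator: read the whole file
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Raw read: every byte becomes one "character" (the Latin-1 view).
my $as_bytes = slurp_with_layer($path, ':raw');

# Decoded read: multi-byte UTF-8 sequences become single characters.
my $as_chars = slurp_with_layer($path, ':encoding(UTF-8)');

printf "raw bytes:     length %d\n", length $as_bytes;
printf "decoded chars: length %d\n", length $as_chars;
```

The raw read sees each byte of a multi-byte sequence as a separate character, which is the kind of mangling described above. A handle opened with an explicit layer, for example open(my $fh, '<:encoding(UTF-8)', $file), should be usable with parse_file in the same way as the handle opened under the use open pragma, and it keeps the encoding choice visible at the point where the file is opened.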
