G'day tel2,
"I'm guessing I might need to "use utf8", ..."
Sorry, but that would be a bad guess.
The documentation for the utf8 pragma states, in emboldened text:
"Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."
Your basic problem here is that the filehandle, FILE, doesn't know that the data passing through it is UTF-8.
Example of what's happening:
$ perl -Mutf8 -wE 'say "e-acute: é; u-acute: ú"'
e-acute: ?; u-acute: ?
Here are three ways to address this problem:
-
Use the binmode function, e.g.
$ perl -Mutf8 -wE 'binmode STDOUT => ":utf8"; say "e-acute: é; u-acute: ú"'
e-acute: é; u-acute: ú
-
Use the open pragma, e.g.
$ perl -Mutf8 -wE 'use open OUT => qw{:utf8 :std}; say "e-acute: é; u-acute: ú"'
e-acute: é; u-acute: ú
-
Use the 3-argument form of the open function
and specify the encoding in the mode. Something like this:
open my $fh, '>:encoding(UTF-8)', $filename
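To make the third option concrete, here is a minimal round-trip sketch. It assumes a scratch file from File::Temp (not part of your script) and spells é as "\x{e9}" so the example doesn't depend on how the source file itself is saved. Treat it as an illustration of the encoding layer, not a drop-in fix for your CGI script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "Cliché" as 6 characters; \x{e9} is e-acute (U+00E9).
my $text = "Clich\x{e9}";

# Assumption: a throwaway scratch file, used only for this demonstration.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the :encoding(UTF-8) layer turns characters into UTF-8 bytes.
open my $out_fh, '>:encoding(UTF-8)', $filename
    or die "Can't write '$filename': $!";
print {$out_fh} $text;
close $out_fh;

# Reading: the same layer turns those bytes back into characters.
open my $in_fh, '<:encoding(UTF-8)', $filename
    or die "Can't read '$filename': $!";
my $round_trip = <$in_fh>;
close $in_fh;

my $size = -s $filename;
print "characters: ", length($round_trip), ", bytes on disk: $size\n";
# prints: characters: 6, bytes on disk: 7
```

The character count and the byte count differ because é is one character but two bytes in UTF-8.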
Here are some recommendations for your code.
These are unrelated to the UTF-8 issue.
-
Let Perl tell you about problems.
Start using the strict pragma
and the warnings pragma.
-
Your code is littered with package variables: $cgi, an object reference; $f1, a string; FILE, a filehandle; and so on.
These are all global and suffer from the same problems as all global variables.
Start using lexical variables, and control their scope, for far less error-prone code.
There's a lot of information about this in perlsub;
the "Private Variables via my()"
section would be a good place to start.
-
Don't use indirect object syntax, e.g. code like new CGI.
Here's what perlobj: Invoking Class Methods says,
in emboldened text, at the start of the Indirect Object Syntax section:
"Outside of the file handle case, use of this syntax is discouraged as it can confuse the Perl interpreter. See below for more details."
-
Start using lexical filehandles with the 3-argument form of the open function. See that document for more about this.
-
Hand-crafting I/O die messages is time-consuming and error-prone.
Let Perl do this task for you with the autodie pragma.
You can then write code like this:
use autodie;
...
open my $in_fh, '<', $infile;
open my $out_fh, '>', $outfile;
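Pulling those recommendations together, a short sketch might look like the following. Local::Greeter is a hypothetical stand-in class so the example runs without CGI, and the file lives in a temporary directory; neither comes from your script. The `package NAME { ... }` form assumes Perl 5.14 or later.

```perl
#!/usr/bin/perl
use strict;                       # catches undeclared (accidental package) variables
use warnings;
use autodie;                      # open/close throw a descriptive error on failure
use File::Temp qw(tempdir);

# Hypothetical stand-in class, only here to show the method-call syntax.
package Local::Greeter {
    sub new   { my ($class, %args) = @_; return bless {%args}, $class }
    sub greet { my ($self) = @_; return "G'day, $self->{name}" }
}

# Direct method call, rather than the indirect "new Local::Greeter(...)".
my $greeter = Local::Greeter->new(name => 'tel2');

my $dir     = tempdir(CLEANUP => 1);
my $outfile = "$dir/greeting.txt";

{
    # Lexical filehandle, scoped to this block; no hand-written "or die".
    open my $out_fh, '>', $outfile;
    print {$out_fh} $greeter->greet, "\n";
    close $out_fh;
}

open my $in_fh, '<', $outfile;
my $line = <$in_fh>;
close $in_fh;
chomp $line;
print "$line\n";

# autodie raises the error for us when an open fails.
my $ok = eval { open my $fh, '<', "$dir/no-such-file"; 1 };
print "open failed as expected: $@" unless $ok;
```

Every variable here is declared with my, so a typo in a variable name becomes a compile-time error under strict instead of a silent new global.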
G'day from across the ditch, Ken. You're talkin' my language, mate.
Thanks very much for your time and all your tips.
The reason I wrote $f1 to the file and read it back into $f2 was just to make sure the variables weren't changing in the process, and from what I can tell they aren't. I'm struggling to understand how this issue is about writing/reading the file. My reasons are:
1. If I remove my "quick hack" and change the webpage's "Output: $f1" line to "Output: $f2" (which it was meant to be originally - sorry), the e-acute appears on the webpage correctly.
2. If I print $f1 (which has not been read from a file) to the PDF (e.g. $text->text("PDF Output:$f1=$f2");) no acutes appear correctly.
3. If I write $f1 to a file as you have suggested, and read it back into $f3, it then contains more bytes than $f1, and printing $f3 to the PDF still doesn't print e-acute properly.
Below is some modified code which demonstrates this (sorry, I haven't brought it into the general coding standards you've suggested at this stage).
#!/usr/bin/perl
use lib "/home/tospeirs/perl5/lib/perl5";
use CGI;
use PDF::API2;
use bytes;
use constant mm => 25.4 / 72;
$cgi = new CGI;
$f1 = $cgi->param(f1);
if (defined($f1))
{
open (FILE, ">utf8_test1.out") or die "Can't open outfile";
print FILE $f1;
close FILE;
open (FILE, "<utf8_test1.out") or die "Can't open infile";
$f2 = <FILE>;
close FILE;
open my $fh, '>:encoding(UTF-8)', 'utf8_test2.out';
print $fh $f1;
close $fh;
open my $fh, '<:encoding(UTF-8)', 'utf8_test2.out';
$f3 = <$fh>;
close $fh;
$lengths = "Lengths: f1=" . bytes::length($f1) . ", f2=" . bytes::length($f2) . ", f3=" . bytes::length($f3);
$cmp = ($f1 eq $f2) ? 'f1=f2' : 'f1<>f2';
$cmp .= ($f1 eq $f3) ? ', f1=f3' : ', f1<>f3';
$pdf = PDF::API2->new();
$font1 = $pdf->corefont('Arial');
$page = $pdf->page; # Add blank page
$page->mediabox(210/mm, 297/mm);
$text = $page->text();
$text->font($font1, 28);
$text->translate(5/mm ,280/mm);
# A quick hack to handle a couple of special chars
#$f2 =~ s/\303\251/\351/g; # e-acute
#$f2 =~ s/\303\272/\372/g; # u-acute
$text->text("PDF Output:$f1=$f2=$f3");
$pdf->saveas('utf8_test1.pdf');
}
print <<EOF;
Content-Type: text/html; charset=utf-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
<title>Test UTF-8</title>
<meta charset='UTF-8'>
</head>
<body>
<form method='post'>
Input: <input type='text' name='f1' value='$f1'>
<br>
<input type='submit' name='submit' value='Submit'>
<br>
Output f2: $f2
<br>
Output f3: $f3
<br>
$lengths
<br>
$cmp
</form>
</body>
</html>
EOF
This is what I see on the webpage after I submit "Cliché.":
Input: Cliché.
Submit
Output f2: Cliché.
Output f3: Cliché.
Lengths: f1=8, f2=8, f3=10
f1=f2, f1<>f3
And the PDF ends up containing this:
PDF Output:Cliché.=Cliché.=Cliché.
As you can see, none of those 3 came out right in the PDF, and the $f3 looks extra long, as if it's been double-encoded or something. Check this octal dump out:
$ od -c utf8_test1.out
0000000 C l i c h 303 251 .
$ od -c utf8_test2.out
0000000 C l i c h 303 203 302 251 .
Any ideas?
Thanks.
tel2
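The two od dumps are consistent with double encoding. The sketch below reproduces the same byte sequences with the Encode module, under the assumption that $f1 arrives from CGI as 8 raw UTF-8 bytes rather than 7 decoded characters (which the reported length f1=8 suggests). The octal_dump helper is hypothetical and only mimics how od -c shows high bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show each byte in octal, similar to "od -c" for high bytes.
sub octal_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%03o', ord($_) } split //, $bytes;
}

# Assumption: this is what $f1 holds, the 8 raw UTF-8 bytes of "Cliché."
my $raw_bytes = "Clich\xC3\xA9.";

# Encoding bytes that are already UTF-8 treats \xC3 and \xA9 as two
# separate Latin-1 characters and encodes each of them again (10 bytes).
my $double_encoded = encode('UTF-8', $raw_bytes);

# Decoding first gives 7 characters; encoding those gives the 8 bytes back.
my $characters   = decode('UTF-8', $raw_bytes);
my $encoded_once = encode('UTF-8', $characters);

printf "raw bytes      (%2d): %s\n", length($raw_bytes),      octal_dump($raw_bytes);
printf "double encoded (%2d): %s\n", length($double_encoded), octal_dump($double_encoded);
printf "decoded chars  (%2d)\n",     length($characters);
printf "encoded once   (%2d): %s\n", length($encoded_once),   octal_dump($encoded_once);
```

The double-encoded line ends in 303 203 302 251 056, matching utf8_test2.out, while the encoded-once line ends in 303 251 056, matching utf8_test1.out. That points at decoding the form input once, before it goes through any :encoding(UTF-8) layer.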
#!/perl
use strict;
use warnings;
use CGI;
use CGI::Carp 'fatalsToBrowser';
use PDF::API2;
use Encode;
my $cgi = CGI->new;
my $f1 = $cgi->param('f1');
my $f2 = decode('UTF-8', $f1 );
open OUT,'>','c:/temp/web/pdf.txt' or die; # change path to suit
print OUT "$f1 $f2";
close OUT;
my $pdf = PDF::API2->new()->mediabox('A4');
my $text = $pdf->page->text;
my $font1 = $pdf->corefont('Arial');
$text->font($font1, 36);
$text->translate(100,500);
$text->text("f1 = $f1");
$text->translate(100,600);
$text->text("f2 = $f2");
$pdf->saveas('c:/temp/web/utf8_test1.pdf'); # change path to suit
print <<EOF;
Content-Type: text/html; charset=UTF-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
<title>Test UTF-8</title>
<meta charset="UTF-8">
</head><body>
$f1 $f2
<form method="post">
Input: <input type="text" name="f1" value="$f1"><br>
<input type="submit" name="submit" value="Submit">
</form></body></html>
EOF
poj