Write special chars to PDF. UTF8?

tel2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Below is a cut-down version of my code which allows me to enter characters like e-acute (by typing Alt-130) onto a web form, and it writes them back to the web page OK, but when it writes it to a PDF, the acute chars don't appear correctly. I've worked around this for a couple of chars with the "quick hack" you can see in the code, but I'd like to have a proper fix for all special chars. I'm guessing I might need to "use utf8", or something, and I have read some articles on Unicode & UTF8, but haven't worked out what I need to do here yet.

How can I get this code to write the special chars to the PDF file correctly, without the "quick hack"?

Also, am I supposed to have:

Content-Type: text/html; charset=utf-8\n

AND

<meta charset='UTF-8'>

Or what? Seems a bit duplicated.

#!/usr/bin/perl

use CGI;
use PDF::API2;

use constant mm => 25.4 / 72;

$cgi = new CGI;
$f1 = $cgi->param(f1);

#Content-Type: text/html; charset=utf-8\n
# <meta charset='UTF-8'>

if (defined($f1))
{
        open (FILE, ">utf8_test1.out") or die "Can't open outfile";
        print FILE $f1;
        close FILE;

        open (FILE, "<utf8_test1.out") or die "Can't open infile";
        $f2 = <FILE>;
        close FILE;

        $pdf = PDF::API2->new();

        $font1 = $pdf->corefont('Arial');

        $page = $pdf->page;             # Add blank page
        $page->mediabox(210/mm, 297/mm);

        $text = $page->text();

        $text->font($font1, 28);
        $text->translate(20/mm ,280/mm);

        # A quick hack to handle a couple of special chars
        $f2 =~ s/\303\251/\351/g;  # e-acute
        $f2 =~ s/\303\272/\372/g;  # u-acute

        $text->text('PDF Output:' . $f2);

        $pdf->saveas('utf8_test1.pdf');
}

print <<EOF;
Content-Type: text/html; charset=utf-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
        <title>Test UTF-8</title>
        <meta charset='UTF-8'>
</head>
<body>
<form method='post'>
        Input: <input type='text' name='f1' value='$f1'>
        <br>
        <input type='submit' name='submit' value='Submit'>
        <br>
        Output: $f1
</form>
</body>
</html>
EOF

Thanks.

Re: Write special chars to PDF. UTF8? by kcott (Archbishop) on Feb 12, 2016 at 13:07 UTC
G'day tel2,

"I'm guessing I might need to "use utf8", ..."

Sorry, but that would be a bad guess. The documentation for the utf8 pragma states, in emboldened text:

"Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

Your basic problem here is that the filehandle, `FILE`, doesn't know about the UTF-8. Example of what's happening:

    $ perl -Mutf8 -wE 'say "e-acute: é; u-acute: ú"'
    e-acute: ?; u-acute: ?

Here are three ways to address this problem:

1. Use the binmode function, e.g.

    $ perl -Mutf8 -wE 'binmode STDOUT => ":utf8"; say "e-acute: é; u-acute: ú"'
    e-acute: é; u-acute: ú

2. Use the open pragma, e.g.

    $ perl -Mutf8 -wE 'use open OUT => qw{:utf8 :std}; say "e-acute: é; u-acute: ú"'
    e-acute: é; u-acute: ú

3. Use the 3-argument form of the open function and specify the encoding in the mode. Something like this:

    open my $fh, '>:encoding(UTF-8)', $filename

Here are some recommendations for your code. These are unrelated to the UTF-8 issue.

1. Let Perl tell you about problems. Start using the strict pragma and the warnings pragma.

2. Your code is littered with package variables: `$cgi`, an object reference; `$f1`, a string; `FILE`, a filehandle; and so on. These are all global and suffer from the same problems as all global variables. Start using lexical variables, and control their scope, for far less error-prone code. There's a lot of information about this in perlsub; the "Private Variables via my()" section would be a good place to start.

3. Don't use indirect object syntax, e.g. code like `new CGI`. Here's what perlobj: Invoking Class Methods says, in emboldened text, at the start of the Indirect Object Syntax section:

"Outside of the file handle case, use of this syntax is discouraged as it can confuse the Perl interpreter. See below for more details."

4. Start using lexical filehandles with the 3-argument form of the open function. See the documentation for open for more about this.

5. Hand-crafting I/O `die` messages is time-consuming and error-prone. Let Perl do this task for you with the autodie pragma. You can then write code like this:

    use autodie;
    ...
    open my $in_fh, '<', $infile;
    open my $out_fh, '>', $outfile;

Ken
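Several of the suggestions above (lexical filehandles, 3-argument open with an encoding layer, autodie) can be combined in one minimal sketch. It assumes the data is already a character string, and `round_trip` is a hypothetical helper name that does not appear in the original code:

```perl
use strict;
use warnings;
use autodie;                      # open/close die with a useful message on failure
use File::Temp qw(tempfile);

# Write a character string to $path as UTF-8, then read the first line back.
# round_trip() is a hypothetical helper for illustration only.
sub round_trip {
    my ($path, $chars) = @_;

    open my $out_fh, '>:encoding(UTF-8)', $path;
    print {$out_fh} $chars;
    close $out_fh;

    open my $in_fh, '<:encoding(UTF-8)', $path;
    my $line = <$in_fh>;
    close $in_fh;

    return $line;
}

my (undef, $path) = tempfile(UNLINK => 1);
my $original = "Clich\x{e9}.";    # character string: e-acute is one character
my $copy     = round_trip($path, $original);

binmode STDOUT, ':encoding(UTF-8)';
print "round trip ", ($copy eq $original ? "preserved" : "changed"), " the string\n";
print "characters: ", length($copy), ", bytes on disk: ", -s $path, "\n";
```

The layers only round-trip cleanly when the string handed to `print` holds characters. If it already holds UTF-8 bytes, the output layer encodes them a second time.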
Re^2: Write special chars to PDF. UTF8? by tel2 (Pilgrim) on Feb 12, 2016 at 23:24 UTC
G'day from across the ditch, Ken. You're talkin' my language, mate.

Thanks very much for your time and all your tips.

The reason I wrote $f1 to the file and read it back into $f2 was just to make sure the variables weren't changing in the process, and from what I can tell they aren't.

I'm struggling to understand how this issue is about writing/reading the file. My reasons are:

1. If I remove my "quick hack" and change the webpage's "Output: $f1" line to "Output: $f2" (which it was meant to be originally - sorry), the e-acute appears on the webpage correctly.

2. If I print $f1 (which has not been read from a file) to the PDF (e.g. $text->text("PDF Output:$f1=$f2");) no acutes appear correctly.

3. If I write $f1 to a file as you have suggested, and read it back into $f3, it then contains more bytes than $f1, and printing $f3 to the PDF still doesn't print e-acute properly.

Below is some modified code which demonstrates this (sorry, I haven't brought it into the general coding standards you've suggested at this stage).

#!/usr/bin/perl

use lib "/home/tospeirs/perl5/lib/perl5";
use CGI;
use PDF::API2;
use bytes;

use constant mm => 25.4 / 72;

$cgi = new CGI;
$f1 = $cgi->param(f1);

if (defined($f1))
{
        open (FILE, ">utf8_test1.out") or die "Can't open outfile";
        print FILE $f1;
        close FILE;

        open (FILE, "<utf8_test1.out") or die "Can't open infile";
        $f2 = <FILE>;
        close FILE;

        open my $fh, '>:encoding(UTF-8)', 'utf8_test2.out';
        print $fh $f1;
        close $fh;

        open my $fh, '<:encoding(UTF-8)', 'utf8_test2.out';
        $f3 = <$fh>;
        close $fh;

        $lengths = "Lengths: f1=" . bytes::length($f1) . ", f2=" . bytes::length($f2) . ", f3=" . bytes::length($f3);
        $cmp  = ($f1 eq $f2) ? 'f1=f2' : 'f1<>f2';
        $cmp .= ($f1 eq $f3) ? ', f1=f3' : ', f1<>f3';

        $pdf = PDF::API2->new();

        $font1 = $pdf->corefont('Arial');

        $page = $pdf->page;             # Add blank page
        $page->mediabox(210/mm, 297/mm);

        $text = $page->text();

        $text->font($font1, 28);
        $text->translate(5/mm ,280/mm);

        # A quick hack to handle a couple of special chars
        #$f2 =~ s/\303\251/\351/g;  # e-acute
        #$f2 =~ s/\303\272/\372/g;  # u-acute

        $text->text("PDF Output:$f1=$f2=$f3");

        $pdf->saveas('utf8_test1.pdf');
}

print <<EOF;
Content-Type: text/html; charset=utf-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
        <title>Test UTF-8</title>
        <meta charset='UTF-8'>
</head>
<body>
<form method='post'>
        Input: <input type='text' name='f1' value='$f1'>
        <br>
        <input type='submit' name='submit' value='Submit'>
        <br>
        Output f2: $f2
        <br>
        Output f3: $f3
        <br>
        $lengths
        <br>
        $cmp
</form>
</body>
</html>
EOF

This is what I see on the webpage after I submit "Cliché.":

Input: Cliché.
Submit
Output f2: Cliché.
Output f3: ClichÃ©.
Lengths: f1=8, f2=8, f3=10
f1=f2, f1<>f3

And the PDF ends up containing this:

PDF Output:ClichÃ©.=ClichÃ©.=ClichÃƒÂ©.

As you can see, none of those 3 came out right in the PDF, and the $f3 looks extra long, as if it's been double-encoded or something. Check this octal dump out:

$ od -c utf8_test1.out
0000000   C   l   i   c   h 303 251   .
$ od -c utf8_test2.out
0000000   C   l   i   c   h 303 203 302 251   .

Any ideas?

Thanks.
tel2
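The extra bytes in utf8_test2.out are consistent with double encoding. The sketch below assumes the CGI parameter arrives as raw UTF-8 bytes (as the f1=8 length suggests) and uses a literal byte string in its place. `octal_dump` is a hypothetical helper that mimics how `od -c` displays high bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Show a byte string roughly the way `od -c` does: high bytes as octal.
# octal_dump() is a hypothetical helper for this illustration.
sub octal_dump {
    my ($bytes) = @_;
    return join ' ',
        map { $_ > 127 ? sprintf('%03o', $_) : chr($_) }
        unpack 'C*', $bytes;
}

# Stand-in for the CGI parameter: the UTF-8 *bytes* for "Cliché." (8 bytes).
my $raw = "Clich\303\251.";

# A :encoding(UTF-8) output layer treats each byte of $raw as a separate
# character and encodes it again; encode() has the same effect on $raw.
my $double = encode('UTF-8', $raw);

print octal_dump($raw),    "\n";   # C l i c h 303 251 .
print octal_dump($double), "\n";   # C l i c h 303 203 302 251 .
print "lengths: ", length($raw), " and ", length($double), "\n";   # 8 and 10
```

Each of the two high bytes (303 and 251) becomes a two-byte sequence, which accounts for the jump from 8 to 10 bytes reported above.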
Re^3: Write special chars to PDF. UTF8? by poj (Abbot) on Feb 13, 2016 at 09:44 UTC
Try using `decode()` for the pdf

#!/perl
use strict;
use warnings;
use CGI;
use CGI::Carp 'fatalsToBrowser';
use PDF::API2;
use Encode;

my $cgi = new CGI;
my $f1 = $cgi->param('f1');
my $f2 = decode('UTF-8', $f1 );

open OUT,'>','c:/temp/web/pdf.txt' or die; # change path to suit
print OUT "$f1 $f2";
close OUT;

my $pdf = PDF::API2->new()->mediabox('A4');
my $text = $pdf->page->text;
my $font1 = $pdf->corefont('Arial');
$text->font($font1, 36);
$text->translate(100,500);
$text->text("f1 = $f1");
$text->translate(100,600);
$text->text("f2 = $f2");
$pdf->saveas('c:/temp/web/utf8_test1.pdf'); # change path to suit

print <<EOF;
Content-Type: text/html; charset=UTF-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
<title>Test UTF-8</title>
<meta charset="UTF-8">
</head><body>
$f1 $f2
<form method="post">
Input: <input type="text" name="f1" value="$f1"><br>
<input type="submit" name="submit" value="Submit">
</form></body></html>
EOF

poj
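What `decode()` changes can be seen without CGI or PDF::API2. This is a small sketch in which a literal byte string stands in for `$cgi->param('f1')`. That stand-in is an assumption, based on the byte values shown earlier in the thread:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for $cgi->param('f1'): the UTF-8 bytes posted for "Cliché."
my $f1 = "Clich\303\251.";

# Decode once, at the point where the bytes enter the program.
my $f2 = decode('UTF-8', $f1);

printf "f1: %d bytes, f2: %d characters\n", length($f1), length($f2);

# After decoding, e-acute is the single code point U+00E9 (octal 351),
# the same value the original quick hack substituted by hand.
printf "e-acute code point: U+%04X\n", ord(substr($f2, 5, 1));
```

The quick hack and `decode()` both turn the byte pair 303 251 into the single value 351, but `decode()` does so for every character. That the PDF text then renders correctly is an inference from the thread, not something verified here against PDF::API2.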
Re^4: Write special chars to PDF. UTF8? by tel2 (Pilgrim) on Feb 14, 2016 at 23:18 UTC