Re: Another Unicode/emoji question
by kcott (Archbishop) on Dec 22, 2023 at 06:57 UTC
|
$ perl -E '
use strict;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};
say q{\x{1f436} = }, "\x{1f436}";
say q{\x{1F436} = }, "\x{1F436}";
say q{\N{DOG FACE} = }, "\N{DOG FACE}";
say q{🐶 = }, "🐶";
'
\x{1f436} = 🐶
\x{1F436} = 🐶
\N{DOG FACE} = 🐶
🐶 = 🐶
In HTML, you can use the entities &#x1f436; (renders as: 🐶)
or &#128054; (renders as: 🐶).
There's potentially other ways to achieve this that I haven't immediately thought of.
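As a rough sketch of how the hex and decimal HTML entity forms relate to the code point, both can be derived with ord and sprintf. The helper name entities_for is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hex and decimal HTML numeric
# character references for the first character of a string.
sub entities_for {
    my ($char) = @_;
    my $cp = ord $char;    # code point of the first character
    return (sprintf('&#x%x;', $cp), sprintf('&#%d;', $cp));
}

my ($hex, $dec) = entities_for("\x{1f436}");
print "$hex $dec\n";
```

For DOG FACE this gives the hex form with 1f436 and the decimal form with 128054, the same two numbers passed to chr elsewhere in this thread.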
| [reply] [d/l] [select] |
|
|
"There's potentially other ways to achieve this that I haven't immediately thought of."
$ alias perlu
alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'
$ perlu 'say chr 0x1f436; say chr 128054;'
🐶
🐶
| [reply] [d/l] |
Re: Another Unicode/emoji question
by cavac (Prior) on Dec 22, 2023 at 08:41 UTC
|
Aside from the answer about the correct Unicode code point by kcott, it seems some calendar systems are more forgiving than others when it comes to encoding.
If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark. I haven't played with ICS calendar stuff in many, MANY years, so I'm just guessing. But at least a quick Google search suggests that a BOM might be required.
| [reply] |
|
|
If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark
Even after applying the encoding from Re^3: Another Unicode/emoji question, it is not displaying, so I need to try BOM next...
Any suggestions on the best way to add the BOM? File::BOM and String::BOM appear only to read the files, not generate them. Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?
| [reply] [d/l] |
|
|
Or is it as simple as writing \x{efbbbf} as the first thing after the HTTP headers?
The string my $str = "\x{efbbbf}"; does not contain the BOM character, it contains U+EFBBBF, which is not a valid Unicode character (AFAIK: I believe Unicode only goes to U+10FFFF). The string my $str = "\x{feff}"; contains the BOM character.
If you did use the string you suggested, whether with raw mode or with UTF-8 output encoding, you will not get what you thought:
C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\x{efbbbf})" | xxd
Wide character in print at -e line 1.
00000000: f8bb bbae bf .....
C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{efbbbf})" | xxd
Code point 0xEFBBBF is not Unicode, may not be portable in print at -e line 1.
00000000: 5c78 7b45 4642 4242 467d \x{EFBBBF}
Neither of those outputs the UTF-8 bytes for the BOM U+FEFF character.
Instead, you either need to manually send the three octets separately in raw mode, or use raw mode and manually encode from a perl string into UTF-8 bytes, or use UTF-8 output encoding and send the U+FEFF character from the string directly:
C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\xef\xbb\xbf)" | xxd
00000000: efbb bf ...
C:\Users\Peter> perl -MEncode -e "binmode STDOUT, ':raw'; print Encode::encode('UTF-8', qq(\x{feff}));" | xxd
00000000: efbb bf ...
C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{feff})" | xxd
00000000: efbb bf ...
Whether or not that would "work" in your use-case is something I don't know: my guess is that it won't help, because anything that's using HTTP headers should be paying attention to the encoding listed in the headers, and not requiring a BOM in the message body. Though I guess if it's saving the HTTP message body into a file, and then later using that file, maybe the BOM would help. I don't know on that, sorry.
-- warning: Windows quoting used in code blocks; swap quotes around if you're on linux
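A minimal sketch of the "raw mode plus manual encoding" option as a reusable helper, assuming only the core Encode module. The name utf8_bytes and its bom option are invented for illustration. It is meant to be saved as a script file, so the shell quoting caveat does not apply.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string as UTF-8 octets,
# optionally prefixed with the BOM (U+FEFF, octets EF BB BF).
sub utf8_bytes {
    my ($text, %opt) = @_;
    $text = "\x{feff}" . $text if $opt{bom};
    return encode('UTF-8', $text);
}

# Show the octets as hex instead of piping through xxd.
my $bytes = utf8_bytes("\x{1f436}", bom => 1);
print join(' ', unpack '(H2)*', $bytes), "\n";
```

The first three octets printed are the BOM and the remaining four are the dog face. To send the result over HTTP, STDOUT would still need to be in raw mode, as in the one-liners above.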
| [reply] [d/l] [select] |
If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark
Thanks. My URL doesn't have a BOM and I hadn't considered the possibility that it might need one!
I have changed the code as suggested by kcott and I'll now wait until tomorrow for Google to hit my calendar endpoint and see if it works. If not, another day and I'll look into whether a BOM is the issue. Debugging this is taking forever as I have to wait for Google to refresh, which it does approximately once per day.
| [reply] |
Re: Another Unicode/emoji question
by ikegami (Patriarch) on Dec 22, 2023 at 05:37 UTC
|
| [reply] |
|
|
#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
use lib "$ENV{'DOCUMENT_ROOT'}/../lib";
use Site::Utils;
use utf8;
use Pawsies;
my $pawsies = Pawsies->new;
my $template = Template->new(INCLUDE_PATH => $Site::Variables::template_path);
###################
# Control Variables
#
# write to file each time Google calls API
my $debug = 1;
#
# number of days in the past to sync calendar
my $sync = 28;
#
###################
$data{'format'} = 'calendar' unless $data{'format'} eq 'plain';
print "Content-type: text/$data{'format'}; charset=utf-8\n\n";
my @day;
my $query = $dbh->prepare("SELECT *, DATE_FORMAT(start, '%Y%m%dT%H%i%s') AS dtstart ....");
$query->execute($sync);
while (my $row = $query->fetchrow_hashref) {
$row->{'dog'} = dogName($row->{'idBooking'});
push @day, $row;
}
my $vars = {
'day' => \@day,
};
$template->process("admin/google/ian.tt", $vars) or die $template->error;
| [reply] [d/l] |
|
|
You forgot to encode.
Simple way to fix:
use open ":std", ":encoding(UTF-8)";
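To illustrate what the missing encoding layer changes, here is a small sketch that prints the same string through a raw handle and through a UTF-8 encoding handle and compares the octets. It uses in-memory filehandles so nothing touches STDOUT's layers. The helper name written_bytes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory filehandle opened
# with the given layer and return the octets that were written.
sub written_bytes {
    my ($layer, $text) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open: $!";
    print {$fh} $text;
    close $fh or die "close: $!";
    return $buf;
}

my $text = "caf\x{e9}";    # four characters, the last one is U+00E9
printf "%-18s %s\n", $_, join ' ', unpack '(H2)*', written_bytes($_, $text)
    for ':raw', ':encoding(UTF-8)';
```

Without the layer, U+00E9 goes out as the single octet e9, which is not valid UTF-8 on its own. With the layer it becomes c3 a9, which is what a client told charset=utf-8 expects.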
| [reply] [d/l] |
Re: Another Unicode/emoji question
by Anonymous Monk on Jan 15, 2024 at 05:23 UTC
|
About not waiting 24 hours: you can trick Google by adding a harmless fragment to the URL.
For example, if your URL is: https://example.com/example.ics
Delete it and add a new one: https://example.com/example.ics#1
Then for the next test, delete it and add: https://example.com/example.ics#2
And so on. Google will treat them as new URLs to fetch, rather than using the cached result.
| [reply] |
Template and Unicode (was: Re: Another Unicode/emoji question)
by Bod (Parson) on Dec 28, 2023 at 22:09 UTC
|
As it is a Festive Break 🎄 I've had the opportunity to test this calendar import and find out what is really going on...
I created a simple test script to generate a single calendar entry from noon to 2pm tomorrow, i.e. the day after the ICS feed is accessed.
#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
use lib "$ENV{'DOCUMENT_ROOT'}/../lib";
use open ":std", ":encoding(UTF-8)";
use Site::Utils;
my $template = Template->new(INCLUDE_PATH => $Site::Variables::template_path);
$data{'format'} = 'calendar' unless $data{'format'} eq 'plain';
print "Content-type: text/$data{'format'}; charset=utf-8\n\n";
#print "\x{feff}"; # BOM
my ($date, $uid) = $dbh->selectrow_array("SELECT DATE_FORMAT(NOW() + INTERVAL 1 DAY, '%Y%m%d'), DATE_FORMAT(NOW(), '%Y-%j-%H%i%s')");
if ($data{'template'}) {
$template->process("admin/google/dogface.tt", $vars) or die $template->error;
exit;
}
print<<"END";
BEGIN:VCALENDAR
VERSION:2.0
PRODID:Pawsies Calendar 1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
BEGIN:VEVENT
SUMMARY:\x{1f436} Dog Face Test
UID:DFT$uid\@pawsies.uk
SEQUENCE:1
DTSTAMP:${date}T120000
DTSTART:${date}T120000
DTEND:${date}T140000
END:VEVENT
END:VCALENDAR
END
The module Site::Utils provides the database handle $dbh and splits the HTTP query string and puts it into %data.
If the BOM is included, Google Calendar doesn't display the entry at all. With the BOM omitted, the entry is displayed.
But, if we print the ICS data directly from the script, the 🐶 emoji is displayed correctly. If we use Template to handle the printing, instead of 🐶 we get the literal \x{1f436}... So, it appears to be Template that is not printing the Unicode characters.
Try it here:
Printing from script
Printing with Template
Of course, knowing where the problem exists is different to being able to solve it...
Do you have any experience of printing Unicode using Template?
| [reply] [d/l] [select] |
|
|
It was several years ago, but I have used Template with Unicode a lot. You can have UTF-8 encoded templates and UTF-8 strings in variables - you just need to declare them consistently.
The very first configuration parameter documented in Template::Manual::Config is ENCODING. I'll quote it here because it is so short:
ENCODING
The ENCODING option specifies the template files' character encoding:
my $template = Template->new({
ENCODING => 'utf8',
});
A template which starts with a Unicode byte order mark (BOM) will have its encoding detected automatically.
So, the following program works as I'd expect:
use Template;
use open ":std", ":encoding(UTF-8)";
my $dogface = "\N{DOG FACE}";
my $template = <<"ENDOFTEMPLATE";
Dog face from template: $dogface
Dog face from variable: [% dogface %]
ENDOFTEMPLATE
my $tt = Template->new(ENCODING => 'UTF-8');
$tt->process(\$template,{dogface => $dogface});
The automatic BOM detection is a handy band-aid if you have a mixture of Latin1 and UTF-8 encoded templates and don't want to re-code them.
| [reply] [d/l] [select] |
|
|
A template is usually an HTML document, not Perl source code, so Perl escapes don't work there. HTML entities do, though. You can always pass a constant from Perl to a template, too.
#!/usr/bin/perl
use warnings;
use strict;
use open OUT => ':encoding(UTF-8)', ':std';
use Template;
'Template'->new->process(\'SUMMARY:[% chr(128054) %] [% dog %] &#x1f436; Dog Face Test',
{dog => "\x{1f436}",
chr => \&CORE::chr});
Update: Added the chr sub.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |