Another Unicode/emoji question

Bod has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Another Unicode/emoji question by kcott (Archbishop) on Dec 22, 2023 at 06:57 UTC
G'day Bod,

I don't know where you got `\x{e052}` from. That's a codepoint in Unicode PDF Code Chart "Private Use Area (Range: E000-F8FF)". What you want is `U+01F436`, which is in Unicode PDF Code Chart "Miscellaneous Symbols and Pictographs (Range: 1F300-1F5FF)".

There's a number of ways to generate that character with Perl:

    $ perl -E '
        use strict;
        use warnings;
        use utf8;
        use open OUT => qw{:encoding(UTF-8) :std};

        say q{\x{1f436} = }, "\x{1f436}";
        say q{\x{1F436} = }, "\x{1F436}";
        say q{\N{DOG FACE} = }, "\N{DOG FACE}";
        say q{🐶 = }, "🐶";
    '
    \x{1f436} = 🐶
    \x{1F436} = 🐶
    \N{DOG FACE} = 🐶
    🐶 = 🐶

In HTML, you can use the entities `&#x1f436;` (renders as: 🐶) or `&#128054;` (renders as: 🐶).

There's potentially other ways to achieve this that I haven't immediately thought of.

— Ken
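To make the relationship between the spellings above concrete, here is a minimal sketch using only core modules. It shows that each Perl spelling yields the same single code point, which UTF-8 octets that code point becomes on output, and how the two HTML entities are derived from it. The helper name `utf8_hex` is made up for this illustration.

```perl
use strict;
use warnings;
use charnames ':full';      # makes \N{DOG FACE} available on older perls too
use Encode qw(encode);

# Illustrative helper: UTF-8 octets of a character string, as spaced hex.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, encode('UTF-8', $chars);
}

# Three spellings of the same character, as listed in the post above.
for my $dog ("\x{1f436}", "\x{1F436}", "\N{DOG FACE}") {
    my $cp = ord $dog;
    printf "U+%04X  %s  &#x%x;  &#%d;\n", $cp, utf8_hex($dog), $cp, $cp;
}
# Each iteration prints: U+1F436  f0 9f 90 b6  &#x1f436;  &#128054;
```

Nothing here prints the character itself, so no output layer is needed; the script only emits ASCII.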
Re^2: Another Unicode/emoji question by kcott (Archbishop) on Dec 22, 2023 at 09:37 UTC
"There's potentially other ways to achieve this that I haven't immediately thought of."

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    $ perlu 'say chr 0x1f436; say chr 128054;'
    🐶
    🐶

— Ken
Re: Another Unicode/emoji question by cavac (Prior) on Dec 22, 2023 at 08:41 UTC
Aside from kcott's answer about the correct Unicode code point, it seems some calendar systems are more forgiving than others when it comes to encoding.

If Unicode still causes problems, check (with a hex editor) whether the file you generate has a byte order mark. I haven't played with ICS calendar stuff in many, MANY years, so I'm just guessing. But at least a quick Google search suggests that a BOM might be required.
Re^2: Another Unicode/emoji question by Bod (Parson) on Dec 22, 2023 at 15:30 UTC
"If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark"

Thanks. My URL doesn't have a BOM and I hadn't considered the possibility that it might need one!

I have changed the code as suggested by kcott and I'll now wait until tomorrow for Google to hit my calendar endpoint and see if it works. If not, another day and I'll look into whether a BOM is the issue.

Debugging this is taking forever as I have to wait for Google to refresh, which it does approximately once per day.
BOM (was: Re^2: Another Unicode/emoji question) by Bod (Parson) on Dec 23, 2023 at 16:46 UTC
"If Unicode still makes problems, check (with a hex editor) if the file you generate has a byte order mark"

Even after applying the encoding from Re^3: Another Unicode/emoji question, it is not displaying, so I need to try BOM next...

Any suggestions on the best way to add the BOM? File::BOM and String::BOM appear only to read the files, not generate them.

Or is it as simple as writing `\x{efbbbf}` as the first thing after the HTTP headers?
Re: BOM (was: Re^2: Another Unicode/emoji question) by pryrt (Abbot) on Dec 24, 2023 at 16:10 UTC
"Or is it as simple as writing `\x{efbbbf}` as the first thing after the HTTP headers?"

The string `my $str = "\x{efbbbf}";` does not contain the BOM character; it contains U+EFBBBF, which is not a valid Unicode character (AFAIK: I believe Unicode only goes to U+10FFFF). The string `my $str = "\x{feff}";` contains the BOM character.

If you did use the string you suggested, whether with raw mode or with UTF-8 output encoding, you will not get what you thought:

    C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\x{efbbbf})" | xxd
    Wide character in print at -e line 1.
    00000000: f8bb bbae bf                             .....

    C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{efbbbf})" | xxd
    Code point 0xEFBBBF is not Unicode, may not be portable in print at -e line 1.
    00000000: 5c78 7b45 4642 4242 467d                 \x{EFBBBF}

Neither of those outputs the UTF-8 bytes for the BOM U+FEFF character.

Instead, you need to do one of the following: send the three octets separately in raw mode; use raw mode and manually encode from a Perl string into UTF-8 bytes; or use UTF-8 output encoding and send the U+FEFF character from the string directly:

    C:\Users\Peter> perl -e "binmode STDOUT, ':raw'; print qq(\xef\xbb\xbf)" | xxd
    00000000: efbb bf                                  ...

    C:\Users\Peter> perl -MEncode -e "binmode STDOUT, ':raw'; print Encode::encode('UTF-8', qq(\x{feff}));" | xxd
    00000000: efbb bf                                  ...

    C:\Users\Peter> perl -e "use open ':std' => ':encoding(UTF-8)'; print qq(\x{feff})" | xxd
    00000000: efbb bf                                  ...

Whether or not that would "work" in your use case is something I don't know. My guess is that it won't help, because anything that's using HTTP headers should be paying attention to the encoding listed in the headers, and not requiring a BOM in the message body. Though I guess if it's saving the HTTP message body into a file, and then later using that file, maybe the BOM would help. I don't know on that, sorry.

-- warning: Windows quoting used in code blocks; swap quotes around if you're on Linux
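The "raw mode plus manual encoding" option can also be written as a small script rather than a one-liner. This is only a sketch under assumptions: `utf8_body` is a hypothetical helper name, and the header and SUMMARY line are placeholders rather than anyone's real endpoint. Whether the consumer actually wants a BOM is a separate question that this does not answer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into UTF-8 octets,
# optionally prefixed with the byte order mark U+FEFF.
sub utf8_body {
    my ($text, %opt) = @_;
    my $chars = $opt{bom} ? "\x{feff}$text" : $text;
    return encode('UTF-8', $chars);
}

# Raw handle: the octets produced above are written unchanged.
binmode STDOUT, ':raw';
print "Content-type: text/calendar; charset=utf-8\n\n";
print utf8_body("SUMMARY:\x{1f436} Dog Face Test\n", bom => 1);
```

Keeping the BOM behind a flag makes it easy to test the same feed with and without it.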
Re^2: BOM (was: Re^2: Another Unicode/emoji question) by Bod (Parson) on Dec 25, 2023 at 21:04 UTC
Re^2: Another Unicode/emoji question by Bod (Parson) on Dec 23, 2023 at 09:51 UTC
It looks like a BOM is the next thing to try... Google has updated the calendar and displayed `\x{1f436}` instead of 🐶
Re: Another Unicode/emoji question by ikegami (Patriarch) on Dec 22, 2023 at 05:37 UTC
How do you generate and encode the string from the template?
Re^2: Another Unicode/emoji question by Bod (Parson) on Dec 22, 2023 at 15:22 UTC
A bit like this...

    #!/usr/bin/perl
    use CGI::Carp qw(fatalsToBrowser);

    use strict;
    use warnings;

    use lib "$ENV{'DOCUMENT_ROOT'}/../lib";

    use Site::Utils;
    use utf8;
    use Pawsies;

    my $pawsies = Pawsies->new;

    my $template = Template->new(INCLUDE_PATH => $Site::Variables::template_path);

    ###################
    # Control Variables
    #
    # write to file each time Google calls API
    my $debug = 1;
    #
    # number of days in the past to sync calendar
    my $sync = 28;
    #
    ###################

    $data{'format'} = 'calendar' unless $data{'format'} eq 'plain';
    print "Content-type: text/$data{'format'}; charset=utf-8\n\n";

    my @day;
    my $query = $dbh->prepare("SELECT *, DATE_FORMAT(start, '%Y%m%dT%H%i%s') AS dtstart ....");
    $query->execute($sync);
    while (my $row = $query->fetchrow_hashref) {
        $row->{'dog'} = dogName($row->{'idBooking'});
        push @day, $row;
    }

    my $vars = {
        'day' => \@day,
    };

    $template->process("admin/google/ian.tt", $vars) or die $template->error;
Re^3: Another Unicode/emoji question by ikegami (Patriarch) on Dec 23, 2023 at 05:46 UTC
You forgot to encode.

Simple way to fix: `use open ":std", ":encoding(UTF-8)";`
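For readers unsure what "encode" means here, a minimal sketch of the distinction: Perl holds text as characters, while the HTTP response must carry octets. The explicit `binmode` call is the per-handle equivalent of the `use open` line; the sample string is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{1f436} Dog Face Test";   # decoded text: one element per character
my $octets = encode('UTF-8', $chars);     # what has to go over the wire

printf "characters: %d, octets: %d\n", length($chars), length($octets);
# prints: characters: 15, octets: 18

# Same effect as the `use open` line, but applied to STDOUT only:
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";                         # the layer encodes on output
```

With the layer in place, print character strings only; printing already-encoded octets through it would encode them a second time.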
Re^4: Another Unicode/emoji question by Bod (Parson) on Dec 23, 2023 at 11:47 UTC
Re^5: Another Unicode/emoji question by kcott (Archbishop) on Dec 24, 2023 at 11:11 UTC
Re^5: Another Unicode/emoji question by ikegami (Patriarch) on Dec 23, 2023 at 14:42 UTC
Template and Unicode (was: Re: Another Unicode/emoji question) by Bod (Parson) on Dec 28, 2023 at 22:09 UTC
As it is a Festive Break 🎄 I've had the opportunity to test this calendar import and find out what is really going on...

I created a simple test script to generate a single calendar entry from noon to 2pm on the day after the ICS feed is accessed.

    #!/usr/bin/perl
    use CGI::Carp qw(fatalsToBrowser);

    use strict;
    use warnings;

    use lib "$ENV{'DOCUMENT_ROOT'}/../lib";
    use open ":std", ":encoding(UTF-8)";

    use Site::Utils;

    my $template = Template->new(INCLUDE_PATH => $Site::Variables::template_path);

    $data{'format'} = 'calendar' unless $data{'format'} eq 'plain';
    print "Content-type: text/$data{'format'}; charset=utf-8;\n\n";
    #print "\x{feff}"; # BOM

    my ($date, $uid) = $dbh->selectrow_array("SELECT DATE_FORMAT(NOW() + INTERVAL 1 DAY, '%Y%m%d'), DATE_FORMAT(NOW(), '%Y-%j-%H%i%s')");

    if ($data{'template'}) {
        $template->process("admin/google/dogface.tt", $vars) or die $template->error;
        exit;
    }

    print<<"END";
    BEGIN:VCALENDAR
    VERSION:2.0
    PRODID:Pawsies Calendar 1.0//EN
    CALSCALE:GREGORIAN
    METHOD:PUBLISH
    BEGIN:VEVENT
    SUMMARY:\x{1f436} Dog Face Test
    UID:DFT$uid\@pawsies.uk
    SEQUENCE:1
    DTSTAMP:${date}T120000
    DTSTART:${date}T120000
    DTEND:${date}T140000
    END:VEVENT
    END:VCALENDAR
    END

The module `Site::Utils` provides the database handle `$dbh`, splits the HTTP query string and puts it into `%data`.

If the BOM is included, Google Calendar doesn't display the entry at all. With the BOM omitted, the entry is displayed.

But if we print the ICS data directly from the script, the 🐶 emoji is displayed correctly. If we use Template to handle the printing, instead of 🐶 we get the literal `\x{1f436}`...

So, it appears to be Template that is not printing the Unicode characters.

Try it here: Printing from script, Printing with Template

Of course, knowing where the problem exists is different to being able to solve it... Do you have any experience of printing Unicode using Template?
Re: Template and Unicode (was: Re: Another Unicode/emoji question) by haj (Vicar) on Dec 28, 2023 at 23:31 UTC
It was several years in the past, but I have used Template with Unicode a lot. You can have UTF-8 encoded templates and UTF-8 strings in variables; you just need to declare them consistently.

The very first configuration parameter documented in Template::Manual::Config is `ENCODING`. I'll quote it here because it is so short:

ENCODING

The ENCODING option specifies the template files' character encoding:

    my $template = Template->new({
        ENCODING => 'utf8',
    });

A template which starts with a Unicode byte order mark (BOM) will have its encoding detected automatically.

So, the following program works as I'd expect:

    use Template;
    use open ":std", ":encoding(UTF-8)";

    my $dogface = "\N{DOG FACE}";
    my $template = <<"ENDOFTEMPLATE";
    Dog face from template: $dogface
    Dog face from variable: [% dogface %]
    ENDOFTEMPLATE

    my $tt = Template->new(ENCODING => 'UTF-8');
    $tt->process(\$template,{dogface => $dogface});

The automatic BOM detection is a handy band-aid if you have a mixture of Latin1 and UTF-8 encoded templates and don't want to re-code them.
Re: Template and Unicode (was: Re: Another Unicode/emoji question) by choroba (Cardinal) on Dec 28, 2023 at 22:42 UTC
A template is usually an HTML document, not Perl source code, so Perl escapes don't work there. HTML entities do, though. You can always pass a constant from Perl to a template, too.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use open OUT => ':encoding(UTF-8)', ':std';
    use Template;

    'Template'->new->process(\'SUMMARY:[% chr(128054) %] [% dog %] &#x1f436; Dog Face Test',
                             {dog => "\x{1f436}", chr => \&CORE::chr});

Update: Added the chr sub.
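The point about Perl escapes can be seen without Template at all. `\x{...}` is only interpreted when it appears in a double-quoted string in Perl source; the same characters arriving as data (from a template file, a database, a single-quoted string) are just a backslash followed by letters. The sketch below shows the difference and, purely as an illustration, a hypothetical helper `expand_hex_escapes` that converts such escapes in data. Passing the real character in through a variable, as shown above, is usually the cleaner route.

```perl
use strict;
use warnings;

my $from_source = "SUMMARY:\x{1f436} Dog Face Test";   # escape interpreted by perl
my $from_data   = 'SUMMARY:\x{1f436} Dog Face Test';   # literal backslash-x text

# Hypothetical helper: expand \x{HEX} sequences found in plain data.
sub expand_hex_escapes {
    my ($text) = @_;
    $text =~ s/\\x\{([0-9A-Fa-f]+)\}/chr(hex($1))/ge;
    return $text;
}

print "same before expanding: ", ($from_source eq $from_data ? "yes" : "no"), "\n";
print "same after expanding:  ",
    ($from_source eq expand_hex_escapes($from_data) ? "yes" : "no"), "\n";
```

The first comparison reports "no" and the second "yes", which matches the symptom of a literal `\x{1f436}` showing up in the calendar.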
Re: Another Unicode/emoji question by Anonymous Monk on Jan 15, 2024 at 05:23 UTC
About not waiting 24 hours: you can trick Google by adding a harmless tag to the URL. For example, if your URL is:

https://example.com/example.ics

Delete it and add a new one:

https://example.com/example.ics#1

Then for the next test, delete it and add:

https://example.com/example.ics#2

And so on. Google will treat them as new URLs to fetch, rather than using the cached result.
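If you go through many test rounds, generating the numbered URLs avoids typos. A tiny sketch, assuming the fragment trick behaves as described above; `with_fragment` is a made-up helper name and the URL is the example one from this post.

```perl
use strict;
use warnings;

# Hypothetical helper: set (or replace) the numeric fragment on a feed URL.
sub with_fragment {
    my ($url, $n) = @_;
    (my $base = $url) =~ s/#.*\z//s;    # drop any existing fragment
    return "$base#$n";
}

print with_fragment('https://example.com/example.ics', $_), "\n" for 1 .. 3;
```

The fragment is never sent to the server, so the endpoint itself needs no changes.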