Tangentially, does anyone know why they are using this character in preference to the universally unproblematic HYPHEN-MINUS (0x2D) in the first place?
| [reply] [d/l] |
perl -MEncode -e 'print "because it idiotically takes ".length(encode_utf8("\x{2013}"))." bytes to say what ".length(encode_utf8("\x{2d}"))." byte can say as clearly, dummy.\n"'
| [reply] [d/l] |
I've put in a 'fix' - won't know if it works until HN puts out a title with the offensive string in it again. Thanks!
| [reply] |
Thanks. I hadn't done the character conversion correctly. I needed s/\x{2016}/.../.
There were three other characters I found in the recent feed as well: 2019, 201C, and 201D.
If you see any others needing to be converted, please drop me a note. Thanks!
| [reply] [d/l] |
How about converting it to just a plain ASCII hyphen, \x{2D}? The content will stay the same but be more accessible and smaller.
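For what it's worth, here is a minimal sketch of that idea using tr///. The sub name to_ascii_punct is hypothetical (it is not from the nodelet code), and the list of characters is only an assumption based on the ones mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: fold the typographic punctuation mentioned in this
# thread down to plain ASCII equivalents.
sub to_ascii_punct {
    my ($s) = @_;
    # EN DASH -> "-", single quotes -> "'", double quotes -> '"'.
    # The hyphen is escaped so tr/// does not read it as a range operator.
    $s =~ tr/\x{2013}\x{2018}\x{2019}\x{201C}\x{201D}/\-''""/;
    return $s;
}

print to_ascii_punct("\x{201C}fancy\x{201D} \x{2013} it\x{2019}s"), "\n";
```

tr/// is a one-to-one character mapping, so it only fits the plain-ASCII route; mapping to multi-character entity names still needs s///.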
| [reply] [d/l] |
s/\x{2013}/&ndash;/g;
s/\x{2019}/'/g;
s/\x{201C}/&quot;/g;
s/\x{201D}/&quot;/g;
This seems like a suboptimal approach to me. Does anyone have any better ideas?
Today's latest and greatest software contains tomorrow's zero-day exploits.
| [reply] [d/l] |
This is a response to what you have here plus other posts throughout this thread.
- Slashdot nodelet and HackerNews nodelet both have content now. (ref. #11156828)
- I checked U+2013 EN DASH in a number of places: all seem to be rendered correctly. (ref. OP)
- In #11156847 you wrote "I needed s/\x{2016}/.../.": that's a "‖" character.
  U+2016 DOUBLE VERTICAL LINE may be needed but, from the context, and the fact that this is only mentioned once,
  I wondered if this might be a typo.
- U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK
  are the pair "“" & "”".
  Instead of converting both to "&quot;", perhaps using "&ldquo;" & "&rdquo;" might be a better option.
  (ref. #11156847 and #11156884)
- U+2019 RIGHT SINGLE QUOTATION MARK is perhaps being used as a fancy apostrophe;
  I'm not seeing an example at the time of writing.
  Similar to the last dot point, you might want to proactively address the potential
  U+2018 LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK,
  being the pair "‘" & "’".
  And in the same vein, "&lsquo;" & "&rsquo;" might be better options.
  (ref. #11156847 and #11156884)
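As an aside, one way to double-check which character a code point refers to is the core charnames module. This is only a sketch: the list of code points is taken from this thread, and name_of is a hypothetical helper name.

```perl
use strict;
use warnings;
use charnames ();

# Code points mentioned in this thread.
my @code_points = (0x2013, 0x2016, 0x2018, 0x2019, 0x201C, 0x201D);

# Hypothetical helper: return the Unicode name for a code point,
# or a placeholder when charnames::viacode has no name for it.
sub name_of {
    my ($cp) = @_;
    return charnames::viacode($cp) // '(no name)';
}

printf "U+%04X %s\n", $_, name_of($_) for @code_points;
```

Running this makes a slip such as 2016 versus 2013 easy to spot, because the names are printed side by side.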
Rather than a whole bank of individual s///g, each of which needs to be run for every string,
I'd be more inclined to use a lookup table and a single s///g, which only needs to be run once for every string.
Something along these lines:
$ perl -Mutf8 -C -E '
my %ent_for_char = (
    "\x{2013}" => "&ndash;",
    "\x{2018}" => "&lsquo;",
    "\x{2019}" => "&rsquo;",
    "\x{201c}" => "&ldquo;",
    "\x{201d}" => "&rdquo;",
);
my $test_str = "“fancy double” – ‘fancy single’ – fancy’apostrophe";
say $test_str;
$test_str =~ s/(.)/exists $ent_for_char{$1} ? $ent_for_char{$1} : $1/eg;
say $test_str;
'
“fancy double” – ‘fancy single’ – fancy’apostrophe
&ldquo;fancy double&rdquo; &ndash; &lsquo;fancy single&rsquo; &ndash; fancy&rsquo;apostrophe
You can modify the table (e.g. add "\x{2014}" => "&mdash;",)
without requiring any changes to the code doing the processing.
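A possible refinement, offered as a sketch rather than a tested drop-in: build a character class from the table's keys, so the substitution only fires on characters that are actually in the table instead of on every character. It assumes the table maps characters to HTML entity names such as &ndash;. The names $char_class, $ent_re and encode_from_table are hypothetical.

```perl
use strict;
use warnings;

# Lookup table: typographic character => HTML entity (assumed mapping).
my %ent_for_char = (
    "\x{2013}" => '&ndash;',
    "\x{2018}" => '&lsquo;',
    "\x{2019}" => '&rsquo;',
    "\x{201C}" => '&ldquo;',
    "\x{201D}" => '&rdquo;',
);

# Turn each key into a \x{...} escape and join them into one character class.
my $char_class = join '', map { sprintf '\x{%04X}', ord } keys %ent_for_char;
my $ent_re     = qr/([$char_class])/;

# Hypothetical helper: replace every character found in the table.
sub encode_from_table {
    my ($s) = @_;
    $s =~ s/$ent_re/$ent_for_char{$1}/g;
    return $s;
}

print encode_from_table("\x{201C}fancy double\x{201D} \x{2013} fancy\x{2019}apostrophe"), "\n";
```

Adding a row to the table still requires no change to the processing code, because the character class is derived from the keys when the script starts.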
| [reply] [d/l] [select] |