Many Hacker News titles use U+2013 EN DASH. When listed in the "HackerNews nodelet" these are rendering as "â€“" instead of "–". Would it be possible to convert these?

I know nothing of the code behind the scenes. Here's a couple of potential options:

$ perl -Mutf8 -C -E '
    my $HN_title = "Show HN: Auto Wiki â€“ Turn your codebase into a Wiki";
    say $HN_title;
    say $HN_title =~ s/â€“/\N{EN DASH}/gr;
    say $HN_title =~ s/â€“/&ndash;/gr;
'
Show HN: Auto Wiki â€“ Turn your codebase into a Wiki
Show HN: Auto Wiki – Turn your codebase into a Wiki
Show HN: Auto Wiki &ndash; Turn your codebase into a Wiki

Using &ndash; is possibly the better option, but I'm just guessing.
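
For anyone wondering where the three-character sequence comes from, here is a minimal sketch. It assumes (this is a guess, not something confirmed about the nodelet code) that the feed's UTF-8 bytes are being read as Windows-1252 somewhere along the way. U+2013 encodes to the three bytes E2 80 93, and each of those bytes maps to one visible character in Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# U+2013 EN DASH as a Perl character string.
my $en_dash = "\x{2013}";

# Step 1: encode the character to its UTF-8 bytes.
my $bytes = encode('UTF-8', $en_dash);

# Step 2 (the assumed mistake): read those bytes as Windows-1252,
# which turns every byte into a separate character.
my $mojibake = decode('cp1252', $bytes);

printf "UTF-8 bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "Misdecoded:  %s\n", $mojibake;

# Undoing the two steps in reverse order recovers the original character.
my $repaired = decode('UTF-8', encode('cp1252', $mojibake));
printf "Repaired:    %s (U+%04X)\n", $repaired, ord $repaired;
```

If that assumption holds, decoding the feed as UTF-8 at the point where it is read would avoid the problem without any substitution at all.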

— Ken

Re: Hacker News titles using U+2013 EN DASH
by hippo (Archbishop) on Jan 09, 2024 at 10:19 UTC

    Tangentially, does anyone know why they are using this character in preference to the universally unproblematic HYPHEN-MINUS (0x2D) in the first place?


    🦛

      Why not?
        perl -MEncode -e 'print "because it idiotically takes ".length(encode_utf8("\x{2013}"))." bytes to say what ".length(encode_utf8("\x{2d}"))." byte can say as clearly, dummy.\n"'
Re: Hacker News titles using U+2013 EN DASH
by jdporter (Paladin) on Jan 09, 2024 at 15:33 UTC

    I've put in a 'fix' - won't know if it works until HN puts out a title with the offensive string in it again. Thanks!

      Thanks. Both Slashdot nodelet and HackerNews nodelet are rendering without content. I'll keep monitoring.

      — Ken

        Thanks. I hadn't done the character conversion correctly. I needed s/\x{2016}/.../. I also found three other characters in the recent feed: 2019, 201C, and 201D. If you see any others needing to be converted, please drop me a note. Thanks!

Re: Hacker News titles using U+2013 EN DASH
by bliako (Abbot) on Jan 11, 2024 at 13:49 UTC

    How about converting it to just a plain ASCII hyphen \x{2D}? Content will still be the same but more accessible and smaller.
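
A minimal sketch of that ASCII-only idea, using tr/// to fold a handful of typographic characters to their nearest ASCII equivalents. The helper name to_ascii_punct and the choice of which characters to fold are assumptions made for illustration; they are not taken from the nodelet code.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a few typographic characters to plain ASCII.
sub to_ascii_punct {
    my ($title) = @_;

    # Search list:  EN DASH, LEFT/RIGHT SINGLE QUOTATION MARK,
    #               LEFT/RIGHT DOUBLE QUOTATION MARK
    # Replacements: hyphen-minus, apostrophe (twice), double quote (twice)
    $title =~ tr/\x{2013}\x{2018}\x{2019}\x{201C}\x{201D}/\-''""/;

    return $title;
}

my $title = "Show HN: Auto Wiki \x{2013} Turn your codebase into a Wiki";
print to_ascii_punct($title), "\n";
```

The trade-off is that the distinction between opening and closing quotes, and between a dash and a hyphen, is lost.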

      Here are the conversions currently as implemented:

      s/\x{2013}/&ndash;/g; s/\x{2019}/'/g; s/\x{201C}/"/g; s/\x{201D}/"/g;

      This seems like a suboptimal approach to me. Does anyone have any better ideas?

      Today's latest and greatest software contains tomorrow's zero day exploits.

        This is a response to what you have here plus other posts throughout this thread.

        • Slashdot nodelet and HackerNews nodelet both have content now. (ref. #11156828)
        • I checked U+2013 EN DASH in a number of places: all seem to be rendered correctly. (ref. OP)
        • In #11156847 you wrote "I needed s/\x{2016}/.../.": that's a "‖" character. U+2016 DOUBLE VERTICAL LINE may be needed but, from the context, and the fact that this is only mentioned once, I wondered if this might be a typo.
        • U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are the pair "“" & "”". Instead of converting both to the same straight double quote ("), perhaps using "&ldquo;" & "&rdquo;" might be a better option. (ref. #11156847 and #11156884)
        • U+2019 RIGHT SINGLE QUOTATION MARK is perhaps being used as a fancy apostrophe; I'm not seeing an example at the time of writing. Similar to the last dot point, you might want to proactively address the potential U+2018 LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK, being the pair "‘" & "’". And in the same vein, "&lsquo;" & "&rsquo;" might be better options. (ref. #11156847 and #11156884)

        Rather than a whole bank of individual s///g, each of which needs to be run for every string, I'd be more inclined to use a lookup table and a single s///g, which only needs to be run once for every string. Something along these lines:

        $ perl -Mutf8 -C -E '
            my %ent_for_char = (
                "\x{2013}" => "&ndash;",
                "\x{2018}" => "&lsquo;",
                "\x{2019}" => "&rsquo;",
                "\x{201c}" => "&ldquo;",
                "\x{201d}" => "&rdquo;",
            );

            my $test_str = "“fancy double” – ‘fancy single’ – fancy’apostrophe";
            say $test_str;
            $test_str =~ s/(.)/exists $ent_for_char{$1} ? $ent_for_char{$1} : $1/eg;
            say $test_str;
        '
        “fancy double” – ‘fancy single’ – fancy’apostrophe
        &ldquo;fancy double&rdquo; &ndash; &lsquo;fancy single&rsquo; &ndash; fancy&rsquo;apostrophe
        

        You can modify the table (e.g. add "\x{2014}" => "&mdash;",) without requiring any changes to the code doing the processing.
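
One possible refinement, offered only as a sketch: build a character class from the table's keys so that the substitution fires only on characters that are actually in the table, rather than on every character via (.). The sub name entities_for is hypothetical, and the sketch assumes HTML entity names as the replacement values.

```perl
use strict;
use warnings;

my %ent_for_char = (
    "\x{2013}" => '&ndash;',
    "\x{2018}" => '&lsquo;',
    "\x{2019}" => '&rsquo;',
    "\x{201C}" => '&ldquo;',
    "\x{201D}" => '&rdquo;',
);

# Build one character class from the keys and compile it once.
# quotemeta keeps this safe if a key such as "]" or "^" is added later.
my $chars    = join '', map { quotemeta } keys %ent_for_char;
my $match_re = qr/([$chars])/;

# Hypothetical helper: replace each mapped character with its entity.
sub entities_for {
    my ($str) = @_;
    $str =~ s/$match_re/$ent_for_char{$1}/g;
    return $str;
}

print entities_for("\x{201C}fancy double\x{201D} \x{2013} fancy\x{2019}apostrophe"), "\n";
```

The table remains the only thing to edit when a new character turns up, since the pattern is derived from it.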

        — Ken