in reply to Hacker News titles using U+2013 EN DASH

How about converting it to a plain ASCII hyphen, \x{2D}? The content stays the same but becomes more accessible and smaller.
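A minimal sketch of that suggestion, assuming U+2013 is the only character that needs handling: tr/// maps the en dash to an ASCII hyphen-minus on a copy of the title. The helper name ascii_dash is hypothetical and not taken from the site's code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the title with every
# U+2013 EN DASH replaced by an ASCII hyphen-minus (\x{2D}).
sub ascii_dash {
    my ($title) = @_;
    (my $copy = $title) =~ tr/\x{2013}/-/;
    return $copy;
}

print ascii_dash("Foo \x{2013} Bar"), "\n";
```

The trade-off is that the typographic distinction between a dash and a hyphen is lost in the converted title.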

Re^2: Hacker News titles using U+2013 EN DASH
by jdporter (Paladin) on Jan 11, 2024 at 15:41 UTC

    Here are the conversions as currently implemented:

    s/\x{2013}/&ndash;/g; s/\x{2019}/'/g; s/\x{201C}/"/g; s/\x{201D}/"/g;

    This seems like a suboptimal approach to me. Does anyone have any better ideas?

    Today's latest and greatest software contains tomorrow's zero day exploits.

      This is a response to what you have here plus other posts throughout this thread.

      • Slashdot nodelet and HackerNews nodelet both have content now. (ref. #11156828)
      • I checked U+2013 EN DASH in a number of places: all seem to be rendered correctly. (ref. OP)
      • In #11156847 you wrote "I needed s/\x{2016}/.../.": that's a "‖" character. U+2016 DOUBLE VERTICAL LINE may be needed but, from the context, and the fact that this is only mentioned once, I wondered if this might be a typo.
      • U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are the pair "“" & "”". Instead of converting both to """, perhaps using "&ldquo;" & "&rdquo;" might be a better option. (ref. #11156847 and #11156884)
      • U+2019 RIGHT SINGLE QUOTATION MARK is perhaps being used as a fancy apostrophe; I'm not seeing an example at the time of writing. Similar to the last dot point, you might want to proactively address the potential U+2018 LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK, being the pair "‘" & "’". And in the same vein, "&lsquo;" & "&rsquo;" might be better options. (ref. #11156847 and #11156884)

      Rather than a whole bank of individual s///g, each of which needs to be run for every string, I'd be more inclined to use a lookup table and a single s///g, which only needs to be run once for every string. Something along these lines:

      $ perl -Mutf8 -C -E '
          my %ent_for_char = (
              "\x{2013}" => "&ndash;",
              "\x{2018}" => "&lsquo;",
              "\x{2019}" => "&rsquo;",
              "\x{201c}" => "&ldquo;",
              "\x{201d}" => "&rdquo;",
          );
      
          my $test_str = "“fancy double” – ‘fancy single’ – fancy’apostrophe";
          say $test_str;
          $test_str =~ s/(.)/exists $ent_for_char{$1} ? $ent_for_char{$1} : $1/eg;
          say $test_str;
      '
      “fancy double” – ‘fancy single’ – fancy’apostrophe
      &ldquo;fancy double&rdquo; &ndash; &lsquo;fancy single&rsquo; &ndash; fancy&rsquo;apostrophe
      

      You can modify the table (e.g. add "\x{2014}" => "&mdash;",) without requiring any changes to the code doing the processing.
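
      One possible variation, as a minimal sketch only: build a character class from the table's keys so that the substitution fires only on characters that have an entry, instead of matching every character with (.). The helper name entities_for is hypothetical, and the table values assume the replacements are meant to be HTML entity names; adjust both to whatever the real code uses.

      ```perl
      use strict;
      use warnings;

      # Lookup table: character => replacement (entity names assumed).
      my %ent_for_char = (
          "\x{2013}" => '&ndash;',
          "\x{2018}" => '&lsquo;',
          "\x{2019}" => '&rsquo;',
          "\x{201C}" => '&ldquo;',
          "\x{201D}" => '&rdquo;',
      );

      # Turn the keys into \x{....} escapes and wrap them in a character
      # class, so the regex is derived from the table automatically.
      my $class = join '', map { sprintf('\x{%04X}', ord($_)) } sort keys %ent_for_char;
      my $re    = qr/([$class])/;

      # Hypothetical helper: returns a converted copy of the string.
      sub entities_for {
          my ($str) = @_;
          $str =~ s/$re/$ent_for_char{$1}/g;
          return $str;
      }

      my $title = "\x{201C}fancy\x{201D} \x{2013} it\x{2019}s";
      print entities_for($title), "\n";
      # prints: &ldquo;fancy&rdquo; &ndash; it&rsquo;s
      ```

      Adding a row to the table still needs no change to the processing code, because the character class is rebuilt from the keys each time the script starts.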

      — Ken

        Thanks! Those are great ideas. I will take them. :-)

        You were right about 2016, that was a typo.