John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I'm making a simple module that just contains all the HTML Character Entities, with two-way lookup. What should I call it? HTML::Entity-Reference? But does the dash cause problems? Maybe it should include a '4' somewhere, as it's based on the character entity references in HTML 4, and I see HTML::HTML5::Parser::NamedEntityList on CPAN, which is not what I want. Did the list of Entities change for HTML5? That module contains duplicate entries without trailing semicolons for some names (but not all), so I don't know what that's about. It also has all-caps AMP instead of amp.

I'm planning a map of (only correct!) Entity names mapped to character numbers as integers. This gives the simplest access to the information for a variety of uses, including spitting out the char itself (using chr) or a numeric entity (formatting the number as a string or a hex string), and also looking up names based on the ord of a character.
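A minimal sketch of the plan described above, assuming a small illustrative subset of the HTML 4 names. The hash names and the helper subs (`is_entity`, `numeric_entity`, `name_for_char`) are hypothetical and not part of any released module.

```perl
use strict;
use warnings;

# Illustrative subset only (assumption): a real module would carry the
# complete HTML 4 list.
my %entity_to_code = (
    quot => 34,  amp   => 38,  lt   => 60,  gt   => 62,
    nbsp => 160, pound => 163, sup2 => 178, euro => 8364,
);

# Reverse map, keyed by code point, for lookups that start from ord($char).
my %code_to_entity = reverse %entity_to_code;

# Validation: only exact, case-sensitive names are accepted.
sub is_entity { return exists $entity_to_code{ $_[0] } ? 1 : 0 }

# Format a numeric character reference, decimal by default, hex on request.
sub numeric_entity {
    my ($name, $hex) = @_;
    return unless exists $entity_to_code{$name};
    return sprintf( $hex ? '&#x%X;' : '&#%d;', $entity_to_code{$name} );
}

# Look up the entity name for a single character, or undef if there is none.
sub name_for_char { return $code_to_entity{ ord $_[0] } }

print numeric_entity('pound'), "\n";                      # &#163;
print numeric_entity('euro', 1), "\n";                    # &#x20AC;
print name_for_char( chr 163 ), "\n";                     # pound
print is_entity('AMP') ? "ok" : "not an entity", "\n";    # not an entity
```

Storing integers keeps every use one step away: `chr` for the character, `sprintf` for a numeric reference, and `exists` for validation.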

I want to consider issues of global reuse before I let it escape — er, I mean before I release it.

Replies are listed 'Best First'.
Re: What should I call my module?
by GrandFather (Saint) on May 08, 2011 at 03:14 UTC
      One of those contains functions to do the translation but doesn't provide access to a list. One of my uses isn't to translate at all but to validate. The other is emblazoned with "do not use".

        One of those (that does not include the notice "DO NOT USE THIS MODULE DIRECTLY") includes the text:

        The module can also export the %char2entity and the %entity2char hashes, which contain the mapping from all characters to the corresponding entities (and vice versa, respectively).
        True laziness is hard work
Re: What should I call my module?
by tchrist (Pilgrim) on May 08, 2011 at 03:27 UTC
    I have a unicore/html_alias.pl (amongst others) to be used in conjunction with the charnames pragma, like this:
    use charnames ":alias" => ":html"; print "\N{pound}\N{sup2}";
    It’ll probably make it into the 5.15 development cycle. The start of it looks like this:
    ###############################################################
    #
    # File "html_alias.pl" containing aliases for use with
    #
    #     use charnames ":alias" => ":html";
    #
    # Each section in table below is grouped and sorted not by alias
    # name but rather so one can visually locate a desired character.
    #
    # To view table sorted by key, see unused DATA section below.
    #
    ###############################################################

    use utf8;
    use strict;
    use warnings qw[ FATAL all ];

    # "return" is to quiet perl -wc
    return (

    # Number aliases: these are \p{Other_Number}
        "sup1"   => "SUPERSCRIPT ONE",                # ¹ U+00B9
        "sup2"   => "SUPERSCRIPT TWO",                # ² U+00B2
        "sup3"   => "SUPERSCRIPT THREE",              # ³ U+00B3
        "frac12" => "VULGAR FRACTION ONE HALF",       # ½ U+00BD
        "frac14" => "VULGAR FRACTION ONE QUARTER",    # ¼ U+00BC
        "frac34" => "VULGAR FRACTION THREE QUARTERS", # ¾ U+00BE

    # Currency sign aliases: \p{Currency_Symbol}
        "curren" => "CURRENCY SIGN",                  # ¤ U+00A4
        "cent"   => "CENT SIGN",                      # ¢ U+00A2
        "pound"  => "POUND SIGN",                     # £ U+00A3
        "yen"    => "YEN SIGN",                       # ¥ U+00A5
        "euro"   => "EURO SIGN",                      # € U+20AC
    Does that look useful to you?
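Until a ":html" alias set is available, the same effect can be approximated with the inline alias hash that the charnames pragma already documents. This is only a sketch: the three aliases are copied from the excerpt above, and the hash stands in for the proposed html_alias.pl file rather than reproducing it.

```perl
use strict;
use warnings;

# Inline aliases (assumption: a stand-in for the proposed ":html" set).
use charnames ":full", ":alias" => {
    pound => "POUND SIGN",
    sup2  => "SUPERSCRIPT TWO",
    euro  => "EURO SIGN",
};

# The aliases resolve at compile time, exactly like full Unicode names.
my $text = "\N{pound}\N{sup2}";

# Print code points rather than raw characters to avoid encoding issues.
printf "U+%04X U+%04X\n", map { ord($_) } split //, $text;    # U+00A3 U+00B2
```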
      Maybe... I was thinking that the most generally accessible way to have the data would be to map the entity name to an integer. Also have a reverse lookup. For validating, it doesn't matter what's in the value since I'll just check for key existence.

      In order to use your list for anything other than the charnames construct or perhaps displaying a readable form, you would have to look up the value to resolve the actual character or code number.

      I'm all for having only one list of all the Entities stored somewhere for all code to draw upon. Maybe it should present a lookup API that can return any or all of those items and how it's stored internally is opaque.

      What do you think?
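One way the opaque lookup idea could look, as a rough sketch. The sub name `entity_info` and the keys of the hash it returns are hypothetical; `charnames::viacode` is the core function that turns a code point into its Unicode name, which is the value the html_alias.pl table stores.

```perl
use strict;
use warnings;
use charnames ();    # core module; supplies charnames::viacode

# Internal storage is private to this file; callers see only entity_info().
# Illustrative subset only (assumption).
my %CODE_FOR = ( amp => 38, pound => 163, sup2 => 178, euro => 8364 );

# Hypothetical API: returns undef for an unknown name, otherwise a hash ref
# holding every representation a caller might want.
sub entity_info {
    my ($name) = @_;
    return unless exists $CODE_FOR{$name};
    my $code = $CODE_FOR{$name};
    return {
        code         => $code,
        char         => chr($code),
        unicode_name => charnames::viacode($code),
    };
}

my $info = entity_info('pound');
printf "pound: U+%04X %s\n", $info->{code}, $info->{unicode_name};
# pound: U+00A3 POUND SIGN
```

Because callers never touch `%CODE_FOR`, the storage could later be switched to Unicode names, or to a shared data file, without changing the interface.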

Re: What should I call my module?
by JavaFan (Canon) on May 08, 2011 at 12:54 UTC
    What should I call it? HTML::Entity-Reference? But does the dash cause problems?
    If one does:
    use HTML::Entity-Reference;
    this is equivalent to:
    BEGIN { require HTML::Entity; HTML::Entity->import("-Reference"); }
    CamelCase is the usual convention when having module names composed of multiple words:
    package HTML::EntityReference;
    An underscore works as well:
    package HTML::Entity_Reference;
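The parsing described above can be observed directly. In this sketch a stand-in HTML::Entity package is defined inline (an assumption made purely for the demonstration; it is not a real CPAN module) and simply records whatever its import method is handed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an installed HTML::Entity module; it only
# records the arguments passed to its import method.
{
    package HTML::Entity;
    our @import_args;
    sub import { my $class = shift; @import_args = @_; return }
}

# Pretend HTML/Entity.pm is already loaded so "use" does not search @INC.
BEGIN { $INC{'HTML/Entity.pm'} = __FILE__ }

use HTML::Entity-Reference;    # parsed as: use HTML::Entity (-Reference);

print "import received: @HTML::Entity::import_args\n";
# import received: -Reference
```

So the dash does not produce a module named HTML::Entity-Reference at all; it silently loads a different module and passes it an argument.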
Re: What should I call my module?
by ambrus (Abbot) on May 08, 2011 at 15:46 UTC

    Will this module have an option to support not only standard html entities but also nonstandard historically used ones like [ or ő, some of which almost no modern browsers accept? There's a list of more than two thousand entities in the mozilla source, plus a list of about a thousand entities in the elinks source. I'm just curious.

      See my reply to tchrist. If it uses an API rather than just exporting a hash, it could contain additional parameters to the calls, and be more flexible. If the entities have multiple "authorities" they can be provided as additional arguments.

      Thanks for bringing that up.
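A sketch of how "authorities" might be passed as additional arguments. The authority labels (`html4`, `legacy`), the sub name `entity_code`, and the two legacy entries are illustrative assumptions; they are not taken from the mozilla or elinks lists.

```perl
use strict;
use warnings;

# One table per authority. Contents are a small illustrative subset.
my %TABLE = (
    html4  => { amp  => 38, pound  => 163, euro => 8364 },
    legacy => { lsqb => 91, odblac => 337 },    # hypothetical entries
);

# Hypothetical API: look a name up in the listed authorities, in order.
# With no authorities given, only the HTML 4 table is consulted.
sub entity_code {
    my ($name, @authorities) = @_;
    @authorities = ('html4') unless @authorities;
    for my $authority (@authorities) {
        my $table = $TABLE{$authority} or next;    # skip unknown authorities
        return $table->{$name} if exists $table->{$name};
    }
    return;
}

print defined entity_code('lsqb') ? "found" : "not found", "\n";    # not found
print entity_code('lsqb', 'html4', 'legacy'), "\n";                 # 91
```

Strict validation stays the default, and callers that need the historical names opt in explicitly.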