I started out wanting a name for a very simple module. No existing module seems to expose features close enough to what I want, yet the underlying information is present in several modules, if not many.

ambrus made me think about other things my simple "reference" might handle (see post Re: What should I call my module?). I'm also wondering about the functionality mentioned by tchrist in his charnames code.

I'm thinking that information about the "entities" should be in one place and usable by all, for any purpose. Why does the (long) list itself have to be duplicated so many times?

The essential characteristic is that "entities exist".

The entity is nothing more than a name for a Unicode character. Everything else having to do with it is attached to the character, and should be something I can find in the Unicode database and related Unicode Perl stuff.

Common uses for the list of entities would be, I would think, to convert to the actual character, or convert to another encoded form such as the ordinal in decimal or in hex, and to simply test whether the name is in fact a valid entity.

So I'm supposing that the most fundamental thing is a map of names to code point numbers. I mean the number itself (an integer), not some string representation of the number in hex or decimal or decorated with some other escape system.

Perhaps mapping from the entity name to a string containing one character is the same thing, but I feel an integer is cleaner. It is one step, using other Perl built-in functions, to any of the common things we want. Starting with a char, the first thing you need to do is convert to a number if you want one of the other forms. The integer is the center of the graph of what forms can be transformed to what.
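
A minimal sketch of the "integer at the center" idea: each of the common forms is one built-in call away from the code point. The %entity_ord hash and its three entries are only a stand-in for the full list, and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for the real list: entity name => code point (a plain integer).
my %entity_ord = (
    amp  => 38,
    nbsp => 160,
    sup1 => 185,
);

my $ord = $entity_ord{sup1};

my $char    = chr($ord);                  # the actual character
my $decimal = sprintf('&#%d;', $ord);     # decimal character reference
my $hexref  = sprintf('&#x%X;', $ord);    # hexadecimal character reference
my $valid   = exists $entity_ord{sup1};   # is the name a known entity?
```

Starting from a one-character string instead, each of the numeric forms would need an extra ord() call first.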

The brute-force approach is simply to provide a hash. But it might be more powerful to provide a more abstract function-call API. If the module knows all the entities ever used by web browsers, not just the official list, it could decode an unfamiliar entity on the assumption that it meant something to somebody at some point, and find it in the more obscure lists. So a lookup function could take arguments indicating which lists it should try.

That's the first thing I'm looking for suggestions/discussion about. Consider this rough draft:

    sub ordinal {
        my ($entity) = @_;    # TBA: options?
        return $W3C_Entities{$entity};
        # TODO: look in other lists if options indicate
    }

    sub character {
        my $ord = ordinal(@_);
        return chr($ord);
    }

    # Note: the name clashes with the built-in hex; call as &hex(...) or rename.
    sub hex {
        my $ord = ordinal(@_);
        return sprintf("%04x", $ord);
    }

Asking for the exact result wanted lets the module abstract away how the information is actually stored (and store it only once). A function could easily take a list of things to process all at once and return a list of results, and it could take options somehow, indicating which lists to consider and how to deal with errors. Both features together might make the function rather complex, and I'm hesitant to write a complex function to do something as simple as a hash lookup! Do I really need to process a list of inputs if the caller can call it in a loop (or map) just as easily? Only if the overhead of calling and processing the options is significant, I suppose.
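
To gauge how complex the combined version would really get, here is one possible shape, offered only as a sketch. The options hash, its 'lists' key, the %Legacy_Entities table, and the function name ordinals are all hypothetical; only %W3C_Entities comes from the draft above.

```perl
use strict;
use warnings;

# Stand-in tables; the real ones would hold the complete lists.
my %W3C_Entities    = ( amp => 38, lt => 60, gt => 62 );
my %Legacy_Entities = ( apos => 39 );    # hypothetical "obscure" list

my %Lists = ( w3c => \%W3C_Entities, legacy => \%Legacy_Entities );

# ordinals(\%options, @names): one result per name, undef where not found.
# The 'lists' option names the tables to search, in order (default: w3c only).
sub ordinals {
    my ($opt, @names) = @_;
    my @wanted = @{ $opt->{lists} || ['w3c'] };
    my @results;
    for my $name (@names) {
        my $ord;    # stays undef if no list knows the name
        for my $list (@wanted) {
            my $table = $Lists{$list} or die "unknown list '$list'\n";
            if (exists $table->{$name}) {
                $ord = $table->{$name};
                last;
            }
        }
        push @results, $ord;
    }
    return @results;
}

my @ords = ordinals({ lists => [qw(w3c legacy)] }, qw(amp apos bogus));
```

Under these assumptions the function stays under twenty lines, and a caller who wants only one name can still pass a single argument.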

The other thing is how to do the inverse. Given a number, how do I find the named entity, if one exists? I didn't find a bimap hash tie on CPAN. Just creating a reversed hash from the first is easy, but hashes in Perl are keyed by string, not by integer, which worries me: the number to be looked up has to be in canonical form or the lookup will miss. The function API can massage it and be less error-prone than using the hash directly: just use 0+$arg to force it numeric first.
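
A minimal sketch of that inverse lookup, assuming a small stand-in %entity_ord table; the names %ord_entity and entity_name are hypothetical.

```perl
use strict;
use warnings;

my %entity_ord = ( amp => 38, nbsp => 160, sup1 => 185 );    # stand-in data

# Build the inverse once. Each key is the default string form of the integer.
my %ord_entity = reverse %entity_ord;

# entity_name($number): the entity name for a code point, or undef if none.
sub entity_name {
    my ($arg) = @_;
    my $key = 0 + $arg;    # "0185", "185.0" and 185 all become 185
    return $ord_entity{$key};
}
```

Note that reverse keeps only one name per code point, so if two entity names share a code point, the caller gets just one of them back.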

Finally, I look at tchrist's module, which contains a map with lines like this:

    "sup1" => "SUPERSCRIPT ONE",

Why is such a list necessary? A list saying "these are the entity names that exist" and "this is the character each one is an alias for" should be sufficient for the entity-related purposes, and the value here, "SUPERSCRIPT ONE", is already associated with that character in the Perl Unicode information. After all, \N{…} has to look it up somewhere. It's a table join, in database terminology. My proposed "table" is the most normalized one. tchrist's map can be generated automatically from that, if it is necessary to cache the results, or looked up with the integer hash followed by a call to charnames::viacode.
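
A sketch of that "join", assuming the same kind of stand-in %entity_ord table; %entity_uniname is a hypothetical name for the generated map. charnames::viacode is the standard function that returns the Unicode name for a code point.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the module that provides charnames::viacode

my %entity_ord = ( amp => 38, nbsp => 160, sup1 => 185 );    # stand-in data

# Derive entity name => Unicode character name from the integer table.
my %entity_uniname;
$entity_uniname{$_} = charnames::viacode($entity_ord{$_}) for keys %entity_ord;

# Or skip the cache and join on demand:
my $name = charnames::viacode($entity_ord{sup1});
```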

So, I'm all ears. How can I make a useful and truly reusable building block?

—John

update: used actual function name supplied by the ever-prolific ikegami.


In reply to Module Design Ideas - HTML Entity Reference by John M. Dlugosz
