Module Design Ideas - HTML Entity Reference

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I started out wanting a name for a very simple module, since none seem to have publicly-available features that are near enough to what I want, but the information itself seems to be present in several if not many modules.

ambrus made me think about other things my simple "reference" might handle (see post Re: What should I call my module?). I'm also wondering about the functionality mentioned by tchrist in his charnames code.

I'm thinking that information about the "entities" should be in one place and usable by all, for any purpose. Why does the (long) list itself have to be duplicated so many times?

The essential characteristic is that "entities exist".

The entity is nothing more than a name for a Unicode character. Everything else having to do with it is attached to the character, and should be something I can find in the Unicode database and related Unicode Perl stuff.

Common uses for the list of entities would be, I would think, to convert to the actual character, or convert to another encoded form such as the ordinal in decimal or in hex, and to simply test whether the name is in fact a valid entity.

So I'm supposing that the most fundamental thing is a map of names to code point numbers. I mean the number itself (an integer), not some string representation of the number in hex or decimal or decorated with some other escape system.

Perhaps mapping from the entity name to a string containing one character is the same thing, but I feel an integer is cleaner. It is one step, using other Perl built-in functions, to any of the common things we want. Starting with a char, the first thing you need to do is convert to a number if you want one of the other forms. The integer is the center of the graph of what forms can be transformed to what.
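To make the "integer at the center" point concrete, here is a minimal sketch using only Perl built-ins. The starting value 233 (U+00E9) is just an assumed example of what the entity table would hand back.

```perl
use strict;
use warnings;

# Assumption: a code point obtained from the entity table; 233 is U+00E9.
my $ord = 233;

my $char    = chr($ord);                  # the character itself
my $decimal = sprintf('&#%d;',  $ord);    # decimal numeric character reference
my $hex     = sprintf('&#x%X;', $ord);    # hexadecimal numeric character reference
my $u_plus  = sprintf('U+%04X', $ord);    # conventional Unicode notation

# Starting from a character instead, ord() is needed first
# before any of the numeric forms can be produced.
my $back = ord($char);
```

Each target form is one built-in call away from the integer, whereas starting from the character always costs an extra `ord`.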

The brute-force approach is simply to provide a hash. But it might be more powerful to provide a more abstract function-call API. If it knows all the entities ever used by web browsers, not just the official list, it could decode an unrecognized entity on the assumption that it meant something to somebody at some point, and find it in the obscure lists. So a lookup function could take arguments to indicate which lists it should try.
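A rough sketch of what such a lookup could look like. Everything here is hypothetical: the function name `lookup_ordinal`, the `lists` option, the table names, and the handful of sample entries are placeholders, not an existing API.

```perl
use strict;
use warnings;

# Hypothetical tables with a few sample entries; real lists would be far longer.
my %W3C_Entities   = (eacute => 0xE9, sup1 => 0xB9, amp => 0x26);
my %Extra_Entities = (apos => 0x27);   # illustrative placement: XML's apos is not in HTML 4

my %Tables = (w3c => \%W3C_Entities, extra => \%Extra_Entities);

# lookup_ordinal($name, lists => [...]) tries each named list in order and
# returns the first code point found, or undef if no list has the name.
sub lookup_ordinal {
    my ($entity, %opt) = @_;
    my @lists = @{ $opt{lists} || ['w3c'] };   # default: official list only
    for my $list (@lists) {
        my $table = $Tables{$list} or next;    # silently skip unknown list names
        return $table->{$entity} if exists $table->{$entity};
    }
    return undef;
}
```

Called as `lookup_ordinal('apos')` this finds nothing, while `lookup_ordinal('apos', lists => ['w3c', 'extra'])` falls through to the second table.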

That's the first thing I'm looking for suggestions/discussion about. Consider this rough draft:

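sub ordinal
 {
 my ($entity)= @_;  # TBA: options?
 return $W3C_Entities{$entity};  # undef if the name is not a known entity
 # TODO:  look in other lists if options indicate
 }

sub character
 {
 my $ord= ordinal (@_);
 return defined $ord ? chr($ord) : undef;  # don't chr() an unknown entity
 }

sub hex  # note: shares its name with the built-in hex
 {
 my $ord= ordinal (@_);
 return defined $ord ? sprintf ("%04x", $ord) : undef;
 }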

Asking for the exact result wanted means it can abstract out how the information is actually stored (only stored once). A function could easily take a list of things to process all at once and return a list of results, and it could take options somehow, indicating which lists to consider and how to deal with errors. Both features together might make the function rather complex, and I'm hesitant to write a complex function to do something as simple as a hash lookup! Do I really need to process a list of inputs if the caller can call it in a loop (or map) just as easily? Only if the overhead of calling and processing the options is significant, I might suppose.
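For comparison, here is how the caller-side loop looks with a single-name function. This is only a sketch: the three-entry `%W3C_Entities` is a stand-in for the full table.

```perl
use strict;
use warnings;

my %W3C_Entities = (eacute => 0xE9, sup1 => 0xB9, amp => 0x26);  # sample entries only

sub ordinal {
    my ($entity) = @_;
    return $W3C_Entities{$entity};   # undef for unknown names
}

my @names = qw(eacute sup1 bogus);

# One call per name; unknown names come back as undef.
my @ords = map { ordinal($_) } @names;

# Or keep only the names that resolve.
my @known = grep { defined ordinal($_) } @names;
```

With `map` and `grep` this short, a list-taking variant seems to buy little unless per-call option handling turns out to be expensive.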

The other thing is how to do the inverse. Given a number, how to find the named entity if one exists? I didn't find a bimap hash tie in CPAN. Just creating a reversed hash from the first is easy, but it makes me worry that hashes in Perl are keyed by string, not by integer. The number to be looked up has to be in a canonical form or it won't find it. The function API can massage it and be less error prone than using the hash directly: just use 0+$arg to force numeric first.
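A minimal sketch of the reversed hash behind a small accessor. The names `%Entity_For` and `entity_name` are made up for illustration, and the table holds sample entries only.

```perl
use strict;
use warnings;

my %W3C_Entities = (eacute => 0xE9, sup1 => 0xB9, amp => 0x26);  # sample entries only

# Build the inverse once. Each number becomes its decimal string form as a key.
# If several names share a code point, which one survives is unspecified.
my %Entity_For = reverse %W3C_Entities;

sub entity_name {
    my ($ord) = @_;
    # 0+ canonicalizes 233, "0233" and "233.0" to the same key, "233".
    # A hex string such as "0xE9" would need hex() first; 0+ does not parse it.
    return $Entity_For{ 0 + $ord };
}
```

Looking up `$Entity_For{"0233"}` directly would miss, which is the error the accessor is meant to prevent.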

Finally, I looked at tchrist's module, which contains a map with lines like this:

"sup1" => "SUPERSCRIPT ONE",

Why is such a list necessary? The list that says "these are the entity names that exist" and "they are aliases for which character" should be sufficient for the entity-related purposes, and the value here, "SUPERSCRIPT ONE", is already associated with that character in the Perl Unicode information. After all, \N{…} has to look it up somewhere. It's a table join, in database terminology. My proposed "table" is the most normal one. tchrist's map can be generated automatically from that, if it is necessary to cache the results, or looked up with the integer hash followed by a call to charnames::viacode.
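The join can be sketched in a few lines. `charnames::viacode` is the real accessor; the two-entry `%W3C_Entities` and the derived `%Unicode_Name` are assumptions for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

my %W3C_Entities = (eacute => 0xE9, sup1 => 0xB9);  # sample entries only

# Derive "entity name => Unicode character name" from the integer table.
# scalar() keeps the key/value pairing intact whatever viacode returns.
my %Unicode_Name = map { $_ => scalar charnames::viacode($W3C_Entities{$_}) }
                   keys %W3C_Entities;
```

The derived hash is only a cache; a single lookup can just as well call `charnames::viacode` on the stored integer when needed.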

So, I'm all ears. How can I make a useful and truly reusable building block?

—John

update: used actual function name supplied by the ever-prolific ikegami.


Re: Module Design Ideas - HTML Entity Reference by ikegami (Patriarch) on May 10, 2011 at 12:38 UTC
Is this what you're talking about:

>perl -E"use charnames ':full'; say charnames::viacode(0x00E9)"
LATIN SMALL LETTER E WITH ACUTE

>perl -E"use charnames ':full'; say sprintf '%04X', charnames::vianame('LATIN SMALL LETTER E WITH ACUTE')"
00E9
Re^2: Module Design Ideas - HTML Entity Reference by John M. Dlugosz (Monsignor) on May 10, 2011 at 23:59 UTC
You mean in the last paragraph, concerning tchrist's table? Yes. I updated my post to mention the function name. You mean the main point of the discussion? No, that was a throw-away line. I'm still interested in discussing the possibilities and wisdom of a generally useful way to prevent multiple repeats of the same list.
Re^3: Module Design Ideas - HTML Entity Reference by ikegami (Patriarch) on May 11, 2011 at 00:03 UTC
Using existing accessors DOES avoid the need to repeat the list.
Re: Module Design Ideas - HTML Entity Reference by Anonymous Monk on May 10, 2011 at 09:54 UTC
Have you seen Unicode::Char?
Re^2: Module Design Ideas - HTML Entity Reference by John M. Dlugosz (Monsignor) on May 10, 2011 at 23:53 UTC
I had not seen that. I don't care for it though: Why is `$u->u5c0f;`, after creating `$u`, which generates a different method for every character the first time it is used, any better or easier than just writing `"\x{5c0f}"`? You would only use the literal method name call if you had the character code already as a compile-time literal, so what is the possible point? It also has `$u->name`, which seems to be identical to using \N{}. So why did you mention this module? Are you just mentioning things that show up on a keyword search without seeing what they actually are (which is what it gives the impression of; sorry if that's not the case), or is there something you want to discuss about this approach or the lessons it can teach? If that's the case, please elaborate.