Dear friends,

I needed to decode XML entities that the HTML::Entities module does not know. I found a list of entities at w3.org, gathered all the definitions, and made a module.

It's a bunch of hashes with thousands of records, so I won't list it here. :-) I have never published a CPAN module before. I'm just going through perlnewmod.

The module is available on my website. Any comments, advice, and suggestions are most welcome.

Re: RFC: XML::Entities
by hossman (Prior) on Nov 18, 2007 at 02:02 UTC

    At a minimum, you should heed jdporter's suggestions in response to the snippet you already posted, particularly since you now have all those chr calls inside a function body, so using the same set twice will redo all the computation.

    Personally: I don't like duplicating data in a new format; you never know when the "authoritative" copy might change. I much prefer to have a tool to translate the data from the authoritative format to the format I want.

    In this case, instead of a module with a bunch of hardcoded data structures -- how about an XML::Entities module that knows how to parse the raw .ent files to generate the data structures? This would have the added benefit of working with all the various known entity sets and not just the one set you are currently interested in (not to mention any custom entity sets someone else might make in the future).
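As a rough illustration of parsing entity declarations straight from .ent-style text, here is a minimal sketch. The function name parse_ent and the sample declarations are made up for the example (they are not taken from the module under discussion or from a real .ent file), and the sketch only expands numeric character references inside the quoted value.

```perl
use strict;
use warnings;

# Hypothetical helper: turn <!ENTITY name "value"> declarations
# into a hash of entity name => replacement character(s).
sub parse_ent {
    my ($text) = @_;
    my %entity2char;
    while ($text =~ /<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>/g) {
        my ($name, $value) = ($1, $2);
        # Expand hexadecimal, then decimal, character references.
        $value =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
        $value =~ s/&#([0-9]+);/chr($1)/ge;
        $entity2char{$name} = $value;
    }
    return \%entity2char;
}

# Illustrative input in the style of a .ent file (assumed, not real data).
my $sample = <<'END';
<!ENTITY aacute "&#x000E1;" ><!--LATIN SMALL LETTER A WITH ACUTE -->
<!ENTITY hearts "&#9829;" >
END

my $map = parse_ent($sample);
printf "%s => U+%04X\n", $_, ord $map->{$_} for sort keys %$map;
```

A build-time script could dump the resulting hash as Perl source, which keeps the .ent files as the authoritative copy.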

    That module could be used by the Makefile.PL of other modules (named things like XML::Entity::ISO8879) to automatically download the .ent files and write the Perl data structures out as source code for fast reuse. (Or at the very least, you could use a module like the one I describe at build time to easily generate modules like the one you've already made -- but other people could use it too.)

      Yes, you're definitely right. I did automate the process of retrieving the entities from the web pages, but I actually parsed them from the .html files. :-) And I used bash, wget and perl for it, which I find much more comfortable in this case than pure perl.

      I wrapped the data in subs so that it won't get evaluated if someone is only interested in one set and not in the others. I should definitely cache the functions' return values. And are you sure it's better to add the semicolons with a map? I didn't really benchmark it, but I think my way should be a bit faster, and the code is simpler and more transparent, if twice as big. Dunno... it seems to me this is the better way, but I am certainly open to counterarguments.
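The two ideas discussed here, caching a set's hash after the first call and adding the trailing semicolons with a map, could be combined roughly as follows. This is only a sketch: the names %raw_sets, entity_set and the 'demo' set are invented for the example, and the map shown is one reading of the suggestion, not necessarily the exact code that was proposed.

```perl
use strict;
use warnings;

my %cache;        # set name => hashref, filled on first request
my $builds = 0;   # counts how often a set is actually computed

# Hypothetical raw data: each set is wrapped in a sub, so its chr calls
# only run when that set is requested. Names carry no semicolon here.
my %raw_sets = (
    demo => sub { return { aacute => chr(0xE1), hearts => chr(0x2665) } },
);

sub entity_set {
    my ($set) = @_;
    return $cache{$set} if $cache{$set};
    $builds++;
    my $raw = $raw_sets{$set}->();
    # Add the semicolon to every entity name in one pass.
    my %with_semicolon = map { ("$_;" => $raw->{$_}) } keys %$raw;
    return $cache{$set} = \%with_semicolon;
}

my $first  = entity_set('demo');
my $second = entity_set('demo');   # served from the cache
print "built $builds time(s)\n";   # prints "built 1 time(s)"
```

With the cache in place, the cost of the map is paid once per set, so the choice between map and hand-written semicolons is mostly about source size and readability.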

      Update: Looking at the .ent files, I see now that half of my effort was needless. :-) Oh well...

Re: RFC: XML::Entities
by Sixtease (Friar) on Nov 18, 2007 at 18:57 UTC

    OK, I did some of the suggested things. I perlized the download-and-parse script and made it gather the data from the .ent files. I encapsulated the thing into a module that provides a decode function using HTML::Entities inside. The semicolons are now added as jdporter suggested and the results are cached.

    Here is the new version. What do you think?

      In the Pod I notice
      Under perl 5.6 and earlier only characters in the Latin-1 range are replaced.
      which should probably say perl 5.005 instead, because perl 5.6 can deal with Unicode characters, at least with the chr(0x0100) syntax.

      But later in the code there's require 5.007; so the above sentence could be left out completely.

      I hope you're building a proper CPAN distribution before uploading it? And adding some tests?

        As I said, this is gonna be my first CPAN submission, so I may need some guidance. I'll ask once more when I think it's ready for submission. Thank you all.
Re: RFC: XML::Entities
by Sixtease (Friar) on Nov 21, 2007 at 23:23 UTC
    OK, I added tests and some documentation. Please tell me what you think now. Is it ready for publishing? New version