I'm producing a new hierarchy of modules, YAPE::, which stands for "Yet Another Parser/Extractor". These modules are designed for both tokenizing and tree-building: you can process the text piece by piece as it is parsed, and, once the whole text has been parsed, walk the tree to extract specific elements.
The first YAPE:: module I wrote was ("Captain, all power diverted to shields!") YAPE::HTML, which was my attempt (and it is, in fact, a successful attempt) at producing a dependable HTML parser and tree-builder. It is quite robust, and allows you to extract tags simply, given any list of criteria:
use YAPE::HTML;
use strict;

my ($some_html,$p,$ext) = ("<html>...</html>");
$p = YAPE::HTML->new($some_html);

$ext = $p->extract(a => ['href'], img => ['src']);
while (my $tag = $ext->()) {
    # an <A> tag with an HREF attribute, or
    # an <IMG> tag with an SRC attribute
}

$ext = $p->extract(qr/^h[1-3]$/ => []);
while (my $tag = $ext->()) {
    # an <H1>, <H2>, or <H3>
}
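The `$ext->()` calls above suggest that `extract` hands back a closure which yields one matching tag per call and a false value once nothing is left. Here is a minimal, self-contained sketch of that iterator pattern in plain Perl. It does not use YAPE::HTML at all: `make_extractor`, the flat list of tag hashes, and the way criteria are matched are hypothetical, chosen only for illustration, and say nothing about the module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed tree: a flat list of tags.
my @tags = (
    { name => 'a',   attr => { href => 'http://www.perl.org/' } },
    { name => 'a',   attr => { name => 'top' } },
    { name => 'img', attr => { src  => 'camel.png' } },
    { name => 'h2',  attr => {} },
);

# make_extractor(\@nodes, $matcher => \@required_attrs, ...)
# Returns a closure; each call yields the next matching node,
# or undef when the list is exhausted.
sub make_extractor {
    my ($nodes, @criteria) = @_;
    my $i = 0;
    return sub {
        while ($i < @$nodes) {
            my $node = $nodes->[$i++];
            for (my $c = 0; $c < @criteria; $c += 2) {
                my ($want, $attrs) = @criteria[$c, $c + 1];
                # A criterion is either a literal tag name or a qr// object.
                my $name_ok = ref($want) eq 'Regexp'
                    ? $node->{name} =~ $want
                    : $node->{name} eq $want;
                next unless $name_ok;
                # Every required attribute must be present.
                next if grep { !exists $node->{attr}{$_} } @$attrs;
                return $node;
            }
        }
        return;
    };
}

my $ext = make_extractor(\@tags, a => ['href'], img => ['src']);
while (my $tag = $ext->()) {
    print "matched <$tag->{name}>\n";
}
```

The criteria are passed as a flat list rather than a hash so that a `qr//` object stays a regex instead of being stringified into a hash key. With the sample data, the loop prints `matched <a>` and `matched <img>`; the second `<a>` is skipped because it has no `href`.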
It also allows you to intercept the tree-building process:
use YAPE::HTML;
use strict;

my ($some_html,$p,$ext) = ("<html>...</html>");
$p = YAPE::HTML->new($some_html);

while (my $chunk = $p->next) {
    if ($chunk->type('tag') and $chunk->tag('cs')) {
        print "<tt><font color='#0000ff'>";
    }
    elsif ($chunk->type('closetag') and $chunk->tag('cs')) {
        print "</font></tt>";
    }
    else { print $chunk->string }
}
And presto, you have a custom-tag filter, replacing the bogus <CS> tag with other, real tags.
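For anyone who wants to try the idea without the module, the same chunk-by-chunk filtering can be sketched in plain Perl. This is only an approximation: the regex tokenizer below is a crude stand-in for a real parser (it ignores comments, quoted `>` characters, and so on), and `filter_cs` is a hypothetical helper name, not part of YAPE::HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the made-up <cs> tag while walking the
# input one chunk (tag or text run) at a time.
sub filter_cs {
    my ($html) = @_;
    my $out = '';
    # Each iteration consumes a tag, a run of text, or a stray '<'.
    while ($html =~ m{\G(?:(<(/?)(\w+)[^>]*>)|([^<]+)|(<))}gc) {
        my ($tag, $close, $name, $text, $stray) = ($1, $2, $3, $4, $5);
        if (defined($tag) and lc($name) eq 'cs') {
            # Swap the bogus tag for real markup.
            $out .= $close ? "</font></tt>" : "<tt><font color='#0000ff'>";
        }
        else {
            # Everything else passes through untouched.
            $out .= defined($tag)  ? $tag
                  : defined($text) ? $text
                  :                  $stray;
        }
    }
    return $out;
}

print filter_cs("<p>Call <cs>foo()</cs> first.</p>"), "\n";
```

This prints `<p>Call <tt><font color='#0000ff'>foo()</font></tt> first.</p>`. The `\G` anchor with `/gc` makes each match start exactly where the previous one ended, which is what gives the loop its token-at-a-time behavior.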
The next module I wrote was YAPE::Regex (with the recently half-completed YAPE::Regex::Reverse being written in the past couple of days). This has already gained some "infamy", and is being used by a couple of people here and there.
Then came YAPE::MathExpr. This isn't done yet (not nearly done, actually). Following its completion, I will have a YAPE::MathExpr::Derive, for all your mathematical differentiation needs.
And now I've completed YAPE::POD. (Yes, I know there is a Pod:: hierarchy of modules, but I was on a roll, and it was also something I've been meaning to write for a long time.) It creates a tree of the POD (and also returns the elements one at a time, if you'd like), and allows you to do extractions similarly to YAPE::HTML. It also allows you to make your own breed of POD -- your filter uses a YAPE::POD object and then goes through displaying each node however you'd like, which means that you can make N<...> mean something. You can also make N<...> raise an error, like the standard YAPE::POD:: extensions will (since it's not valid markup).
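One way to picture "make N<...> mean something" is a dispatch table with one handler per formatting code. The sketch below is plain Perl and does not use YAPE::POD: `%handler`, `render`, and the HTML output are all assumptions made for illustration, and only flat (non-nested) codes are handled.

```perl
use strict;
use warnings;

# Hypothetical dispatch table: one handler per POD formatting code.
my %handler = (
    B => sub { "<b>$_[0]</b>" },
    I => sub { "<i>$_[0]</i>" },
    N => sub { "<sup>$_[0]</sup>" },   # our invented N<...> code
);

# Render one line of POD text using the given table. A code with no
# handler raises an error, the way a strict filter might.
sub render {
    my ($text, $table) = @_;
    $text =~ s{([A-Z])<([^<>]*)>}{
        my $h = $table->{$1} or die "Unknown formatting code $1<...>\n";
        $h->($2);
    }ge;
    return $text;
}

print render("See B<this> N<a footnote>.", \%handler), "\n";
```

This prints `See <b>this</b> <sup>a footnote</sup>.`. Removing the `N` entry from the table turns the same input into a fatal error, which mirrors the choice described above between giving N<...> a meaning and rejecting it as invalid markup.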
If you want to scoff at me for doing this, go right ahead. Give me a -- if you really feel the need. But I find it very useful to have a group of parsing modules that behave the same way, and have a very similar syntax -- they are all YAPE:: modules, so they look and feel the same. (I do, however, accept "why haven't you documented the damn things yet?" scoffing, since I haven't documented the damn things yet.)
The modules are currently being developed on my laptop, but I upload the current (not necessarily working) version to my web site every now and then: http://www.pobox.com/~japhy/YAPE/. If you're interested in helping, or if you have a suggestion for a YAPE:: module, or if you have any questions, comments, concerns, or flaming criticisms, I'd be glad to hear from you.
Disclaimer: maybe I'll be the only person who uses these modules, but I don't particularly care -- I'm not trying to disrupt the current modules, I'm just creating another way to do it, in the name of Perl fun. And besides, I'm learning from doing this.
Thanks for your time.
japhy --
Perl and Regex Hacker