New Line of Modules

I'm producing a new hierarchy of modules, YAPE::, which stands for "Yet Another Parser/Extractor". These modules are designed for both tokenizing and tree-building: you can process the text chunk by chunk while it is being parsed, and once parsing is complete you can walk the resulting tree to extract specific elements.

The first YAPE:: module I wrote was ("Captain, all power diverted to shields!") YAPE::HTML, which was my attempt (and it is, in fact, a successful attempt) at producing a dependable HTML parser and tree-builder. It is quite robust, and allows you to extract tags simply, given any list of criteria:

use YAPE::HTML;
use strict;
my ($some_html,$p,$ext) = ("<html>...</html>");

$p = YAPE::HTML->new($some_html);

$ext = $p->extract(a => ['href'], img => ['src']);
while (my $tag = $ext->()) {
  # an <A> tag with an HREF attribute, or
  # an <IMG> tag with an SRC attribute
}

$ext = $p->extract(qr/^h[1-3]$/ => []);
while (my $tag = $ext->()) {
  # an <H1>, <H2>, or <H3>
}
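For anyone wondering how an extractor like `$ext` can hand back one match per call, here is a minimal core-Perl sketch of the closure-iterator idea. It is not YAPE::HTML's actual code: the tag records, `make_extractor`, and the pair-style criteria are all hypothetical stand-ins chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical tag records; YAPE::HTML's real node objects are not shown here.
my @tags = (
  { name => 'a',   attr => { href => '/home' } },
  { name => 'a',   attr => { name => 'top' } },
  { name => 'img', attr => { src => 'logo.png' } },
  { name => 'h2',  attr => {} },
);

# Build an iterator that returns the next tag meeting any criterion.
# Each criterion is a [matcher, \@required_attrs] pair; a matcher is
# either a literal tag name or a qr// pattern.
sub make_extractor {
  my ($tags, @criteria) = @_;
  my $i = 0;                        # position remembered between calls
  return sub {
    while ($i < @$tags) {
      my $tag = $tags->[$i++];
      for my $c (@criteria) {
        my ($match, $need) = @$c;
        my $name_ok = ref($match) eq 'Regexp'
          ? $tag->{name} =~ $match
          : $tag->{name} eq $match;
        next unless $name_ok;
        # Skip this criterion if any required attribute is missing.
        next if grep { !exists $tag->{attr}{$_} } @$need;
        return $tag;
      }
    }
    return;                         # exhausted: ends the caller's while loop
  };
}

my $ext = make_extractor(\@tags, [ a => ['href'] ], [ qr/^h[1-3]$/ => [] ]);
my @found;
while (my $tag = $ext->()) {
  push @found, $tag->{name};
}
print "@found\n";                   # prints "a h2"
```

The sketch takes criteria as array-ref pairs rather than a hash so that a `qr//` matcher stays a real regex object instead of being stringified as a hash key.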

The YAPE:: modules also let you intercept the tree-building process:

use YAPE::HTML;
use strict;
my ($some_html,$p,$ext) = ("<html>...</html>");

$p = YAPE::HTML->new($some_html);
while (my $chunk = $p->next) {
  if ($chunk->type('tag') and $chunk->tag('cs')) {
    print "<tt><font color='#0000ff'>";
  }
  elsif ($chunk->type('closetag') and $chunk->tag('cs')) {
    print "</font></tt>";
  }
  else { print $chunk->string }
}

And presto, you have a custom-tag filter, replacing the bogus <CS> tag with other, real tags.
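If you'd like to see the chunk-at-a-time idea without installing anything, the following is a rough core-Perl sketch of the same filter. It assumes a deliberately naive tokenizer (`chunker`, a made-up name) built on `\G` and `/gc`; it is not how YAPE::HTML is implemented, and it ignores comments, scripts, and `>` inside attribute values.

```perl
use strict;
use warnings;

# Return an iterator over [type, text] chunks of $html.
# Naive on purpose: iteration simply stops at a stray '<' with no closing '>'.
sub chunker {
  my ($html) = @_;
  pos($html) = 0;
  return sub {
    # /gc keeps pos() where it was when a match fails, so we can try
    # each alternative in turn from the same spot.
    return [ closetag => $1 ] if $html =~ m{\G(</[^>]*>)}gc;
    return [ tag      => $1 ] if $html =~ m{\G(<[^>]*>)}gc;
    return [ text     => $1 ] if $html =~ m{\G([^<]+)}gc;
    return;
  };
}

# Replace the bogus <cs> tag with real tags, passing everything else through.
sub filter_cs {
  my ($html) = @_;
  my $next = chunker($html);
  my $out  = '';
  while (my $chunk = $next->()) {
    my ($type, $text) = @$chunk;
    if ($type eq 'tag' and $text =~ m{^<cs\b}i) {
      $out .= "<tt><font color='#0000ff'>";
    }
    elsif ($type eq 'closetag' and $text =~ m{^</cs\b}i) {
      $out .= "</font></tt>";
    }
    else { $out .= $text }
  }
  return $out;
}

print filter_cs("Run <cs>perl -v</cs> now."), "\n";
# prints: Run <tt><font color='#0000ff'>perl -v</font></tt> now.
```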

The next module I wrote was YAPE::Regex (along with YAPE::Regex::Reverse, which I started in the past couple of days and which is about half complete). YAPE::Regex has already gained some "infamy", and a couple of people are using it here and there.

Then came YAPE::MathExpr. This isn't done yet (nowhere near done, actually). Once it is complete, I will write YAPE::MathExpr::Derive, for all your mathematical derivation needs.

And now I've completed YAPE::POD. (Yes, I know there is a Pod:: hierarchy of modules, but I was on a roll, and it was also something I've been meaning to write for a long time.) It creates a tree of the POD (and also returns the elements one at a time, if you'd like), and allows you to do extractions similarly to YAPE::HTML. It also allows you to make your own breed of POD -- your filter uses a YAPE::POD object and then goes through displaying each node however you'd like, which means that you can make N<...> mean something. You can also make N<...> raise an error, like the standard YAPE::POD:: extensions will (since it's not valid markup).

If you want to scoff at me for doing this, go right ahead. Give me a -- if you really feel the need. But I find it very useful to have a group of parsing modules that behave the same way, and have a very similar syntax -- they are all YAPE:: modules, so they look and feel the same. (I do, however, accept "why haven't you documented the damn things yet?" scoffing, since I haven't documented the damn things yet.)

The modules are currently being developed on my laptop, but I upload the current (not necessarily working) version to my web site every now and then: http://www.pobox.com/~japhy/YAPE/. If you're interested in helping, or if you have a suggestion for a YAPE:: module, or if you have any questions, comments, concerns, or flaming criticisms, I'd be glad to hear from you.

Disclaimer: maybe I'll be the only person who uses these modules, but I don't particularly care -- I'm not trying to disrupt the current modules, I'm just creating another way to do it, in the name of Perl fun. And besides, I'm learning from doing this.

Thanks for your time.

japhy -- Perl and Regex Hacker


Re: New Line of Modules by Dominus (Parson) on Jan 06, 2001 at 22:29 UTC
Says japhy: "YAPE::HTML, which was my attempt at producing a dependable HTML parser and tree-builder."

Why would I want to use YAPE::HTML instead of HTML::TreeBuilder?
Re: Re: New Line of Modules by japhy (Canon) on Jan 07, 2001 at 01:32 UTC
Using my module, I can reproduce the HTML I was given, suppress the printing of certain tags, and print tags only down to a given level:

use YAPE::HTML;

$p = YAPE::HTML->new($CONTENT);
1 while $p->next;  # to build the tree

@exclude = qw( a img );
$level = 2;
for ($p->root) {  # top-level elements
  print $_->fullstring(\@exclude, $level);
}

I'm adding a feature so that it accepts either an exclude list or an "exclude all except..." list. If the previous code was given:

<b>Hi <i>there <s>folks!</s></i></b>
<br><br>
Visit my <img src="naked_woman.jpg"><a href="http://www.pornsite.com/">porn site</a>!

then the output would be:

<b>Hi <i>there folks!</i></b>
<br><br>
Visit my porn site!

The <a> and <img> tags were removed because they are in the exclude list, and the <s> tag was removed because it was 3 layers down and I requested only 2 layers.

japhy -- Perl and Regex Hacker
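The exclude-list and level behaviour described above can be sketched in plain Perl over a toy tree. This is only an illustration under stated assumptions: `render` and the hash-based node layout are hypothetical, attributes and empty elements such as `<br>` are left out, and nothing here claims to match how `fullstring` works internally.

```perl
use strict;
use warnings;

# Hypothetical tree: a text node is a plain string; an element is
# { tag => ..., kids => [...] }. Attributes are omitted for brevity.
sub render {
  my ($node, $exclude, $max_depth, $depth) = @_;
  $depth ||= 1;                           # top-level elements are depth 1
  return $node unless ref $node;          # text is always kept
  my $inner = join '',
    map { render($_, $exclude, $max_depth, $depth + 1) } @{ $node->{kids} };
  # Drop the tag itself, but keep its contents, if excluded or too deep.
  return $inner if $exclude->{ $node->{tag} } or $depth > $max_depth;
  return "<$node->{tag}>$inner</$node->{tag}>";
}

my @root = (
  { tag => 'b', kids => [
      'Hi ',
      { tag => 'i', kids => [ 'there ', { tag => 's', kids => ['folks!'] } ] },
  ] },
  ' Visit my ',
  { tag => 'a', kids => ['site'] },
  '!',
);

my %exclude = map { $_ => 1 } qw( a img );
my $out = join '', map { render($_, \%exclude, 2) } @root;
print "$out\n";
# prints: <b>Hi <i>there folks!</i></b> Visit my site!
```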