Update: Fixed a code typo that Abigail-II pointed out, thus making those comments seem a bit odd if you hadn't seen the typo.
Update 2: You can download a slightly updated version of the module. This version does not emit any warnings and I fixed a bug in the tests for Regexp::Token::HTML.
In my continuing quest to make regular expressions that match tokens, I've created the Regexp::Token module. The above link is to the download and not to the CPAN, as it's too early even for the CPAN. It throws tons of warnings, the documentation is not complete, and I need to change the interface to allow different matching semantics for different tokens in the same class (if desired).
A short example of how this module is used:
my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
my $p_tag   = Regexp::Token->create($p_token);

$html = <<END_HTML;
<h1>testing</h1> <p name="goo" class="ber"> <p CLASS=baz name='easy'> <h1>end test</h1>
END_HTML

my ($result) = $html =~ /((?:$p_tag )+)/;
my $two_tags = q{<p name="goo" class="ber"> <p CLASS=baz name='easy'> };
is($result, $two_tags, '... and we should be able to capture token text');
The above code actually works and is included in the tests, though it throws tons of warnings. Feel free to download the module, hack on it and tell me what you think.
POD follows. I'm leery of writing much more documentation until I get the interface stable, but reading the code and the tests, combined with the POD below, should clear things up (though this is some pretty strange code).
Regexp::Token - Perl extension for matching tokens instead of characters
my $regex = Regexp::Token->create($token);
$text =~ /foo(${regex})bar/;
print $1;
Regexp::Token requires a token to be passed to its create method. This token must have at least three methods that can be called on it:
Really, it's simpler than it looks.
Bundled with this package is Regexp::Token::HTML, which creates tokens for HTML tags. These tokens conform to the previously described interface. The identifier() method currently returns a string with the type of HTML tag followed by the attribute names, lower-cased and in sorted order. For example:
<input type="text" NAME="foobar">
Will return the following identifier:
$identifier eq 'input name type';
Thus, every HTML "input" tag with type and name attributes (and no other attributes) will be considered identical.
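To make the identifier rule concrete, here is a minimal sketch of how such a string could be derived from a tag. The helper name tag_identifier and its regexes are my own assumptions for illustration; this is not the code inside Regexp::Token::HTML, and it only handles attributes written as name=value.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Regexp::Token::HTML. Builds an identifier
# from the lower-cased tag name followed by the sorted, lower-cased
# attribute names. Attribute values are ignored.
sub tag_identifier {
    my ($tag) = @_;

    # Split the tag into its name and the rest (the attribute text).
    my ($name, $rest) = $tag =~ /^<\s*(\w+)(.*?)\/?>\s*$/s
        or return;

    # Collect attribute names: a word followed by '='. Deliberately naive;
    # a value that itself contains 'word=' would confuse it.
    my %seen;
    $seen{ lc $1 } = 1 while $rest =~ /([\w-]+)\s*=/g;

    return join ' ', lc $name, sort keys %seen;
}

print tag_identifier('<input type="text" NAME="foobar">'), "\n";
```

For the tag shown above this prints "input name type". Two tags with the same name and the same set of attribute names get the same string regardless of case, quoting style, or attribute order.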
Here's a sample that will match a paragraph tag (values are not supplied because they are superfluous):
my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
my $p_tag   = Regexp::Token->create($p_token);

my $html = <<END_HTML;
<h1>testing</h1> <p name="goo" class="ber"> <p CLASS=baz name='easy'> <h1>end test</h1>
END_HTML

my ($result) = $html =~ /((?:$p_tag )+)/;
my $two_tags = q{<p name="goo" class="ber"> <p CLASS=baz name='easy'> };
is($result, $two_tags, '... and we should be able to capture token text');
Currently, it does just that, but it throws plenty of warnings in the process. I'll fix those later.
Exports: none by default.
Curtis "Ovid" Poe, <poec@yahoo.com>
Copyright 2003 by Curtis "Ovid" Poe
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Bugs: lots, I'm sure. Email me.
This is very alpha software. Its interface will change.
Further, this module uses fork() internally because of the possibility that the tokens supplied to the create() method might call regular expressions themselves. If this happens while embedded in a regex, you'll screw up the first regex, thus forcing me to fork a new process to ensure that the extra regexes don't clash.
Further, the documentation isn't complete. See the tests for more info.
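The fork() caveat above can be illustrated with a small sketch of the general technique: run a piece of code in a child process and pass its result back through a pipe, so that any regexes it runs never touch the parent's in-progress match. The helper name run_isolated is hypothetical, and this is not the module's actual implementation, just one way the idea might look.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper, NOT taken from Regexp::Token. Runs $code in a child
# process and returns the single string it produces. Any regex matching
# done by $code happens in the child only.
sub run_isolated {
    my ($code, @args) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: do the work, write the result, and leave immediately.
        close $reader;
        my $result = $code->(@args);
        print {$writer} defined $result ? $result : '';
        close $writer;          # flushes the buffered result into the pipe
        POSIX::_exit(0);        # skip END blocks and parent-owned buffers
    }

    # Parent: read everything the child wrote, then reap the child.
    close $writer;
    my $answer = do { local $/; <$reader> };
    close $reader;
    waitpid $pid, 0;

    return defined $answer ? $answer : '';
}

# The callback uses a regex of its own, safely out of the parent's way.
my $name = run_isolated(sub { $_[0] =~ /^<(\w+)/ ? lc $1 : '' }, '<P class="x">');
print "$name\n";
```

The cost is one process per call, which is slow, and fork() is only emulated on Windows, so this is a trade of speed for safety.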
Cheers,
Ovid
New address of my CGI Course.
Re: Regexp::Token -- Use regular expressions to match tokens
by Abigail-II (Bishop) on Aug 25, 2003 at 08:04 UTC
by Ovid (Cardinal) on Aug 25, 2003 at 13:18 UTC