Update: Fixed a code typo that Abigail-II pointed out, thus making those comments seem a bit odd if you hadn't seen the typo.

Update 2: You can download a slightly updated version of the module. This version does not emit any warnings and I fixed a bug in the tests for Regexp::Token::HTML.

In my continuing quest to make regular expressions that match tokens, I've created the Regexp::Token module. The above link is to the download and not to the CPAN as it's too early even for the CPAN. It throws tons of warnings, the documentation is not complete and I need to change the interface to allow different matching semantics from different tokens in the same class (if desired).

A short example of how this module is used:

 my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
 my $p_tag   = Regexp::Token->create($p_token);
  
 my $html = <<END_HTML;
 <h1>testing</h1>
 <p name="goo" class="ber"> <p CLASS=baz name='easy'> 
 <h1>end test</h1>
 END_HTML
 my ($result) = $html =~ /((?:$p_tag )+)/;

 my $two_tags = q{<p name="goo" class="ber"> <p CLASS=baz name='easy'> };
 is($result, $two_tags, '... and we should be able to capture token text');

The above code actually works and is included in the tests, though it throws tons of warnings. Feel free to download the module, hack on it and tell me what you think.

POD follows. I'm leery of writing much more documentation until I get the interface stable, but reading the code and the tests, combined with the POD below should clear things up (though this is some pretty strange code).

NAME
SYNOPSIS
ABSTRACT
DESCRIPTION

Token Interface
Sample using HTML tokens
EXPORT

SEE ALSO
AUTHOR
COPYRIGHT AND LICENSE
BUGS
CAVEATS

NAME

Regexp::Token - Perl extension for matching tokens instead of characters

SYNOPSIS

 my $regex = Regexp::Token->create($token);
 $text =~ /foo(${regex})bar/;
 print $1;

ABSTRACT

This module allows the programmer to create arbitrary tokens and match them using regular expressions. Requires Perl 5.6 or better.

DESCRIPTION

Token Interface

Regexp::Token requires a token to be passed to its create method. This token must have at least three methods that can be called on it:

to_string()

exact

identifier

create_token

new

identifier

to_string

Really, it's simpler than it looks.

Sample using HTML tokens

Bundled with this package is Regexp::Token::HTML, which creates tokens for HTML tags. These tokens conform to the previously described interface. The identifier() method currently returns a string with the type of HTML tag followed by the attribute names, lower-cased and in sorted order. For example:

 <input type="text" NAME="foobar">

Will return the following identifier:

 $identifier eq 'input name type';

Thus, every HTML ``input'' tag with type and name attributes (and no other attributes) will be considered identical.
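To make the identifier rule concrete, here is a minimal sketch of how such a string could be computed. It is not the code from Regexp::Token::HTML: the function name tag_identifier and the attribute-matching regex are assumptions made for illustration, and it only handles simple, well-formed start tags.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Regexp::Token::HTML.
# Builds "tagname attr1 attr2 ..." with everything lower-cased
# and the attribute names sorted; attribute values are ignored.
sub tag_identifier {
    my ($tag) = @_;

    # Split the tag into its name and the raw attribute text.
    my ($name, $attr_text) = $tag =~ /\A<\s*([A-Za-z][A-Za-z0-9]*)([^>]*)>\z/
        or return;

    # Collect attribute names, skipping over quoted or bare values.
    my %seen;
    while ($attr_text =~ /([A-Za-z_:][-A-Za-z0-9_:.]*)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?/g) {
        $seen{ lc $1 } = 1;
    }

    return join ' ', lc $name, sort keys %seen;
}

print tag_identifier('<input type="text" NAME="foobar">'), "\n";
# prints "input name type"
```

Under this rule, the two paragraph tags used in the tests, with different case, quoting and attribute order, produce the same identifier, which is why a single token can match both.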

Here's a sample that will match a paragraph tag (values are not supplied because they are superfluous):

  my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
  my $p_tag   = Regexp::Token->create($p_token);

  my $html = <<END_HTML;
  <h1>testing</h1>
  <p name="goo" class="ber"> <p CLASS=baz name='easy'>
  <h1>end test</h1>
  END_HTML
  my ($result) = $html =~ /((?:$p_tag )+)/;

  my $two_tags = q{<p name="goo" class="ber"> <p CLASS=baz name='easy'> };
  is($result, $two_tags, '... and we should be able to capture token text');

Currently, it does just that, but it throws plenty of warnings in the process. I'll fix those later.

EXPORT

None by default.

AUTHOR

Curtis ``Ovid'' Poe, <poec@yahoo.com>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

BUGS

Lots, I'm sure. Email me.

CAVEATS

This is very alpha software. Its interface will change.

Further, this module uses fork() internally because of the possibility that the tokens supplied to the create() method might call regular expressions themselves. If this happens while embedded in a regex, you'll screw up the first regex, thus forcing me to fork a new process to ensure that the extra regexes don't clash.
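For readers unfamiliar with the technique, the following is a rough sketch of isolating regex work in a child process. It is not taken from the module's source: the helper name run_in_child is made up for illustration, and it assumes a platform where Perl's forking open ('-|') works.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's actual implementation.
# Runs $code in a forked child and returns whatever the child prints.
sub run_in_child {
    my ($code) = @_;

    # Forking open: the parent gets a handle that reads the child's STDOUT.
    my $pid = open(my $from_child, '-|');
    die "Cannot fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: any regexes run here live in a separate process,
        # so they cannot disturb a match in progress in the parent.
        print $code->();
        exit 0;
    }

    # Parent: slurp the child's output, then close the handle to reap it.
    my $output = do { local $/; <$from_child> };
    close $from_child;
    return $output;
}

my $tag  = '<P class="x">';
my $name = run_in_child(sub { $tag =~ /<(\w+)/ ? lc $1 : '' });
print "$name\n";
# prints "p"
```

The cost is one process per call, which is presumably acceptable for alpha code but worth keeping in mind when matching many tokens.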

Further, the documentation isn't complete. See the tests for more info.

Cheers,
Ovid

New address of my CGI Course.

Re: Regexp::Token -- Use regular expressions to match tokens by Abigail-II (Bishop) on Aug 25, 2003 at 08:04 UTC
 my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
 my $p_tag   = Regexp::Token->create($p_tag);

Is the last line a typo and should the argument to `Regexp::Token::create` be `$p_token` or is there some voodoo magic in `Regexp::Token`?

Abigail
Re: Re: Regexp::Token -- Use regular expressions to match tokens by Ovid (Cardinal) on Aug 25, 2003 at 13:18 UTC
You're right, that's just me making a bit of a typo. It was in the docs and the tests, but I've fixed both and uploaded a new distribution. Thanks for the catch.

Cheers,
Ovid

New address of my CGI Course.