Update: Fixed a code typo that Abigail-II pointed out, thus making those comments seem a bit odd if you hadn't seen the typo.

Update 2: You can download a slightly updated version of the module. This version does not emit any warnings and I fixed a bug in the tests for Regexp::Token::HTML.

In my continuing quest to make regular expressions that match tokens, I've created the Regexp::Token module. The above link is to the download and not to the CPAN, as it's too early even for the CPAN. It throws tons of warnings, the documentation is not complete, and I need to change the interface to allow different matching semantics for different tokens in the same class (if desired).

A short example of how this module is used:

my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
my $p_tag   = Regexp::Token->create($p_token);

my $html = <<END_HTML;
<h1>testing</h1>
<p name="goo" class="ber">
<p CLASS=baz name='easy'>
<h1>end test</h1>
END_HTML

my ($result) = $html =~ /((?:$p_tag\n)+)/;
my $two_tags = q{<p name="goo" class="ber">
<p CLASS=baz name='easy'>
};
is($result, $two_tags, '... and we should be able to capture token text');

The above code actually works and is included in the tests, though it throws tons of warnings. Feel free to download the module, hack on it and tell me what you think.

POD follows. I'm leery of writing much more documentation until I get the interface stable, but reading the code and the tests, combined with the POD below, should clear things up (though this is some pretty strange code).


NAME

Regexp::Token - Perl extension for matching tokens instead of characters


SYNOPSIS

 my $regex = Regexp::Token->create($token);
 print $1 if $text =~ /foo(${regex})bar/;


ABSTRACT

This module allows the programmer to create arbitrary tokens and match them using regular expressions. Requires Perl 5.6 or better.


DESCRIPTION

Token Interface

Regexp::Token requires a token to be passed to its create method. This token must have at least three methods that can be called on it:

Really, it's simpler than it looks.

Sample using HTML tokens

Regexp::Token::HTML is bundled with this package. It creates tokens for HTML tags, and these tokens conform to the previously described interface. The identifier() method currently returns a string with the type of HTML tag followed by the attribute names, lower-cased and in sorted order. For example:

 <input type="text" NAME="foobar">

Will return the following identifier:

 $identifier eq 'input name type';

Thus, every HTML ``input'' tag with type and name attributes (and no other attributes) will be considered identical.
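To make the identifier rule concrete, here is a minimal sketch of how such a string could be derived from a tag. The tag_identifier() function and its regexes are my own illustration, not code from Regexp::Token::HTML, and the real module may parse tags quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Regexp::Token::HTML): build an
# identifier from the tag type plus the lower-cased, sorted attribute names.
sub tag_identifier {
    my ($tag) = @_;

    # Split the tag into its type and the raw attribute text.
    my ($type, $rest) = $tag =~ /^<\s*(\w+)(.*?)\/?>\s*$/s
        or return;

    # Collect attribute names; values (quoted or bare) are skipped.
    my @names;
    while ($rest =~ /([A-Za-z][\w:-]*)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?/g) {
        push @names, lc $1;
    }

    return join ' ', lc($type), sort @names;
}

print tag_identifier('<input type="text" NAME="foobar">'), "\n";
# prints "input name type"
```

Under this sketch, tags that differ only in attribute order, case or values collapse to the same identifier, which is the property the matching relies on.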

Here's a sample that will match a paragraph tag (values are not supplied because they are superfluous):

my $p_token = Regexp::Token::HTML->create_token('<p name="" class="">');
my $p_tag   = Regexp::Token->create($p_token);

my $html = <<END_HTML;
<h1>testing</h1>
<p name="goo" class="ber">
<p CLASS=baz name='easy'>
<h1>end test</h1>
END_HTML

my ($result) = $html =~ /((?:$p_tag\n)+)/;
my $two_tags = q{<p name="goo" class="ber">
<p CLASS=baz name='easy'>
};
is($result, $two_tags, '... and we should be able to capture token text');

Currently, it does just that, but it throws plenty of warnings in the process. I'll fix those later.

EXPORT

None by default.


SEE ALSO


AUTHOR

Curtis ``Ovid'' Poe, <poec@yahoo.com>


COPYRIGHT AND LICENSE

Copyright 2003 by Curtis ``Ovid'' Poe

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


BUGS

Lots, I'm sure. Email me.


CAVEATS

This is very alpha software. Its interface will change.

Further, this module uses fork() internally because the tokens supplied to the create() method might use regular expressions themselves. If that happens while the token is embedded in a regex, the inner regex clobbers the state of the outer one, so I fork a new process to ensure that the extra regexes don't clash.
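The general idea of isolating work in a child process can be sketched as follows. This is only an illustration under my own assumptions: run_in_child() is a hypothetical helper, and the module's actual fork() handling may look nothing like it.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT the module's code): run $code in a forked
# child and hand its return value back to the parent through a pipe.
sub run_in_child {
    my ($code, @args) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: any regexes run here cannot disturb the parent's match.
        close $reader;
        print {$writer} $code->(@args);
        close $writer;
        exit 0;
    }

    # Parent: read everything the child wrote, then reap it.
    close $writer;
    my $result = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);

    return defined $result ? $result : '';
}

my $type = run_in_child(sub { $_[0] =~ /^<(\w+)/ ? lc $1 : '' }, '<P class="x">');
print "$type\n";    # prints "p"
```

The cost is one process per call, which is part of why this belongs under caveats.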

Further, the documentation isn't complete. See the tests for more info.

Cheers,
Ovid


In reply to Regexp::Token -- Use regular expressions to match tokens by Ovid
