Munging with a regex dispatch table

Solo has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to solve a problem similar to Precompiled Reg Exps. This is for decoupling the parsing and processing of a mess of log files. The whole process would look like

command(s) | parse.pl | process.pl

where process.pl is one of a few possible processing scripts.

For parse.pl, I'm pondering how to best apply a dispatch table approach. Here is what I have now (simplified example).

use strict;
use warnings;

use YAML;

my $header    = qr/^(?!START:|END:)([^*: ][^:]+):(.+)\n/;
my $starttime = qr/^START:\s*(.+)\n/;
my $endtime   = qr/^END:\s*(.+)\n/;
my $lineitem  = qr/^(?:\* |  )([^:]+):(\d+):(.+)\n/;

my %dispatch = (
    $header    => \&header,
    $starttime => sub { $_[0]->{starttime} = $_[1] },
    $endtime   => sub { $_[0]->{endtime}   = $_[1] },
    $lineitem  => sub { push @{shift->{items}}, \@_ },
);

sub header {
    my $r = shift;
    process_record($r);  # first time will be empty record...
    %$r = ();            # clear the record
    $r->{name} = $_[0];
    $r->{desc} = $_[1];
}

my $r = {}; # 'record'
LINE: while (my $line = <DATA>) {
     for my $re (keys %dispatch) {
         my @m = ();
         if (@m = $line =~ /$re/) {
             $dispatch{$re}->($r, @m);
             next LINE;
         }
     }
}
process_record($r);

sub process_record {
    my $r = shift;
    print YAML::Dump($r);
}

__DATA__
The first record:it has only 1 item
START: Tue Feb  1 00:09:30 2005
END: Tue Feb  1 00:19:32 2005
  Item1:10:comment1
The next record:might have more items
START: Tue Feb  1 00:39:07 2005
END: Tue Feb  1 00:42:46 2005
  Item1:4:comment2
* Item2:1:comment3
[download]

I'm happy to have removed the if/elsif madness, but now the dispatch table definition looks almost as ugly. How would you 'clean' it up? I've considered pulling some bits into a module or building the dispatch table out of a config file. The config file worries me because it would require putting data and code in the config. I don't see how a module could reduce the code substantially since only the dispatch loop is reusable. Thoughts?

--Solo

--
You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.

Comment on Munging with a regex dispatch table Download Code

Replies are listed 'Best First'.
Re: Munging with a regex dispatch table by Tanktalus (Canon) on Feb 01, 2005 at 21:10 UTC
I would first start with the observation that keys are strings - not objects. And qr creates an object. Which happens to stringify into something that can recreate the object. It's DWIMmery, for sure, but ... not quite deterministic. What this looks like to me is actually an AoH: `my @dispatch_table = ( { re => qr/^(?!START:\|END:)([^: ][^:]+):(.+)\n/, action => \&header, }, { re => qr/^START:\s(.+)\n/, action => sub { $_[0]->{starttime} = $_[1] }, }, { re => qr/^END:\s(.+)\n/, action => sub { $_[0]->{endtime} = $_[1] }, }, { re => qr/^(?:\ \| )([^:]+):(\d+):(.+)\n/, action => sub { push @{shift->{items}}, \@_ }, }, );` [download] What this does is put your re and action close together, it also puts the order of execution completely in your control (right now you're relying on keys to give you the order, which may be unrelated to the order you insert them), and you can also put other fields in there for further actions, data, whatever, that you may need to. For example, two re's may go to the same code ref, but you could pass extra args that were different for each, and these extra args could be here. `foreach my $h (@dispatch_table) { if (my @m = $line =~ $h->{re}) { $h->{action}->($r, @m); # or $h->{action}->($r, @m, $h->{args}); next LINE; } }` [download] IMO, that's cleaner and more flexible. YMMV.	[reply] [d/l] [select]
Re^2: Munging with a regex dispatch table by Fletch (Bishop) on Feb 01, 2005 at 21:17 UTC
IMO, that's cleaner and more flexible. YMMV. And also preserves the ordering of the rules, which could be important. E.g. you want to be able to shortcut processing of a record by setting a flag (a la `$File::Find::prune = 1`) it'd be nice to know that the rules you're trying to skip over won't get run before the pruning one just because of the hashing order. Update: Oop, you mentioned execution order. Never mind me. %/	[reply] [d/l]
Re: Munging with a regex dispatch table by Fletch (Bishop) on Feb 01, 2005 at 21:04 UTC
If you're worried about looks you could define a prototyped sub which takes a block, key, and the hash to which to add. `sub add_action (&\%$) { my( $code, $h, $key ) = @_; $h->{$key} = $code; } my %dispatch; add_action { do_something_with_foo } %dispatch => qr/foo/;` [download] As an aside, you might want to look at Tie::RefHash which will keep the compiled regexen as keys rather than stringifying them (they'll get recompiled when you use them `/$re/`, but it's kinda inefficient).	[reply] [d/l] [select]
Re: Munging with a regex dispatch table by nmerriweather (Friar) on Feb 01, 2005 at 22:33 UTC
slightly offtopic, but have you condsidered cleaning up your regexp list as such: `my %REGEX = ( header = qr/^(?!START:\|END:)([^: ][^:]+):(.+)\n/, starttime = qr/^START:\s(.+)\n/, endtime = qr/^END:\s(.+)\n/, lineitem = qr/^(?:\ \| )([^:]+):(\d+):(.+)\n/, );` [download] then you call it as: $REGEX{'header'} functionally its identical, but when editing code later on, i find it way easier to see whats going on	[reply] [d/l]
Re^2: Munging with a regex dispatch table by Anonymous Monk on Feb 02, 2005 at 03:04 UTC
It's not the same thing. He's using the regex as a key with the corresponding code/action as the value to that key. He could write: `my %REGEX = ( header = [qr/^(?!START:\|END:)([^: ][^:]+):(.+)\n/, \&header], starttime = [qr/^START:\s(.+)\n/, \&etc], endtime = [qr/^END:\s(.+)\n/, \&etc], lineitem = [qr/^(?:\ \| )([^:]+):(\d+):(.+)\n/, \&etc] ); #... #You could even use constants here to make things easier. # 0 = RE; 1 = FUNCTION; or something. for my $rex_table (keys %REGEX) { if (my @m = $line =~ $rex_table->[0]) { $rex_table->[1]->($r,@m); next LINE; } }` [download] Which I, personally, think is preferable. The implementation of the dispatch table depends really on implementation. I like to have the name of the action as a 'key', so that the code's self-documenting, but you could also use the array aproach previously shown.	[reply] [d/l]
Re^3: Munging with a regex dispatch table by Tanktalus (Canon) on Feb 02, 2005 at 05:15 UTC
No doubt that this is an improvement over the original. But I think that someone should point out the negatives of this as well. First, to reiterate the positive: IMO, this table is much better than the original, if only to eliminate the qr from the keys. But it also keeps the regexp close to the function that it dispatches to. However, I would not recommend this dispatch table for a couple of reasons. First, it's not really a table. Dispatch tables are called tables for a reason, I think. HoA is not a table - although AoH is a table (with named columns), and AoA is a table (with numbered columns). If you were to add another field which was some sort of ordering that could be used to sort the keys, then your HoA would be a column (with ordered/orderable rows). I prefer names to the columns. Makes things easier to see. Also makes things easier to modify. And thus easier to maintain. Better than constants since the "column" names are in the structure, not external to it. Which is easier to read: `[ qr/^(?!START:\|END:)([^: ][^:]+):(.+)\n/, \&header ]` [download] ... or ... `{ re => qr/^(?!START:\|END:)([^: ][^:]+):(.+)\n/, action => \&header }` [download] Somewhat of a tossup. Now, let's add four more items (columns) to the table. Or, given 10 rows, let's add an optional item to half of them. Those hashes are flexible. Use 'em. I'm not really convinced that the keys you have here are really very useful - we don't do a quick lookup of a particular entry, and we don't really use them for anything else. They could really be named a, b, 12, and fourty_two, and nothing would change. (Well, you may get a different pseudo-random order from keys. But that could change just by changing perl versions...) Here an array makes much more sense. Oh - and you're missing the >'s in each entry - should be `header => ...` ;-)	[reply] [d/l] [select]
Re: Munging with a regex dispatch table (with Regexp::Assemble) by grinder (Bishop) on Feb 02, 2005 at 22:40 UTC
I don't see how a module could reduce the code substantially since only the dispatch loop is reusable. You can sweep a lot of the nitty-gritty details under the carpet by using a module I wrote :) use Regexp::Assemble; my $ra = Regexp::Assemble->new->track(1); my %dispatch = ( '^(?!START:\|END:)([^: ][^:]+):(.+)\n' => sub { my $r = shift; process_record($r); # first time will be empty record... %$r = (); # clear the record $r->{name} = $ra->mvar(1); $r->{desc} = $ra->mvar(2); }, '^START:\s(.+)\n' => sub { $_[0]->{starttime} = $ra->mvar(1) }, '^END:\s(.+)\n' => sub { $_[0]->{endtime} = $ra->mvar(1) }, '^(?:\ \| )([^:]+):(\d+):(.+)\n' => sub { my @captures = @{$ra->mvar()}; push @{$_[0]->{items}}, [@captures[1..$#captures]], }, ); # add all the regular expressions to the R::A object $ra->add( keys %dispatch ); my $r = {}; # 'record' LINE: while (my $line = <DATA>) { # if we have a match, take the initial regular expression that match +ed # and use it as a key into our dispatch table and call its value. $ra->match($line) and &$dispatch{$ra->matched}->($r); } process_record($r); [download] I would be inclined to chomp $line, and remove the \n's from your expressions, but that's just personal taste. The trick is that when you enable tracking on a Regexp::Assemble object, when a match is made, it is able to recover the initial pattern that triggered the match. Therefore, you can use it as a key into a dispatch table, and things will Just Work. You have to use the `mvar` method to recover the values of $1, $2, $3... etc, because they are lost in scope by the time control flow gets back to your code. The first element returned by `mvar()` without a specfific parameter is another way of recovering the original pattern; it's a byproduct of walking down the `@-` and `@+` arrays at the time of the match. ponders which leads me to conclude that there should be a method call that returns only the $1..$n array of captures... - another intruder with the mooring in the heart of the Perl	[reply] [d/l]
Re^2: Munging with a regex dispatch table (with Regexp::Assemble) by Solo (Deacon) on Feb 03, 2005 at 15:49 UTC
For some silly reason I thought all Regexp:: modules worked like Rexexp::Common! Thanks for pointing out yours (and the others), grinder. --Solo -- You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.	[reply]