Thanks to kind advice given by various nice folks around here, I've been working on a different approach. This seems pretty fast, but is it fast enough, I wonder? This version only deals with standard < > tags.

#!/usr/bin/perl
use Modern::Perl;
use Time::HiRes qw( gettimeofday tv_interval );

my $qd;
$qd->{'a'}      = 'b';
$qd->{'b'}      = 'c';
$qd->{'c'}      = 21;
$qd->{'action'} = "default";
$qd->{'lang'}   = "en";

my $knowncmds = "(qd|double|use)";

say aXML_eval(qq@<html lang="<qd a="b">lang</qd>">
<head>
<title><qd a="b" b="c">action</qd></title>
</head>
<body>
orphan data 1
<double><qd><qd><qd>a</qd></qd></qd></double>
<use>actions/<qd>action</qd>/body.aXML</use>
orphan data 2
</body></html>@ );

sub aXML_qd     { $qd->{$_[1]}; }
sub aXML_double { $_[1]*2; }
sub aXML_use    { "opened file :: $_[1]"; }

sub aXML_eval {
    my $aXML = shift;

    $aXML =~ s@</@{endtag}@gs;
    $aXML =~ s@/@{bs}@gs;
    $aXML =~ s@{endtag}@</@gs;
    $aXML =~ s@:@{colon}@gs;
    $aXML =~ s@;@{semicolon}@gs;

    my @chars          = split //, $aXML;
    my $level          = 0;
    my $dropping_level;
    my $dropping       = 0;
    my $dropnext       = 0;
    my @finalletters;
    foreach my $char (@chars) {
        if ($dropnext) { $dropping = 1; $dropnext = 0; }
        if ($char eq "<") { $level++; }
        if ($char eq ">") { $level--; }
        if ($char eq "/") {
            unless ($dropping) {
                $dropping_level = $level;
                $dropnext       = 1;
            }
        }
        if ($dropping) {
            if ($level < $dropping_level) { $dropping = 0; }
        }
        push(@finalletters, $char) unless $dropping;
    }
    $aXML = join('', @finalletters);

    while ($aXML =~ m@<([^<>]+?)>(.*?)</>@gs) {
        my $original_tag    = $1;
        my $original_data   = $2;
        my $original_string = "<$1>$2</>";
        my $replace_string;
        if ($original_tag =~ m@([^\s]+?)\s(.*)@gs) {
            my $tag_name = $1;
            my $tag_attr = $2;
            $tag_attr =~ s@(.*?)="(.*?)"@ $1 =` '$2',@gs;
            chop($tag_attr);
            $replace_string = ";$tag_name:{$tag_attr },'$original_data':;";
        }
        else {
            $replace_string = ";$original_tag:{},'$original_data':;";
        }
        $aXML =~ s@$original_string@$replace_string@gs;
    }

    $aXML =~ s@:;@\)\]@gs;
    $aXML =~ s@;([^:;]+?):@\[$1\(@gs;
    $aXML =~ s@`@>@gs;
    $aXML =~ s@{bs}@/@gs;
    $aXML =~ s@{colon}@:@gs;
    $aXML =~ s@{semicolon}@;@gs;

    say "save this to disc...\n";
    say "$aXML\n\n\n";
    say "then process to get this...\n\n";

    my $start = [ gettimeofday ];
    while ($aXML =~ m@\[([^\[\]]*?)\]@gs) {
        my $tag             = $1;
        my $original_string = "\[$1\]";
        my $replace_string  = "";
        if ($tag =~ m@(.*)\(\{(.*?)\},'(.*)'\)@gs) {
            my $cmd  = $1;
            my $args = $2;
            my $data = $3;
            if ($cmd =~ m@$knowncmds@s) {
                eval("\$replace_string = aXML_$tag;");
            }
            else {
                my $new_args;
                if ($args =~ m@\S@g) {
                    my $arg_string = "\$temp_arg_hash = \{ $args \}";
                    my $temp_arg_hash;
                    eval($arg_string);
                    while (my ($key, $val) = each(%$temp_arg_hash)) {
                        $new_args .= " $key=\"$val\"";
                    }
                    $args = $new_args;
                }
                $replace_string = "<$cmd$args>$data</$cmd>";
            }
        }
        $original_string =~ s@\(@\\\(@gs;
        $original_string =~ s@\)@\\\)@gs;
        $original_string =~ s@\[@\\\[@gs;
        $original_string =~ s@\]@\\\]@gs;
        $original_string =~ s@\{@\\\{@gs;
        $original_string =~ s@\}@\\\}@gs;
        $aXML =~ s@$original_string@$replace_string@gs;
    }
    my $end           = [ gettimeofday ];
    my $total_elapsed = tv_interval($start, $end);
    $aXML .= "\n\nprocessing took : $total_elapsed seconds";
    return $aXML;
}

Re: Fast enough yet?
by chromatic (Archbishop) on Aug 05, 2011 at 22:15 UTC

    You'll always have scaling problems when you write code like:

    $aXML =~ s@</@{endtag}@gs;
    $aXML =~ s@/@{bs}@gs;
    $aXML =~ s@{endtag}@</@gs;
    $aXML =~ s@:@{colon}@gs;
    $aXML =~ s@;@{semicolon}@gs;
    ...
    $aXML =~ s@:;@\)\]@gs;
    $aXML =~ s@;([^:;]+?):@\[$1\(@gs;
    $aXML =~ s@`@>@gs;
    $aXML =~ s@{bs}@/@gs;
    $aXML =~ s@{colon}@:@gs;
    $aXML =~ s@{semicolon}@;@gs;
    ...
    while ($aXML =~ m@<([^<>]+?)>(.*?)</>@gs) { ... }
    ...
    while ($aXML =~ m@\[([^\[\]]*?)\]@gs) { ... }

    By my count, you have to scan the aXML at least thirteen times to process it once, and some of those regular expressions have backtracking, so they'll end up scaling very badly too. For short aXML documents (a few dozen lines), it may be fast enough, but you'll start to notice performance degrade dramatically with documents of over a hundred lines.
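    As a sketch of how some of those scans could be merged (this is not from the original post): the first three substitutions only exist to protect the `</` closing marker while bare `/` becomes `{bs}`, so all five escape passes can be folded into one left-to-right scan driven by a lookup hash, with the longest alternative listed first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: fold the five escape substitutions into one scan.
# Alternation is ordered longest-first so '</' wins over bare '/',
# which replaces the original {endtag} escape/restore round-trip.
my %map = (
    '</' => '</',          # closing-tag marker passes through untouched
    '/'  => '{bs}',
    ':'  => '{colon}',
    ';'  => '{semicolon}',
);

my $aXML = '<use>actions/<qd>a:b;c</qd></use>';
$aXML =~ s@(</|/|:|;)@$map{$1}@g;
print "$aXML\n";    # <use>actions{bs}<qd>a{colon}b{semicolon}c</qd></use>
```

    One scan of the string instead of five, and no temporary placeholders to restore afterwards.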

    With that said, this approach is more promising:

    my @chars = split //, $aXML;
    ...
    foreach my $char (@chars) {
        ...
    }

    ... because it scales linearly with the size of the document. Perl 5's not super fast at processing strings character-by-character, but if you can write a state machine and decide what kind of Perl data structure to build at every state change of the document, you're much better off in terms of performance. This is what a lexer and grammar do when talking about compilers or custom languages. (You can even identify places where you don't have enough information to decide what to do right then, as in the case of your extension system—but you can encode that in your data structure and during evaluation decide what to do when you know what you need to know.)
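    A minimal sketch of that kind of state machine, assuming a hypothetical tokenizer that only distinguishes text, open tags, and close tags (the name tokenize_aXML and the token hashref shapes are invented for illustration, not part of the post's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pass over the characters; the current state decides what each
# character means, and a completed token is pushed at each state change.
sub tokenize_aXML {
    my ($src)  = @_;
    my @tokens;
    my $state  = 'text';   # 'text' or 'tag'
    my $buffer = '';
    for my $char (split //, $src) {
        if ($state eq 'text') {
            if ($char eq '<') {
                push @tokens, { type => 'text', value => $buffer } if length $buffer;
                $buffer = '';
                $state  = 'tag';
            }
            else { $buffer .= $char; }
        }
        else {  # inside a tag
            if ($char eq '>') {
                my $type = ($buffer =~ s@^/@@) ? 'close' : 'open';
                push @tokens, { type => $type, value => $buffer };
                $buffer = '';
                $state  = 'text';
            }
            else { $buffer .= $char; }
        }
    }
    push @tokens, { type => 'text', value => $buffer } if length $buffer;
    return \@tokens;
}

my $toks = tokenize_aXML('a <qd>action</qd> b');
printf "%s:%s\n", $_->{type}, $_->{value} for @$toks;
```

    From a token stream like this you can build the tree (or any Perl data structure) directly, instead of rewriting the string over and over.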

    Higher-Order Perl and SICP both describe how to handle this.

    Incidentally, this is why people often say "Don't use regular expressions to parse _____!" — not because it's impossible to do, but because regular expressions really don't let you identify the state of individual items within a document in a way amenable to handling them correctly.

      A state machine that does it all in one pass has been my holy grail for like 4 years now. Btw.. when I say I've been working on it for 4 years that's not like 4 years of sustained effort or anything, it's a good coding session every now and then interspersed with a lot of other things!

      Surely this $aXML =~ s@;([^:;]+?):@\[$1\(@gs; is the only backtracker there, and it also comes prior to the "save this to disc then process" marker, so it only gets run once while the page is still in its raw aXML state, not every time the page is requested.

      That leaves this while ($aXML =~ m@\[([^\[\]]*?)\]@gs) { ... }

      Which makes as many passes as it needs to decode the structure, in what I visualise as a sort of 3D way, with the innermost tags being the highest peaks, getting mown down one tag height at a time until the document is flat.

      Hrm, maybe I could save the $level data into an array to determine character positions for processing... that would help!
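      That idea might look something like this sketch (bracket_spans is a hypothetical name, not from the posted code): during the one character scan, push the position of every opening bracket onto a stack; each closing bracket then pops its match, yielding every span innermost-first with no rescanning.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One scan, stack of '[' positions; each ']' pops its partner, so the
# spans come out innermost-first and the string is never rescanned.
sub bracket_spans {
    my ($src) = @_;
    my (@stack, @spans);
    my $pos = 0;
    for my $char (split //, $src) {
        if    ($char eq '[') { push @stack, $pos; }
        elsif ($char eq ']') {
            my $start = pop @stack;
            push @spans, [ $start, $pos ] if defined $start;
        }
        $pos++;
    }
    return \@spans;
}

my $spans = bracket_spans('[a[b][c]]');
print join(' ', map { "($_->[0],$_->[1])" } @$spans), "\n";    # (2,4) (5,7) (0,8)
```

      Processing the spans in that order is exactly the "mow down the innermost peaks first" picture, without the repeated m@\[([^\[\]]*?)\]@ passes.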

        You wouldn't have to scan the entire structure for every invocation if you returned ASTs from extensions rather than strings that needed parsing, or if you parsed the strings returned from extensions into their own trees and grafted them into the tree.
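        A hedged illustration of that grafting idea, using nested arrayrefs as a stand-in AST (aXML_use_ast and the node shapes are invented for this sketch, not from the post):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An extension returns a subtree (nested arrayrefs) instead of a flat
# string, so the caller can splice it into the parent node directly,
# with no re-parsing of the returned text.
sub aXML_use_ast {
    my ($args, $data) = @_;
    return [ included => [ text => "opened file :: $data" ] ];
}

my $tree = [ body => [ text => 'orphan data 1' ] ];
push @$tree, aXML_use_ast({}, 'actions/default/body.aXML');    # graft, no rescan
```

        Evaluation then walks the tree once; nothing the extension returns ever goes back through the string-matching loop.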

        Last time, I promise—the first chapter or so of SICP explains evaluation order concerns. It would really help you to read it.

Re: Fast enough yet?
by Anonymous Monk on Aug 09, 2011 at 06:35 UTC

    No.