in reply to Fast enough yet?

You'll always have scaling problems when you write code like:

$aXML =~ s@</@{endtag}@gs;
$aXML =~ s@/@{bs}@gs;
$aXML =~ s@{endtag}@</@gs;
$aXML =~ s@:@{colon}@gs;
$aXML =~ s@;@{semicolon}@gs;
...
$aXML =~ s@:;@\)\]@gs;
$aXML =~ s@;([^:;]+?):@\[$1\(@gs;
$aXML =~ s@`@>@gs;
$aXML =~ s@{bs}@/@gs;
$aXML =~ s@{colon}@:@gs;
$aXML =~ s@{semicolon}@;@gs;
...
while ($aXML =~ m@<([^<>]+?)>(.*?)</>@gs) { ... }
...
while ($aXML =~ m@\[([^\[\]]*?)\]@gs) { ... }

By my count, you have to scan the aXML at least thirteen times to process it once, and some of those regular expressions have backtracking, so they'll end up scaling very badly too. For short aXML documents (a few dozen lines), it may be fast enough, but you'll start to notice performance degrade dramatically with documents of over a hundred lines.

With that said, this approach is more promising:

my @chars = split //, $aXML;
...
foreach my $char (@chars) { ... }

... because it scales linearly with the size of the document. Perl 5's not super fast at processing strings character by character, but if you can write a state machine and decide what kind of Perl data structure to build at every state change in the document, you're much better off in terms of performance. This is what a lexer and grammar do, in compiler and custom-language terms. (You can even identify places where you don't have enough information to decide what to do right then, as in the case of your extension system; encode that in your data structure, and during evaluation decide what to do once you know what you need to know.)
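To make that concrete, here's a minimal sketch of a one-pass, character-by-character state machine. The `<name>...</>` tag syntax is an assumption based on the patterns quoted above, not the real aXML grammar, and the `{ tag, kids }` node shape is just one convenient structure to build:

```perl
use strict;
use warnings;

# One-pass parser for a toy tag syntax like "<b>hello <i>world</></>".
# Builds a tree of { tag => ..., kids => [...] } nodes in a single scan.
sub parse_toy {
    my ($text)  = @_;
    my $root    = { tag => 'ROOT', kids => [] };
    my @stack   = ($root);        # innermost open tag is $stack[-1]
    my $state   = 'text';         # 'text' or 'tagname'
    my $buffer  = '';

    for my $char (split //, $text) {
        if ($state eq 'text') {
            if ($char eq '<') {
                push @{ $stack[-1]{kids} }, $buffer if length $buffer;
                $buffer = '';
                $state  = 'tagname';
            }
            else {
                $buffer .= $char;
            }
        }
        else {    # collecting a tag name
            if ($char eq '>') {
                if ($buffer eq '/') {              # "</>" closes the current tag
                    pop @stack;
                }
                else {                             # "<name>" opens a new node
                    my $node = { tag => $buffer, kids => [] };
                    push @{ $stack[-1]{kids} }, $node;
                    push @stack, $node;
                }
                $buffer = '';
                $state  = 'text';
            }
            else {
                $buffer .= $char;
            }
        }
    }
    push @{ $stack[-1]{kids} }, $buffer if length $buffer;
    return $root;
}

my $tree = parse_toy('<b>hello <i>world</></>');
# $tree->{kids}[0] is the <b> node; its kids are 'hello ' and the <i> node.
```

Every character is examined exactly once, so doubling the document doubles the work, no matter how deeply the tags nest.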

Higher-Order Perl and SICP both describe how to handle this.

Incidentally, this is why people often say "Don't use regular expressions to parse _____!" — not because it's impossible to do, but because regular expressions really don't let you identify the state of individual items within a document in a way amenable to handling them correctly.

Re^2: Fast enough yet?
by Logicus (Initiate) on Aug 06, 2011 at 12:24 UTC

    A state machine that does it all in one pass has been my holy grail for like 4 years now. Btw, when I say I've been working on it for 4 years, that's not like 4 years of sustained effort or anything; it's a good coding session every now and then, interspersed with a lot of other things!

    Surely this $aXML =~ s@;([^:;]+?):@\[$1\(@gs; is the only backtracker there, and it also comes prior to the "save this to disc, then process" marker, so it only gets run once while the page is still in its raw aXML state, not every time the page is requested.

    That leaves this while ($aXML =~ m@\[([^\[\]]*?)\]@gs) { ... }

    Which makes as many passes as it needs to decode the structure, in what I visualise as a sort of 3D way: the innermost tags are the highest peaks, getting mown down one tag height at a time until the document is flat.
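That innermost-first "mowing" can be sketched like this, where `eval_group()` is a hypothetical stand-in for whatever each group should evaluate to (the real aXML processing does more than wrap parentheses):

```perl
use strict;
use warnings;

# Hypothetical placeholder for per-group processing.
sub eval_group { my ($inner) = @_; return "($inner)" }

my $aXML = 'a [b [c] d] e';

# [^\[\]]* can't contain brackets, so each substitution hits an
# innermost group; repeat until none remain.
1 while $aXML =~ s/\[([^\[\]]*)\]/eval_group($1)/e;

# Each pass rescans from the start of the string, so the total work
# grows with both document length and nesting depth -- the scaling
# problem discussed above.
```

After the loop, `$aXML` holds the fully flattened result with every group evaluated inside-out.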

    Hrm, maybe I could save the $level data into an array to determine character positions for processing... that would help!

      You wouldn't have to scan the entire structure for every invocation if you returned ASTs from extensions rather than strings that needed parsing, or if you parsed the strings returned from extensions into their own trees and grafted them into the tree.
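A rough sketch of the grafting idea, assuming the same `{ tag, kids }` node shape as before; the extension-call convention (a code ref per tag name, returning a subtree) is an assumption for illustration:

```perl
use strict;
use warnings;

# Walk the tree once; wherever a node's tag names an extension, call it
# and graft the returned subtree in place. No string round-trip, no
# rescan of the whole document.
sub expand_extensions {
    my ($node, $extensions) = @_;
    return $node unless ref $node;    # plain text leaf: nothing to do

    if (my $ext = $extensions->{ $node->{tag} }) {
        $node = $ext->($node);        # extension returns an AST, not a string
    }
    $node->{kids} = [ map { expand_extensions($_, $extensions) } @{ $node->{kids} } ];
    return $node;
}

my $tree = { tag => 'ROOT', kids => [ { tag => 'shout', kids => ['hi'] } ] };
my $expanded = expand_extensions($tree, {
    shout => sub {
        my ($n) = @_;
        return { tag => 'text', kids => [ uc $n->{kids}[0] ] };
    },
});
# $expanded->{kids}[0] is now a 'text' node containing 'HI', grafted in place.
```

Each node is visited once per walk, so extension expansion stays proportional to the size of the tree rather than to the number of times the document gets re-parsed.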

      Last time, I promise—the first chapter or so of SICP explains evaluation order concerns. It would really help you to read it.

        Chromatic, I want you to know that your responses have been among those I have most valued and enjoyed reading. And I'm not just saying that because I know who you are, I mean it, you and Boldra + a couple of others have taught me huge amounts in a very short time, and I don't believe I ever even caught a hint of condescension in the words coming from your direction. So I just wanted to say that and thanks.