You'll always have scaling problems when you write code like:

    $aXML =~ s@</@{endtag}@gs;
    $aXML =~ s@/@{bs}@gs;
    $aXML =~ s@{endtag}@</@gs;
    $aXML =~ s@:@{colon}@gs;
    $aXML =~ s@;@{semicolon}@gs;
    ...
    $aXML =~ s@:;@\)\]@gs;
    $aXML =~ s@;([^:;]+?):@\[$1\(@gs;
    $aXML =~ s@`@>@gs;
    $aXML =~ s@{bs}@/@gs;
    $aXML =~ s@{colon}@:@gs;
    $aXML =~ s@{semicolon}@;@gs;
    ...
    while ($aXML =~ m@<([^<>]+?)>(.*?)</>@gs) { ... }
    ...
    while ($aXML =~ m@\[([^\[\]]*?)\]@gs) { ... }

By my count, you have to scan the aXML at least thirteen times to process it once, and some of those regular expressions have backtracking, so they'll end up scaling very badly too. For short aXML documents (a few dozen lines), it may be fast enough, but you'll start to notice performance degrade dramatically with documents of over a hundred lines.
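To make the cost of repeated scans concrete, here is a minimal sketch of how the first five substitutions quoted above could collapse into one pass. The helper name escape_once and the %placeholder table are my own inventions for illustration, and the sketch assumes the input contains no literal {endtag} text, a case the original sequence would treat differently.

```perl
use strict;
use warnings;

# Placeholder tokens copied from the quoted substitutions.
my %placeholder = ('/' => '{bs}', ':' => '{colon}', ';' => '{semicolon}');

# Hypothetical helper: one scan instead of five. The "</" alternative is
# tried first at each position, so the slash of a closing tag is left alone.
sub escape_once {
    my ($text) = @_;
    $text =~ s{(</)|([/:;])}{ defined $1 ? $1 : $placeholder{$2} }ge;
    return $text;
}

print escape_once('<a>x/y:z;</a>'), "\n";
# prints <a>x{bs}y{colon}z{semicolon}</a>
```

This only cuts down the number of scans. It does not address the backtracking in the later patterns, which is the bigger concern.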

With that said, this approach is more promising:

    my @chars = split //, $aXML;
    ...
    foreach my $char (@chars) {
        ...
    }

... because it scales linearly with the size of the document. Perl 5's not super fast at processing strings character-by-character, but if you can write a state machine and decide what kind of Perl data structure to build at every state change of the document, you're much better off in terms of performance. This is the job a lexer and a grammar do in a compiler or a custom language. (You can even identify places where you don't have enough information to decide what to do right then, as in the case of your extension system. You can encode that in your data structure and decide what to do during evaluation, once you know what you need to know.)
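As a rough sketch of that idea, the code below walks a string once, character by character, and builds a nested Perl data structure at each state change. The grammar is a toy I made up for illustration, not real aXML: "<name>" opens an element, "</>" closes the innermost one, and everything else is text. The names parse_toy, evaluate, and the handler table are likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical toy grammar, NOT real aXML: "<name>" opens an element,
# "</>" closes the innermost one, anything else is text.
sub parse_toy {
    my ($src) = @_;
    my $root  = { name => 'ROOT', children => [] };
    my @stack = ($root);    # open elements, innermost last
    my $state = 'text';     # either 'text' or 'tag'
    my $buf   = '';         # characters collected in the current state

    foreach my $char (split //, $src) {
        if ($state eq 'text') {
            if ($char eq '<') {
                # State change: flush pending text, start reading a tag.
                push @{ $stack[-1]{children} }, $buf if length $buf;
                ($state, $buf) = ('tag', '');
            }
            else {
                $buf .= $char;
            }
        }
        elsif ($char eq '>') {
            if ($buf eq '/') {
                # "</>" closes the innermost open element.
                die "unbalanced closing tag\n" if @stack == 1;
                pop @stack;
            }
            else {
                # Record the tag name only; what it means is decided later.
                my $node = { name => $buf, children => [] };
                push @{ $stack[-1]{children} }, $node;
                push @stack, $node;
            }
            ($state, $buf) = ('text', '');
        }
        else {
            $buf .= $char;    # still inside <...>
        }
    }
    die "unterminated tag\n" if $state eq 'tag';
    die "unclosed element $stack[-1]{name}\n" if @stack > 1;
    push @{ $root->{children} }, $buf if length $buf;
    return $root;
}

# Evaluation is a separate walk over the tree. Handlers are looked up only
# now, so the parser never needed to know which tag names exist.
sub evaluate {
    my ($node, $handlers) = @_;
    my $inner = join '',
        map { ref $_ ? evaluate($_, $handlers) : $_ } @{ $node->{children} };
    my $handler = $handlers->{ $node->{name} } or return $inner;
    return $handler->($inner);
}

my $tree = parse_toy('a<b>c<d>e</>f</>g');
print evaluate($tree, { b => sub { uc $_[0] } }), "\n";
# prints aCEFg
```

Each character is examined exactly once, and the parser only stores tag names in the tree. Deciding what a tag does is deferred to evaluate, which is one way to handle the "not enough information yet" case: an unknown name such as d above simply passes its content through.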

Higher-Order Perl and SICP both describe how to handle this.

Incidentally, this is why people often say "Don't use regular expressions to parse _____!" — not because it's impossible to do, but because regular expressions really don't let you identify the state of individual items within a document in a way amenable to handling them correctly.


In reply to Re: Fast enough yet? by chromatic
in thread Fast enough yet? by Logicus
