in reply to Parsing Text into Arrays..

First test results:

Testdata:

({"startup-req-3",5,"PerlMud-i3",0,"*gcs",0,0,0,0,0,0,0,"PerlLib v1"," +PerlLib v1","Perl v5.6.1","Perl","restricted access","blah@blah.de",( +["tell":1,"ucache":1,]),0,})
IO's code produces [], merlyn's produces undef (I tried changing the parens to ([ ]) etc. here), Gilimanjaro's gives up on the hash, and demerphq's parser is created before the benchmark runs.
So I guess the timings are a little skewed anyway ;)

Benchmark: running IO, castaway, demerhpq, merlyn, browserUK, each for at least 5 CPU seconds...
       IO:  1 wallclock secs ( 1.09 usr + 0.01 sys = 1.10 CPU) @  37.27/s (n=41)
browserUK:  1 wallclock secs ( 1.04 usr + 0.01 sys = 1.05 CPU) @ 113.33/s (n=119)
 castaway:  1 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @ 103.74/s (n=111)
 demerhpq:  1 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @   3.74/s (n=4)
   merlyn:  1 wallclock secs ( 1.04 usr + 0.01 sys = 1.05 CPU) @ 187.62/s (n=197)

              Rate demerhpq    IO castaway browserUK merlyn
demerhpq    3.74/s       --  -90%     -96%      -97%   -98%
IO          37.3/s     897%    --     -64%      -67%   -80%
castaway     104/s    2675%  178%       --       -8%   -45%
browserUK    113/s    2932%  204%       9%        --   -40%
merlyn      188/s    4919%  403%      81%       66%     --
Benchmark called like this:
cmpthese( -5, {
    # Gilimanjaro => sub { my @arr = parse($testarray1) },
    castaway  => sub { my @arr = lpcarray_to_array( $testarray1 );   print Dumper(@arr) },
    IO        => sub { my @arr = lpcarray_to_arrayIO( $testarray1 ); print Dumper(@arr) },
    merlyn    => sub { my @arr = parse_string( $testarray1 );        print Dumper(@arr) },
    demerhpq  => sub { my $arr = $parser->expr( $testarray1 );       print Dumper($arr) },
    browserUK => sub { my $arr = build_structure( $testarray1 );     print Dumper($arr) },
} );
More to come when I figure out how to get the rest running properly ;)

Update: Hmm, that seems to be junk, since cmpthese isn't reading $testarray1 (which is declared as 'my').. *sigh*
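One common cmpthese gotcha (whether or not it was the culprit here): benchmark code handed over as a *string* is eval'd inside Benchmark.pm, where a file-scoped lexical is invisible, whereas a code ref is a closure and sees it fine. A minimal sketch, with an illustrative data string:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $testarray1 = '({"startup-req-3",5,})';   # file-scoped lexical

cmpthese( 10_000, {
    # A code ref is a closure, so it can see $testarray1:
    closure => sub { length $testarray1 },
    # A string snippet, by contrast, is eval'd inside Benchmark.pm,
    # where the lexical is not visible -- it would silently see undef:
    # as_string => 'length $testarray1',
} );
```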
Update 2: Real results, I hope, thanks BrowserUK!
my script results (not a full time webserver)

C.

Replies are listed 'Best First'.
Re: Re: Parsing Text into Arrays..
by BrowserUk (Patriarch) on Jan 22, 2003 at 15:53 UTC

    FWIW: Here's a cleaned up version of mine, which runs a bit quicker as a consequence. There were several stupidities in the previous version.

    Update: Code updated to correct a major flaw exposed by better testing. It slowed down a bit, but not by much.

    our $re_terminate = qr[ \s* (?: : | , | $ ) ]x;
    our $re_content   = qr[ \s* \( ( [\[\{] ) (.+) [\]\}] \) $re_terminate ]x;
    our $re_nested    = qr[ \( .+ \) ]x;
    our $re_serial    = qr[ ( \( .+? \) ) $re_terminate ]x;
    our $re_qq        = qr' " ( .+? ) (?<!\\)" 'x;   # lookbehind: closing quote must not be escaped
    our $re_csv       = qr[ \G \s* (?: $re_qq | ( $re_nested ) | ( [^\s:,]+? ) ) $re_terminate ]x;

    sub build_structure {
        warn( "Data didn't match spec.", $_[0] ), return undef
            unless $_[0] =~ m[$re_content];
        my( $reftype, $content, $chunk, @bits ) = ( $1, $2 );
        my $p = pos( $content );
        while( $content =~ m[$re_csv]gc ) {
            $chunk = $1 || $3 || $2 || '0';
            if( $chunk =~ m[^ \( [^\(]+? \) ]x ) {
                pos( $content ) = $p;
                $content =~ m[$re_serial]gc;
                $p = pos( $content );
                $chunk = $1;
            }
            push @bits, $chunk =~ m[^\(] ? build_structure( $chunk ) : $chunk;
        }
        return $reftype eq '[' ? { @bits } : \@bits;
    }

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Re: Parsing Text into Arrays..
by demerphq (Chancellor) on Jan 22, 2003 at 16:01 UTC
    Hmm. While I am not surprised that the P::RD solution I wrote is the slowest, I would like to point a couple of things out. Most of these solutions are buggy or inappropriate for the data set you are using in some way or another (I can feel the venom in BrowserUk's eyes already ;-)

    And by buggy I mean the following: I don't believe you can consider a parser correct if it accepts something other than the specification it was intended to parse. I.e., I consider a parser to work correctly if and only if it parses its language specification correctly and nothing more.

    Anyway, here's my analysis of the different solutions. Additional observations are welcome.

    • First off, IO's solution only returns the innermost array. Furthermore, it throws the exception
      Unmatched ( before HERE mark in regex m/\(\{( << HERE \(\{()\}\)/ at C:\Temp\castaway.pl line 282.
      when you feed it erroneous data like: '({({})'
    • Your own solution, castaway, returns empty values in the last slot of the array (remember, I pointed this out earlier). It also returns ['']; when you feed it '({({})'. Similarly, it doesn't handle barewords within arrays like '({foo})' properly, and it returns '})' for '({})})'. Oh, and it also seems to return the wrong value for ucache. Not to mention weirdness for ({(["foo":"bar"])})
    • merlyn's solution doesn't handle hash values (it seems to work OK without them, however). The only other case (that I checked) that his code choked on was '({})})'. (It parsed that as [].)
    • BrowserUk's seems to work pretty well. It will return an error message for some strings that aren't correct, but it still doesn't correctly handle '({})})' or '({foo})', returning [')'] and ['o'] instead.
    • From the test cases that I used, it seems that my version correctly parses or rejects all the input I threw at it. Furthermore, by changing the grammar slightly to
      value : string | number | array | hash
      it almost doubles its speed. No great feat, I realize, considering that it now does about 25 parses a second. :-)
    So, first off, I would say that when you test code you should test both positive and negative cases. Second, I can't feel bad about having the slowest solution, as it is quite frankly the only one that correctly parses or rejects the data it receives.
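    To make the positive/negative point concrete, here is a minimal sketch of such a test set. The checker is only a stand-in (a bare delimiter-balance check, not any of the parsers in this thread):

```perl
use strict;
use warnings;

# Stand-in "parser": accepts a string iff its (), {}, [] delimiters
# balance. Caveat: a real LPC parser must also skip delimiters that
# appear inside double-quoted strings; this sketch does not.
sub balanced {
    my( $s ) = @_;
    my %close = ( '(' => ')', '{' => '}', '[' => ']' );
    my @stack;
    for my $c ( split //, $s ) {
        if( $close{$c} )        { push @stack, $close{$c} }
        elsif( $c =~ /[)}\]]/ ) { return 0 unless @stack and pop( @stack ) eq $c }
    }
    return !@stack;
}

# Exercise valid AND invalid sentences:
for my $s ( '({"a",1,})', '({(["k":1,]),})' ) {
    print balanced($s) ? "ok    accepts" : "FAIL  rejects", " valid:   $s\n";
}
for my $s ( '({({})', '({})})' ) {
    print balanced($s) ? "FAIL  accepts" : "ok    rejects", " invalid: $s\n";
}
```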

    To quote from Code Complete (28.2 Code Tuning)

      A fast program is just as important as a correct one--False! It's hardly ever true that programs need to be fast or small before they need to be correct. Gerald Weinberg tells the story of a programmer who was flown to Detroit to help debug a program that was in trouble. The programmer worked with the team of programmers who had developed the program, and after several days he concluded that the situation was hopeless.

      On the flight home, he mulled over the last few days and realized the true problem. By the end of the flight, he had outlined the necessary code. He tested it for several days and was about to return to Detroit when he got a telegram saying the project had been cancelled because the program was impossible to write. He headed back to Detroit anyway and convinced the executives that the project could be completed.

      He then had to convince the project's original programmers. They listened to his presentation, and when he'd finished, the creator of the old system asked,

      "And how long does your program take?"

      "That varies, but about ten seconds per input."

      "Aha! But my program takes only one second per input." The veteran leaned back, satisfied that he'd stumped the upstart programmer. The other programmers seemed to agree, but the new programmer wasn't intimidated.

      "Yes, but your program doesn't work. If mine doesn't have to work, I can make it run instantly and take up no memory. "

    (emphasis added by me.)

    However, I have no doubt that with a bit of hacking all of the people involved here can fix their stuff and still have it faster than the P::RD approach; but this is a good example of why proper testing of both positive and negative cases is essential. It's also a good example of why P::RD is a cool tool: you are much more likely to produce a correct (but slow) result with it than without it, and to do so in much less time than with another approach.

    A last point about all of this: you are receiving packets over a network connection, so I'm guessing that just the network part takes much longer than even the slowest solution. Thus the network time is going to swamp the parse time by a great deal, which to me suggests there's no point in optimizing this.

    I think having a read of the Code Complete section is a worthy use of your time.

    Cheers,

    --- demerphq
    my friends call me, usually because I'm late....

      Seems you are anticipating a response from me, so here it is--sans venom (Good ol' Optrex :^).

      I dont believe that you can consider a parser correct if it accepts something other than the specification it was intended to parse.

      I agree.

      However (there had to be a however :), given that the specification was 2 simple examples of sample input, it's difficult to derive a complete specification or a full set of testcases. My original version already exceeded the specification, inasmuch as it attempted to handle the case of delimiters embedded in quotes.

      Whilst you can sit down and derive a set of possible testcases that it should handle, in the absence of an accurate, full specification all you can do is guess. I also think that, given the nature of PM, it falls on the OP to test and modify as necessary any solutions provided. If I were writing for production purposes, and being paid to do it, I would do considerably more testing, but then I would also demand more information up front regarding the nature of the task.

      That in no way contradicts your completeness argument, but it does add a little context to the equation.

      In a much later post, castaway posted a real sample of the (predefined to him/her) protocol:

      ({"startup-req-3",5,"PerlMud-i3",0,"*gcs",0,0,0,0,0,0,0,"PerlLib v1","PerlLib v1","Perl v5.6.1","Perl","restricted access","blah@blah.de",(["tell":1,"ucache":1,]),0,})

      which appears to indicate that no spaces are included (except within double quotes), in contrast with the earlier simple examples. If the real specification states that there will be no inter-term whitespace, that would simplify the problem and probably speed up the processing.
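      As a sketch of how a no-whitespace guarantee simplifies things: the lexer collapses to one \G-anchored alternation, after which a simple recursive pass over the tokens can build the structure. This is illustrative code, not one of the benchmarked solutions:

```perl
use strict;
use warnings;

# Tokenize one packet, assuming no whitespace outside double quotes.
# Tokens: structural delimiters, quoted strings, and bare atoms.
sub tokens {
    my( $s ) = @_;
    my @t;
    while( $s =~ / \G (?: ( \(\{ | \}\) | \(\[ | \]\) | , | : )
                       | ( " (?: [^"\\] | \\. )* " )
                       | ( [^\s,:()\[\]{}"]+ )
                       ) /gcx ) {
        push @t, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return \@t;
}
```

      For example, tokens('({"a",1,})') returns the six tokens '({', '"a"', ',', '1', ',', '})'.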

      A last point about all of this: you are receiving packets over a network connection, so I'm guessing that just the network part takes much longer than even the slowest solution. Thus the network time is going to swamp the parse time by a great deal, which to me suggests there's no point in optimizing this.

      True, speed isn't everything (or indeed anything, unless the code 'works'), but I eschew the idea that taking longer than necessary is OK just because the process is IO-bound anyway.

      There are very few cases these days where a computer is either single-tasking, or doing only one task.

      In the case of a MUD, whilst some part of the process is reading and passing on input from the network, concurrently there is almost certainly another part of the program processing the input received and updating state as a consequence of it; another part getting, validating and actioning user (keyboard/mouse) input. If it is a graphical MUD, then maintaining the screen is likely to require every spare cycle it can get, and then some.

      Even if this is a single-threaded/tasking MUD that cannot itself do any concurrent processing, the user is quite likely to be running a browser, compiler, chat progs etc. simultaneously, and any extra cycles the MUD spends parsing network input detract from them and anything else running. If the parser were being used at the server end for receiving information back from multiple clients on separate threads, the performance would become critical.

      Back to the Completeness argument.

      Maybe the antithesis of the article you cite is the adage (possibly by one Paul Schindler, but I haven't confirmed this, nor looked to check his credentials): "'The perfect' is the enemy of 'the good'".

      Or, as I've seen it written: "'Good enough' is good enough" and "OP = OP" (with apologies to my Dutch friends if the latter is misinterpreted).

      In this case, if the server sent me a string of ({({}) or ({foo}) or ({})}), I would consider the output routine at the server broken, rather than the input parser. Of course, corruption can occur during transmission, but I would expect that to be detected and corrected by the transmission protocol rather than needing to account for the possibility in the parser.

      Of course, if the protocol/parser involved were responsible for running a nuclear power station, a heart monitor, the space shuttle (or a commercial aeroplane) or the stock market, I would feel differently about this; but we are talking about a MUD.

      Finally, I am not at all sure that this is an 'answer' to your post exactly, nor even contradictory to it. It's more that there's more than one point of view, and that neither is necessarily wrong. Sometimes one is more applicable to a given situation than another.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        Ok here goes browseruk...

        given the specification was 2 simple examples of sample input, it's difficult to derive a complete specification or a full set of testcases

        Yah, fair point. (Look what I did with my original attempt in the readmore. :-) My view, however, is that given the overall context it is reasonable to assume that unbalanced delimiters are illegal, and also that delimiters embedded in quotes should be handled OK. Now, I agree that this is an assumption (as in Ass-u-me), but likewise I think you would agree that it's hardly an extreme one.

        If I were writing for production purposes, and being paid to do it, I would do considerably more testing, but then I would also demand more information up front regarding the nature of the task.

        That in no way contradicts your completeness argument, but it does add a little context to the equation.

        Again a fair point and a reasonable position. I still feel that in this case the need to handle unbalanced delimiters is not that unreasonable. Also, from this position I would grant handling barewords as strings, but I think that handling ({foo}) as [ 'o' ] can fairly be construed as a bug: either it should have been accepted as [ 'foo' ] or rejected as an invalid sentence. But I suppose the specs could have indicated that that was the correct parse, but.... ;-)

        which appears to indicate that no spaces are included (except within double quotes), which contrasts with the earlier simple examples. If the real specification states that there will be no inter-term white space, then it would simplify the problem and probably speed up the processing.

        I don't really see how it would have improved the situation much, but presumably it reflects your seriously-deep-end regex approach (which I say with more respect than it may sound; I'm in a good mood right now, please forgive me ;-) and which I admit I didn't look into too closely. I think I would probably assume the solution needs to be whitespace-tolerant just out of conventional practice. But then again, I've never coded any Python. :-) It does seem justifiable to take either approach, however, given the lack of specs, as you pointed out earlier.

        but I eschew the idea that taking longer than necessary is OK if the process is IO-bound anyway

        There are very few cases these days where a computer is either single-tasking, or doing only one task.

        Sure, there's something to what you say, but I think when you start talking about optimization there's a line where the question goes from "how do I make my Perl code faster" to "which other natively compiled language should I convert my script to". It's always a hard call to make, though, and I agree that while confined to Perl (or any other language) a good programmer keeps differences in speed in mind when selecting his way of doing things. Of course correctness has to come first, but after that everything is a balancing act.

        In this case, if the server sent me a string of ({({}) or ({foo}) or ({})}), I would consider the output routine at the server broken rather than the input parser. Of course, corruption can occur during transmission, but I would expect that to be detected and corrected by the transmission protocol rather than needing to account for the possibility in the parser

        Sorry, but this part I just don't agree with. Of course, if something sends you ({})}) then it's their error, but you wouldn't even have known that there was an error. You might, after a certain period, become concerned that things weren't right; perhaps some serious fault would have occurred later on that caused you to become suspicious; but without a lot of guesswork and research, and quite possibly some added code to catch the error, you really couldn't have said what it was. And for me that's the last place I want to be. Admittedly these things happen, and not all of my code is bulletproof, but I think this type of thing counts as basic handling. Like putting .. or die ... on little throwaway scripts where you know the file is there, your honor. You do it because in the long run it saves you a lot of hassle.

        The other thing for me is that these seem to be basic test cases. If you fail these, then there are probably (not necessarily, perhaps, but probably) valid sentences that you will also fail; at least, I would never feel comfortable that there were none. I suppose it's true that, lacking a formal proof of your code, you can rarely be certain that it is completely error-free, but these cases seem so basic that I can't agree that failing them is acceptable.

        but we are talking about a MUD

        Um, OK. I think we see this in pretty different ways. For the record, I had no idea what a MUD was and frankly didn't care; to me this was a bunch of interesting questions rolled up into one, which for me have practical uses. For instance, parsing a config file or a complicated command line, or a safe undumper. So having that kind of error catching is essential. From your point of view it was a tiny cog in a big machine which, to a certain extent, isn't that important: controlling its input is probably the least of your worries, so move on to other things. I guess that's fair. I have certainly written code that does minimal checking in the expectation that most error cases are simply impossible (usually because they have been dealt with elsewhere, but not always).

        I think there's another issue here. We're posting on a site where people come to learn stuff. Scary thought as it may be, there's a good chance somebody out there has learned something from one or more of our posts. I certainly come here and learn from people, all the time actually; it's why I come back. So I think that when we post we need to put a little extra effort into being correct and also complete, accepting it when someone calls us on it, and checking the quality of our colleagues' contributions. I've been embarrassed by being woefully wrong in the past, but in the end I appreciated that someone told me of it. Even if what they said gave me nothing more than a bad conscience and the thought that I was wrong, I learned from it, at least by figuring out whether they were right or not.

        Going back to the people reading our posts, I think the point is that one of them might think a solution is "complete and correct". I suppose it can be argued that if they don't check into it independently then it's their own fault, but I personally hope that some of the stuff I've learned here doesn't need to be researched further.

        Anyway, the above is more an explanation of my approach than a comment on this situation. Even if we see this differently, it has quite possibly been (or will be) useful to somebody to see the different views and the associated comments. I will certainly remember your multi-tasking point the next time the subject of optimization comes up.

        Cheers BrowserUk

        --- demerphq
        my friends call me, usually because I'm late....

      Fun..
      OK, let's answer a few points. You are of course correct that a finished solution should cater for incorrect input and act accordingly. It's also useful that the P::RD version does this without much extra effort.
      And it's missing from my code, though adding it won't be too much hassle, I think. :) I'll go look at the errors in mine, and implement a speed improvement that occurred to me while getting up (all the best ideas appear in the waking-up phase, somehow..)

      The second point is that for incorrect data to be 'correctly' handled, one also has to define what should happen when the data is bad. So here's another challenge for you: look at the data on castaway's scratchpad. If this sort of packet is incomplete, it could still contain complete information for some of the muds, which I can use; so how about returning all the 'full' lists when possible? Though that's now data-bound, and not just about complete structures any more.. If it's not clear, the first seven entries are just a header; the first 'mud' information starts at (["Figment", and goes on until the end of the hash.

      I never doubted that Parse::RecDescent was cool :) And I've got a 'Code Complete' around here somewhere, he's right too..

      Oh, did you benchmark working versions of IO's and merlyn's solutions?
      /me goes to add some error-checking..

      C.

      Is there any way to modify the grammar so it will parse TCL lists?

      e.g.

      {{name {first bob} {last bar}} {gay yes}}{{name {first josh} {last logitech}} {happy no}}

      Thanks!
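      Not a P::RD grammar tweak, but since TCL-style lists nest with a single brace pair, a small explicit stack can parse them directly. A sketch only; it ignores TCL's backslash escapes and braces inside words:

```perl
use strict;
use warnings;

# Parse TCL-ish brace lists into nested array refs.
# Caveat: no handling of backslash escapes or braces inside words.
sub tcl_list {
    my( $s ) = @_;
    my @stack = ( [] );                      # bottom element collects results
    while( $s =~ / \G \s* ( \{ | \} | [^\s{}]+ ) /gcx ) {
        my $tok = $1;
        if( $tok eq '{' ) {
            push @{ $stack[-1] }, my $new = [];
            push @stack, $new;               # descend into the new list
        }
        elsif( $tok eq '}' ) {
            @stack > 1 or die "unbalanced '}'";
            pop @stack;                      # ascend
        }
        else {
            push @{ $stack[-1] }, $tok;      # bare word
        }
    }
    @stack == 1 or die "unclosed '{'";
    return $stack[0];
}
```

      For instance, tcl_list('{a {b c}}') yields [ [ 'a', [ 'b', 'c' ] ] ], and the two concatenated top-level lists in the example above would come back as two elements of the returned array ref.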

Re: Re: Parsing Text into Arrays..
by Gilimanjaro (Hermit) on Jan 23, 2003 at 00:32 UTC
    Challenge! I'll build you one that does hashes too!

    Both an 'eval' (=evil) one, and one using a stack!

    The heat is on!