johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've got a client-server socket setup where the connection is kept open for a long time, and the client sends commands to the server whenever it wants to. Each command is a mixture of ASCII and binary, say of the form:
START NAME FOO PAIRS 4 A 3 B 5 C 1.2 D some_binary_data_with_first_byte_indicating_length
Here the PAIRS value 4 indicates how many pairs follow (e.g., 4: A, B, C, D). I need to be able to handle the following situations: (1) commands may come in continuous succession, or very sparsely; (2) each command needs to be processed right away once it is complete, so I can't wait for the start of the next command; (3) there might not be a separator at the end of a command, not even a space; (4) a command may be corrupted, so "START" can appear before the previous one is done, in which case the previous one should be ignored.

I'm thinking I may need to implement a DFA myself, but that's already part of the built-in regex engine, so is there an easy way?

greatly appreciated, thanks in advance.

Re: regex for a socket stream
by BrowserUk (Patriarch) on Jul 17, 2004 at 00:25 UTC

    I think you're overstating the problems here. Even a socket stream isn't continuous: transmission is in packets, and reads are in buffer-sized chunks.

    Once you have your first buffer load, you can overlap processing it, or at least inspecting it, with the next read.

    Processing a buffer is just string manipulation, which Perl excels at. No need for low-level code here.

    If the first buffer load contains a START but not the complete command, then you have to wait for the next buffer load anyway. If the next buffer load contains another START before the previous command is complete, you discard the previous one.

    Tuning responsiveness -v- throughput comes down to adjusting the read size and making sure that the inspection process takes less time than the read.
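
    Here is a rough sketch of that accumulate-and-inspect loop in Perl. The listening port, the process_command() handler, and the assumption that the key/value pairs are all ASCII followed by a single length-prefixed binary blob are mine, not the poster's, so the header regex would need adjusting to the real format; it also doesn't guard against the bytes "START" turning up inside a half-received binary field.

        use strict;
        use warnings;
        use IO::Socket::INET;

        # Accept one long-lived client connection (the port number is made up).
        my $listener = IO::Socket::INET->new(Listen => 1, LocalPort => 9000)
            or die "listen: $!";
        my $sock = $listener->accept or die "accept: $!";

        my $buf = '';

        sub process_command {                  # placeholder for the real handler
            my ($cmd) = @_;
            printf "got %s with %d pairs\n", $cmd->{name}, scalar @{ $cmd->{pairs} };
        }

        # Try to peel one complete command off the front of the buffer.
        # Returns a hashref, or nothing if we must wait for more data.
        sub try_extract_command {
            my $start = index($buf, 'START');
            return if $start < 0;
            substr($buf, 0, $start, '') if $start > 0;   # junk before START
            pos($buf) = 0;

            if ($buf =~ /\GSTART\s+NAME\s+(\S+)\s+PAIRS\s+(\d+)\s+/gc) {
                my ($name, $npairs) = ($1, $2);
                my (@pairs, $short);
                for (1 .. $npairs) {
                    if ($buf =~ /\G(\S+)\s+(\S+)\s+/gc) { push @pairs, [ $1, $2 ] }
                    else                                { $short = 1; last }
                }
                unless ($short) {
                    my $at = pos($buf);
                    if (length($buf) > $at) {
                        my $len = unpack 'C', substr($buf, $at, 1);  # first byte = length
                        if (length($buf) >= $at + 1 + $len) {
                            my $binary = substr($buf, $at + 1, $len);
                            substr($buf, 0, $at + 1 + $len, '');     # consume the command
                            return { name => $name, pairs => \@pairs, binary => $binary };
                        }
                    }
                }
            }

            # Incomplete. If another START has already arrived, the current
            # command is broken, so discard it and try again (requirement 4).
            my $next = index($buf, 'START', 1);
            if ($next > 0) {
                substr($buf, 0, $next, '');
                return try_extract_command();
            }
            return;                                      # otherwise wait for more data
        }

        while (sysread($sock, my $chunk, 4096)) {
            $buf .= $chunk;
            while (my $cmd = try_extract_command()) { process_command($cmd) }
        }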


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: regex for a socket stream
by zentara (Cardinal) on Jul 16, 2004 at 19:32 UTC
    I wouldn't waste time sending raw data over the TCP connection and then trying to break it up with a regex. It is better to send it as a hash; then all you need to do is keep track of keys.

    I would use Net::EasyTCP and send the data as hash elements. You can send a hash as easily as $client->send(\%hash) , then just read the hash at the other end. I would base64-encode the binary data and decode it at the other end.

    I just posted something you could look at: Tk encrypted echoing-chat client and server

    If you don't want to use Net::EasyTCP, you can serialize the hash yourself, with Storable, and send it.
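
    If you go the Storable route, the usual trick is to length-prefix each frozen hash so the receiving side knows exactly how many bytes belong to it. A minimal sketch (assumes an already-connected socket in $sock; a production version would loop until all the bytes announced by the prefix have arrived):

        use strict;
        use warnings;
        use Storable qw(nfreeze thaw);

        # Sender: 4-byte network-order length, then the frozen hash.
        sub send_hash {
            my ($sock, $hashref) = @_;
            my $frozen = nfreeze($hashref);
            print {$sock} pack('N', length $frozen), $frozen;
        }

        # Receiver: read the length, then exactly that many bytes, then thaw.
        sub recv_hash {
            my ($sock) = @_;
            read($sock, my $lenbuf, 4) == 4       or return;
            my $len = unpack 'N', $lenbuf;
            read($sock, my $frozen, $len) == $len or return;
            return thaw($frozen);
        }

    One end calls send_hash($sock, { name => 'FOO', pairs => \%pairs, data => $binary }) and the other calls my $h = recv_hash($sock).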


    I'm not really a human, but I play one on earth. flash japh
      Thanks, zentara. The problem for me is that I do not have control over the client, so I have to work with what it sends me.
Re: regex for a socket stream
by saintmike (Vicar) on Jul 16, 2004 at 19:57 UTC
    That's not easily done with a regex; I'd recommend Parse::RecDescent to whip up a quick parser.
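
    For the ASCII half of the command, a Parse::RecDescent grammar might look something like the sketch below; the length-prefixed binary tail would still have to be peeled off by hand, and nothing here checks that the number of pairs actually matches the PAIRS count:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
            command: 'START' 'NAME' name 'PAIRS' count pair(s)
                     { $return = { name => $item[3], count => $item[5], pairs => $item[6] } }
            name:    /\w+/
            count:   /\d+/
            pair:    /\w+/ /\S+/   { $return = [ @item[1,2] ] }
        };

        my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";
        my $cmd = $parser->command('START NAME FOO PAIRS 2 A 3 B 5');   # ASCII-only example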
Re: regex for a socket stream
by blokhead (Monsignor) on Jul 16, 2004 at 20:09 UTC
    As far as I know, you can't match a regex against a filehandle -- you have to have the entire string ready at the time of the match. The regex engine won't ask for more data to become available.

    You could build a plain old DFA. A DFA implementation is really simple, but the painful part is building the transition table. A DFA is a very low-level machine, so you have to end up writing very low-level code for it. Plus, you generally read one character at a time, which isn't very efficient.

    What would be nicer is to have a state machine that reads one token at a time. So you should look at a lexer. A lexer will take a data stream (filehandle) and tokenize it as it comes in. This has the advantage of still processing the data as soon as it's available, while the token processing is at a much higher level than a character-by-character DFA. Look at Parse::Lex (documentation en français) and Parse::Flex.
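
    The same token-at-a-time idea can also be hand-rolled without a module. The sketch below pulls whitespace-delimited tokens off the front of a buffer and drives a small state machine with them; the binary field is left out to keep it short, and process_command() is just a stand-in:

        use strict;
        use warnings;

        my $buf   = '';
        my $state = 'IDLE';
        my (%cmd, $key, $pairs_left);

        sub process_command { printf "parsed %s\n", $_[0]{name} }   # stand-in handler

        # Return the next complete whitespace-delimited token, or nothing if the
        # buffer doesn't yet hold one (we need the trailing whitespace to know
        # the token has ended).
        sub next_token {
            return $1 if $buf =~ s/^\s*(\S+)\s+//;
            return;
        }

        # Feed whatever sysread() returned; emits commands as they complete.
        sub feed {
            $buf .= $_[0];
            while (defined(my $tok = next_token())) {
                if ($tok eq 'START') {            # a new START always restarts the machine
                    %cmd = (); $state = 'NAME_KW';
                }
                elsif ($state eq 'NAME_KW')  { $state = 'NAME'  if $tok eq 'NAME'  }
                elsif ($state eq 'NAME')     { $cmd{name} = $tok; $state = 'PAIRS_KW' }
                elsif ($state eq 'PAIRS_KW') { $state = 'COUNT' if $tok eq 'PAIRS' }
                elsif ($state eq 'COUNT')    { $pairs_left = $tok; $state = 'KEY' }
                elsif ($state eq 'KEY')      { $key = $tok; $state = 'VALUE' }
                elsif ($state eq 'VALUE') {
                    push @{ $cmd{pairs} }, [ $key, $tok ];
                    if (--$pairs_left > 0) { $state = 'KEY' }
                    else                   { process_command(\%cmd); $state = 'IDLE' }
                }
                # anything else is dropped until the next START
            }
        }

        # Typical driver:
        # while (sysread($sock, my $chunk, 4096)) { feed($chunk) }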

    blokhead