I have something that's almost a list of tokens separated with whitespace (like the Forth language). Ideally, I could just use split in a trivial way to get a list of tokens.

However, one construct is for string literals. Something like 8"foobar" or U"tweedle" could have spaces within the quotes, and the token is properly delimited by the closing quote, not the whitespace.

I don't want to have to go to a full-blown fancy parser just to handle this one little case. I think a two-pass system could do it, with the first pass noticing the quotes and escaping any spaces inside them. But that seems inelegant. Any regex wizards care to tackle this?
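For reference, the two-pass idea might look roughly like this. It is only a sketch: the name tokenize_two_pass is made up for illustration, and it assumes that quotes are always balanced, that a NUL byte never appears in the input (it is used as the placeholder), and that only plain spaces, not tabs, need protecting inside a literal.

```perl
use strict;
use warnings;

# Hypothetical helper: hide spaces inside "..." behind a placeholder,
# split on whitespace as usual, then put the spaces back.
sub tokenize_two_pass {
    my ($text) = @_;

    # Pass 1: inside each quoted run, turn spaces into NUL bytes.
    # Assumption: the input itself never contains "\x00".
    my $protected = $text;
    $protected =~ s{("[^"]*")}{ my $q = $1; $q =~ tr/ /\x00/; $q }ge;

    # Pass 2: the trivial whitespace split, then undo the placeholder.
    my @tokens = split ' ', $protected;
    tr/\x00/ / for @tokens;
    return @tokens;
}

print "[$_]\n" for tokenize_two_pass('dup 8"foo bar" swap U"tweedle dee" .');
```

It works on well-formed input, but every string literal gets touched twice, which is the inelegance I'd like to avoid.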

Two ideas: split is told what the delimiter is, as opposed to what to keep (as with m//g). Using advanced regex features, I could tell it to reject a space if it's in the middle of a quotation. Using @list = m/blah/g instead of split turns the problem around, and might be more straightforward by some ways of thinking about it, because each match doesn't need to look outside of the area it's working on.
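To make the two ideas concrete, here is a rough sketch of each. The names tokenize_match and tokenize_split are invented for this example, and both assume that a literal is an optional prefix glued to a double-quoted run with no escaped quotes inside it. The split version uses a lookahead that only accepts whitespace followed by an even number of quote characters, so it also assumes balanced quotes and no leading whitespace.

```perl
use strict;
use warnings;

# Idea 1 (hypothetical helper): say what a token looks like, collect all matches.
sub tokenize_match {
    my ($text) = @_;
    return $text =~ m{
        (
            [^\s"]* " [^"]* "    # optional prefix, then a quoted run that may hold spaces
          |
            [^\s"]+              # or a plain run of non-space, non-quote characters
        )
    }gx;
}

# Idea 2 (hypothetical helper): keep split, but refuse whitespace inside quotes.
# The lookahead succeeds only when the rest of the string holds an even
# number of quote characters, meaning we are outside any literal.
sub tokenize_split {
    my ($text) = @_;
    return split /\s+(?=(?:[^"]*"[^"]*")*[^"]*\z)/, $text;
}

my $src = 'dup 8"foo bar" swap U"tweedle dee" .';
print "match: [$_]\n" for tokenize_match($src);
print "split: [$_]\n" for tokenize_split($src);
```

The m//g form only ever looks at the token in front of it, while the split form rescans the rest of the string at every candidate delimiter. Note that with an unterminated literal the m//g form silently skips the stray quote rather than reporting it.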

But what about two distinct regular expressions sharing the same current position? See which one matches at the current spot, and immediately know what to do with it rather than having to figure it out again. I thought I saw something about that once... the current position is part of the string, not part of the regex. But doesn't the regex instance also keep track of something?
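If I have it right, the match position is stored on the scalar (readable with pos), a scalar-context //g match advances it, the /c modifier keeps it from being reset when a match fails, and \G anchors a pattern at that position. Under those assumptions, a scanner with several patterns sharing one position could be sketched like this. The name tokenize_scan and the STRING/WORD labels are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string once, trying each pattern at pos($text).
sub tokenize_scan {
    my ($text) = @_;
    my @tokens;
    while (1) {
        $text =~ /\G\s+/gc;              # skip whitespace; /c keeps pos() if this fails
        last if $text =~ /\G\z/gc;       # reached the end of the input

        if ($text =~ /\G([^\s"]*"[^"]*")/gc) {   # optional prefix plus quoted run
            push @tokens, [ STRING => $1 ];
        }
        elsif ($text =~ /\G([^\s"]+)/gc) {       # ordinary word
            push @tokens, [ WORD => $1 ];
        }
        else {
            # Only reachable when a quote is opened and never closed.
            die "Unterminated string literal at offset " . (pos($text) // 0) . "\n";
        }
    }
    return @tokens;
}

printf "%-6s %s\n", @$_ for tokenize_scan('dup 8"foo bar" swap U"tweedle dee" .');
```

Each branch knows what kind of token it just matched, so nothing has to be classified a second time, and an unclosed quote is reported instead of being skipped.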


In reply to Not quite a simple split by John M. Dlugosz
