Mugatu

It is helpful (at least it was for me) to think of a parser as a pipelined process.

First you tokenize the string. Once you have broken all your text into the right bits, you pass them to the lexical analyzer (a.k.a. the lexer).

The lexer will then process the tokens and analyze them, determining their "type". Basically token1 is a string, token2 is an operator, token3 is a bracket, and so on.
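Here is a minimal sketch of those first two stages in Python. The token pattern, the type names ("number", "name", "operator", "bracket") and the function names `tokenize` and `classify` are my own assumptions for illustration; a real format like RTF would need its own rules.

```python
import re

# Hypothetical token pattern: a run of digits, an identifier,
# or any other single non-space character.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z_][A-Za-z0-9_]*|\S")

def tokenize(text):
    """Stage 1: break the text into raw string pieces."""
    return TOKEN_RE.findall(text)

def classify(tokens):
    """Stage 2: attach a type to each raw piece."""
    typed = []
    for tok in tokens:
        if re.fullmatch(r"[0-9]+", tok):
            kind = "number"
        elif re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", tok):
            kind = "name"
        elif tok in "+-*/=":
            kind = "operator"
        elif tok in "()":
            kind = "bracket"
        else:
            kind = "unknown"
        typed.append((kind, tok))
    return typed

raw = tokenize("x = (3 + 42) * y")
typed = classify(raw)
```

Keeping the two functions separate mirrors the pipeline idea: `tokenize` knows nothing about types, and `classify` knows nothing about the original string.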

Once you have a set of properly classified tokens, you can then build a tree that represents their structure. This is commonly called a "parse tree" (a trimmed-down version that keeps only the meaningful nodes is an abstract syntax tree). If you are familiar with the XML/HTML DOM, a DOM is basically a parse tree of an XML/HTML document.
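To make the tree-building stage concrete, here is a small recursive-descent sketch in Python. It assumes the tokens arrive as `(kind, text)` tuples with the hypothetical kinds "number", "name", "operator" and "bracket", and it only handles a toy arithmetic grammar, not RTF. The nested-tuple tree shape is likewise just one possible representation.

```python
def parse(tokens):
    """Build a nested-tuple tree from a list of (kind, text) tuples."""
    pos = 0

    def peek():
        # Look at the current token without consuming it.
        return tokens[pos] if pos < len(tokens) else ("end", "")

    def advance():
        # Consume and return the current token.
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():
        # expr := term (('+' | '-') term)*
        node = term()
        while peek() in (("operator", "+"), ("operator", "-")):
            op = advance()[1]
            node = (op, node, term())
        return node

    def term():
        # term := atom (('*' | '/') atom)*
        node = atom()
        while peek() in (("operator", "*"), ("operator", "/")):
            op = advance()[1]
            node = (op, node, atom())
        return node

    def atom():
        # atom := number | name | '(' expr ')'
        kind, text = advance()
        if kind == "number":
            return ("number", int(text))
        if kind == "name":
            return ("name", text)
        if (kind, text) == ("bracket", "("):
            node = expr()
            if advance() != ("bracket", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")

    tree = expr()
    if peek()[0] != "end":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree

# Tokens for: 1 + 2 * (x - 3)
tokens = [
    ("number", "1"), ("operator", "+"), ("number", "2"), ("operator", "*"),
    ("bracket", "("), ("name", "x"), ("operator", "-"), ("number", "3"),
    ("bracket", ")"),
]
tree = parse(tokens)
```

Each grammar rule gets its own function, so operator precedence falls out of which function calls which: `term` binds tighter than `expr` because `expr` is built out of `term` calls.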

At this point, you have your parse tree, and the parsing is completed. Now of course you need to figure out what to actually do with that parse tree :)

Now, the process I described is not the only way to parse. Many parsers do all this in one step, or combine a couple of steps (tokenizing and lexical analysis are commonly combined). But breaking it down into these steps was what helped me learn how to write parsers. Hope this helps.
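As an illustration of combining tokenizing and lexical analysis, here is a sketch in Python that does both in a single pass by giving each token type its own named group in one regular expression. The token types and the name `lex` are assumptions made for the example.

```python
import re

# Hypothetical token types; order matters, since earlier alternatives win.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("name", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator", r"[+\-*/=]"),
    ("bracket", r"[()]"),
    ("skip", r"\s+"),
    ("unknown", r"."),
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def lex(text):
    """Yield (kind, text) pairs, splitting and classifying in one pass."""
    for match in MASTER_RE.finditer(text):
        kind = match.lastgroup  # name of the group that matched
        if kind != "skip":      # drop whitespace
            yield (kind, match.group())

typed = list(lex("a = (1 + 2)"))
```

The classification comes for free from whichever named group matched, so there is no separate second loop over raw strings.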

-stvn

In reply to Re: Basics of parsing (using RTF as a testbed) by stvn
in thread Basics of parsing (using RTF as a testbed) by Mugatu
