
I frequently have to parse all kinds of output, and every case is different, but I tend to go through the following process. Once you get the hang of it, each step makes the next one trivial:

1) identify the lexical structure of the material -- can it be multiline, does indentation matter, etc.?

2) create a simple lexical analyser out of a hash of regexes and token names.

3) create a thrower or two that ejects white space and/or empty lines, comments etc.

4) create a trivial parser that calls the trivial lexer and thrower and has a subroutine to handle each type of opening landmark (an identifying string in the input). Each subroutine typically loads its section into a suitable structure, or prints it directly once the closing landmark marks the end of the section.

5) if not printing as we go, traverse and print the structure
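To make the five steps concrete, here is a minimal sketch of all of them in one script. The input format (host blocks in braces), the token names, and the helpers next_token, parse_host and parse are invented for illustration. It uses an array of regexes rather than a hash so that the match order is fixed.

```perl
use strict;
use warnings;

# Step 1: hypothetical input. Tokens may span lines, indentation is
# irrelevant, and '#' starts a comment that runs to end of line.
my $input = <<'END';
# sample configuration
host alpha {
    port 80
}

host beta {
    port 8080
}
END

# Step 2: lexer table of [ regex, token name ] pairs, tried in order.
my @LEX = (
    [ qr/\s+/ => 'TOK_WS'      ],
    [ qr/#.*/ => 'TOK_COMMENT' ],
    [ qr/\{/  => 'TOK_OPEN'    ],
    [ qr/\}/  => 'TOK_CLOSE'   ],
    [ qr/\w+/ => 'TOK_ID'      ],
);

# Returns (token name, matched text) and consumes the match from $$buf.
sub lex {
    my ($buf) = @_;
    return ('TOK_EOF', '') unless length $$buf;
    for my $rule (@LEX) {
        my ($re, $tok) = @$rule;
        return ($tok, $1) if $$buf =~ s/^($re)//;
    }
    die "unhandled content at: " . substr($$buf, 0, 20) . "\n";
}

# Step 3: the thrower discards white space and comments.
my %THROW = map { $_ => 1 } qw(TOK_WS TOK_COMMENT);
sub next_token {
    my ($buf) = @_;
    while (1) {
        my ($tok, $val) = lex($buf);
        return ($tok, $val) unless $THROW{$tok};
    }
}

# Step 4: one subroutine per opening landmark.
my %LANDMARK = ( host => \&parse_host );

sub parse_host {
    my ($buf, $tree) = @_;
    my (undef, $name) = next_token($buf);
    my ($tok) = next_token($buf);
    die "expected '{' after host $name\n" unless $tok eq 'TOK_OPEN';
    my %entry;
    while (1) {
        my ($ktok, $key) = next_token($buf);
        last if $ktok eq 'TOK_CLOSE';            # closing landmark
        die "unterminated host $name\n" if $ktok ne 'TOK_ID';
        my (undef, $value) = next_token($buf);
        $entry{$key} = $value;
    }
    push @$tree, { name => $name, %entry };
}

sub parse {
    my ($text) = @_;
    my @tree;
    while (1) {
        my ($tok, $val) = next_token(\$text);
        last if $tok eq 'TOK_EOF';
        my $handler = $tok eq 'TOK_ID' ? $LANDMARK{$val} : undef;
        die "unexpected '$val'\n" unless $handler;
        $handler->(\$text, \@tree);
    }
    return \@tree;
}

# Step 5: traverse and print the structure.
my $tree = parse($input);
print "$_->{name}: port $_->{port}\n" for @$tree;
```

With the sample input above this prints "alpha: port 80" and "beta: port 8080". Supporting a new kind of section should only need another entry in %LANDMARK and a matching subroutine.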

Update: code example of a lexer

package logparse;

sub new {
    my ($class, $fh) = @_;    # assumption: the caller passes an open filehandle
    return bless { FH  => $fh,
                   LEX => { '\w+' => 'TOK_ID',
                            '[[:punct:]]+' => 'TOK_PUNCT',
                            # and so on for all character classes you identify
                          }}, $class;
}
 
sub lex { 
    my $self = shift;
    my $fh = $self -> { FH };
    $self -> { BUFFER } ||= <$fh> or goto EOF;
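    keys %{ $self -> { LEX }};    # reset the each() iterator left by an earlier early return
    PAT: while ( my ($pat, $tok) = each %{ $self -> { LEX }} ) {
        $self -> { BUFFER } =~ /^($pat)(.*)$/ or next PAT;
        $self -> { BUFFER } = $2;
        $self -> { LEXVAL } = $1;
        return $tok;
    }

    # no pattern matched: consume one character (newline included) so we cannot loop forever
    $self -> { LEXVAL } = substr( $self -> { BUFFER }, 0, 1 );
    $self -> { BUFFER } =~ s/^.//s;
    warn "unhandled content at input line $.\n";
    return '';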

EOF: $self -> { LEXVAL } = '';
     return 'TOK_EOF';
}
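
One thing to watch with a hash-driven lexer like this: each() keeps its position in the hash between calls, so a loop that returns early can resume mid-hash on the next call and skip patterns. The sketch below shows the effect and the usual remedy, calling keys in void context before the loop. The %lex table and the patterns_seen helper are made up for the demonstration.

```perl
use strict;
use warnings;

my %lex = ( '\w+'          => 'TOK_ID',
            '[[:punct:]]+' => 'TOK_PUNCT',
            '\s+'          => 'TOK_WS' );

# Count how many patterns a while/each loop visits, optionally
# resetting the hash iterator first.
sub patterns_seen {
    my ($reset) = @_;
    keys %lex if $reset;          # void-context keys resets the iterator
    my $n = 0;
    while ( my ($pat, $tok) = each %lex ) {
        $n++;
    }
    return $n;
}

# Simulate a lexer that returned on its first match: one pair is
# consumed and the iterator is left pointing at the second.
keys %lex;
my @first = each %lex;
my $without_reset = patterns_seen(0);   # visits only the remaining pairs

keys %lex;
my @again = each %lex;
my $with_reset = patterns_seen(1);      # visits every pair

print "without reset: $without_reset, with reset: $with_reset\n";
```

This prints "without reset: 2, with reset: 3" regardless of hash ordering. An ordered array of pattern and token pairs avoids the issue altogether and also makes the match priority explicit.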

One world, one people

In reply to Re: Perl: Extracting specific text from a .txt file and outputting into a new format by anonymized user 468275
in thread Perl: Extracting specific text from a .txt file and outputting into a new format by ragingwhisky
