1) identify the lexical structure of the material -- can it be multiline, does indentation matter, etc.?
2) create a simple lexical analyser out of a hash of regexes and token names.
3) create a thrower or two that discards whitespace and/or empty lines, comments, etc.
4) create a trivial parser that calls the trivial lexer and thrower and has a subroutine to handle each type of opening landmark (an identifying string in the input). Each subroutine typically loads the section into a suitable structure, or prints it directly once the closing landmark marks the end of the section.
5) if not printing as we go, traverse and print the structure
Update: code example of a lexer
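    package logparse;
    use strict;
    use warnings;

    sub new {
        my ( $class, $fh ) = @_;    # passing the filehandle in here is an assumption
        return bless {
            FH  => $fh,
            LEX => {
                '\w+'          => 'TOK_ID',
                '[[:punct:]]+' => 'TOK_PUNCT',
                # and so on for all character classes you identify
            },
        }, $class;
    }

    sub lex {
        my $self = shift;
        my $fh   = $self->{FH};
        # refill the buffer from the next input line once it is used up
        unless ( defined $self->{BUFFER} and length $self->{BUFFER} ) {
            $self->{BUFFER} = <$fh>;
            unless ( defined $self->{BUFFER} ) {
                $self->{LEXVAL} = '';
                return 'TOK_EOF';
            }
        }
        # keys (not each), so an early return cannot leave a half-used iterator
        for my $pat ( sort keys %{ $self->{LEX} } ) {
            $self->{BUFFER} =~ /^($pat)(.*)$/ or next;
            @{$self}{qw(LEXVAL BUFFER)} = ( $1, $2 );
            return $self->{LEX}{$pat};
        }
        # nothing matched: consume one character (/s so a newline counts too)
        $self->{LEXVAL} = substr( $self->{BUFFER}, 0, 1 );
        $self->{BUFFER} =~ s/^.//s;
        warn "unhandled content at input line $.\n";
        return '';
    }

    1;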
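The lexer above covers step 2. For steps 3 to 5, something along the following lines might do. It is only a rough sketch: the input format (sections opened by `BEGIN name` and closed by `END`, with `#` comments) is made up for illustration, and every name in it (`make_lexer`, `make_thrower`, `parse`, `render`, the `TOK_*` names) is hypothetical, not taken from the thread. It carries its own tiny closure-based lexer so that it runs on its own.

```perl
use strict;
use warnings;

# Ordered [regex, token name] pairs; order matters, so an array, not a hash.
# The BEGIN/END landmarks are an assumed, made-up input format.
my @LEX = (
    [ qr/\s+/,     'TOK_SPACE'   ],
    [ qr/#.*/,     'TOK_COMMENT' ],
    [ qr/BEGIN\b/, 'TOK_BEGIN'   ],
    [ qr/END\b/,   'TOK_END'     ],
    [ qr/\w+/,     'TOK_ID'      ],
);

# Trivial lexer: returns a closure yielding [token, value] pairs.
sub make_lexer {
    my @lines = @_;
    my $buf   = '';
    return sub {
        while ( $buf eq '' ) {
            return [ 'TOK_EOF', '' ] unless @lines;
            $buf = shift @lines;
        }
        for my $rule (@LEX) {
            my ( $re, $tok ) = @$rule;
            return [ $tok, $1 ] if $buf =~ s/\A($re)//;
        }
        $buf =~ s/\A(.)//s;    # unhandled character: consume it and move on
        return [ 'TOK_UNKNOWN', $1 ];
    };
}

# Step 3: the thrower discards tokens the parser never wants to see.
sub make_thrower {
    my ($lexer) = @_;
    my %skip = map { $_ => 1 } qw(TOK_SPACE TOK_COMMENT);
    return sub {
        while (1) {
            my $t = $lexer->();
            return $t unless $skip{ $t->[0] };
        }
    };
}

# Step 4: one handler per landmark; the closing landmark stores the section.
sub parse {
    my ($next) = @_;
    my ( @sections, $current );
    my %handler = (
        TOK_BEGIN => sub {
            my $name = $next->();    # the token after BEGIN names the section
            $current = { name => $name->[1], words => [] };
        },
        TOK_END => sub {
            push @sections, $current if $current;
            undef $current;
        },
        TOK_ID => sub {
            my ($tok) = @_;
            push @{ $current->{words} }, $tok->[1] if $current;
        },
    );
    while ( ( my $t = $next->() )->[0] ne 'TOK_EOF' ) {
        my $h = $handler{ $t->[0] } or next;    # tokens without a handler are ignored
        $h->($t);
    }
    return \@sections;
}

# Step 5: traverse the structure and produce the new format.
sub render {
    my ($sections) = @_;
    return join '', map { "$_->{name}: @{ $_->{words} }\n" } @$sections;
}

my @input = ( "# demo\n", "BEGIN job1\n", "  status ok\n", "END\n",
              "BEGIN job2\n", "status failed\n", "END\n" );
print render( parse( make_thrower( make_lexer(@input) ) ) );
```

With the sample input this prints `job1: status ok` and `job2: status failed`, one per line. Keeping the thrower as a wrapper around the lexer means the parser only ever sees tokens it has a handler for, which is the point of step 3.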
One world, one people
In reply to Re: Perl: Extracting specific text from a .txt file and outputting into a new format
by anonymized user 468275
in thread Perl: Extracting specific text from a .txt file and outputting into a new format
by ragingwhisky