I will show you some different coding of a state machine parser for your example file which contains 3 tables.

Each table has 3 parts: 1)the name of the table, 2)the column definitions of the table, and 3)the data for each row in the table. All 3 of your example tables have these same 3 parts.

The code below cycles through 3 states:

  1. GET_TABLE_NAME
  2. GET_COL_NAMES
  3. GET_DATA
One general method for writing code like this is to decide how you tell when each phase or state of the data capture "ends". It is often a bit easier if the line that signals the "end of a record or phase" is a distinct thing from a line that signals "start of a new record or phase". In your test case, the first 2 tables comply with this, but the third table does not because there is no '------' banner right before the column names (that would signal end of the Table Name capture). I suspect that this was a typo, but I allowed for it anyway(see below).

So anyway, the thinking goes: if we are in Phase 1,2,3 and the line that we just read means that the current Phase has not ended, then we process the current line for the current Phase. Otherwise, the current Phase ends, any "clean-up" is done and the overall state transistions to the next Phase.

Below, the state transitions are 1->2->3->1->2->3->1, etc.

I wrote the code and it worked for the first 2 tables, then I found out that something was odd about table 3. So I used a special technique in Perl to code an exception to the rule of what ends the "finding Table Name" phase.

In Perl the "redo" statement restarts the while (condition){...} loop without re-evaluating the condition. In this case, we see that the COL_NAMES phase has already started. So I just adjust the Phase or State to be 'GET_COL_NAMES' and restart the loop without reading another line. There are of course other ways of accomplishing this same goal. This techniqe just happened to surface at the moment.

I didn't worry about tweaking the splits or regex'es. Often this just doesn't matter as disk I/O is usually the slowest part.

The main thing I wanted to show in this post was a method to section the code into easy to identifiable states or phases. Some details of how each 'state' is handled could be different, but that is not my main point.

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @results=(); # this is Array of Array, # [$table_name, [@data]] # row[0] of @data contains the column names ################ my $state; my $name; my @data; my @col_names; sub start_new_table_entry { $state = 'GET_TABLE_NAME'; $name = ""; @data=(); @col_names=(); } sub finish_current_table { if ($state ne 'GET_TABLE_NAME') { unshift @data,[@col_names]; push @results,[$name,[@data]]; $state = 'GET_TABLE_NAME'; } } start_new_table_entry(); REDO_LINE: while (my $line = <DATA>) { $line =~ s/^\s*//; # delete leading spaces $line =~ s/\s*$//; # delete trailing spaces # (this includes line endings) if ($state eq 'GET_TABLE_NAME') #### TABLE NAME ### { if ($line =~ /^\|/) # premature start of column name state! Who +a! { # special case of malformed table without # a starting banner of --- or _ _ _ # we are already in the column name state! $state = 'GET_COL_NAMES'; redo REDO_LINE; } elsif ($line !~ /^[-_]/) #keep going - normal case { $name = $line if $line ne ""; # get last non blank line befo +re table $name =~ s/\s*\:\s*\d+$//; # cleans up the name (if any) } else { $state = 'GET_COL_NAMES'; } } elsif ($state eq 'GET_COL_NAMES') #### COLUMN NAMES ### { if ($line !~ /(^\|[-_])|(^[-])/ ) #keep going { $line =~ s/^\|\s*//; my @col_name_raw = split /\|/,$line; my $col=0; foreach my $this_col (@col_name_raw) { $this_col =~ s/\s*$//; $this_col =~ s/^\s*//; $col_names[$col]//= ""; $this_col = " $this_col" if ($col_names[$col] ne ""); $col_names[$col++] .= "$this_col"; } } else { $state = "GET_DATA"; } } elsif ($state eq 'GET_DATA') #### DATA ROWS ### { if ( $line =~ /^\|/) #keep going { $line =~ s/^\|\s*//; my @this_data = split /\|/,$line; @this_data = map {s/\s*$//;s/^\s*//;$_}@this_data; push @data,[@this_data]; } else { finish_current_table(); start_new_table_entry(); } } } finish_current_table(); # in case of malformed end of table # dump results in "psuedo" CSV format # also consider looking at: # print Dumper \@results; foreach my $tableref (@results) { my ($name,$dataref) = @$tableref; print "TABLE: '$name'\n"; my $row0 = shift @$dataref; print "COLUMNS: ",join(",",@$row0),"\n"; foreach my $row (@$dataref) { print join(",",@$row),"\n"; } print "\n"; } =PRINTED OUTPUT TABLE: 'place and year data' COLUMNS: no.,name,age,place,year 1,sue,33,NY,2015 2,mark,28,cal,2106 TABLE: 'work and language' COLUMNS: no.,name,languages,proficiency,time taken 1,eliz,English,good,24 hrs 2,susan,Spanish,good,13 hrs 3,danny,Italian,decent,21 hrs TABLE: 'Position log' COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,ne +west val 0,1,302832,-11.88,1,0,Pri,16,0 1,9,302836,11.88,9,0,Pri,10,0 2,1,302832,-11.88,5,3,Pri,14,4 3,3,302833,11.88,1,0,sec,12,0 4,6,302837,-11.88,1,0,Pri,16,3 =cut __DATA__ asdfasdf some trash in the file... $$#@more trash place and year data: 67 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |no.| name | age | place | year | |_ _|_ _ _ _|_ _ _ | _ _ _ | _ _ | |1 | sue |33 | NY | 2015 | |2 | mark |28 | cal | 2106 | some more trash here 123947982374 work and language :65 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |no.| name | languages | proficiency | time taken| |_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ | |1 | eliz | English | good | 24 hrs | |2 | susan| Spanish | good | 13 hrs | |3 | danny| Italian | decent | 21 hrs | Position log | | |Pos |value | |bulk|lot| prev| newest| |# |Locker|(dfg) |(no) |nul|val |Id | val |val | ----------------------------------------------------------- | 0| 1| 302832| -11.88| 1| 0|Pri| 16| 0| | 1| 9| 302836| 11.88| 9| 0|Pri| 10| 0| | 2| 1| 302832| -11.88| 5| 3|Pri| 14| 4| | 3| 3| 302833| 11.88| 1| 0|sec| 12| 0| | 4| 6| 302837| -11.88| 1| 0|Pri| 16| 3|

In reply to Re: Parsing .txt into arrays by Marshall
in thread Parsing .txt into arrays by Fshah

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.