in reply to Parsing .txt into arrays

I will show you one way of coding a state machine parser for your example file, which contains 3 tables.

Each table has 3 parts: 1)the name of the table, 2)the column definitions of the table, and 3)the data for each row in the table. All 3 of your example tables have these same 3 parts.

The code below cycles through 3 states:

  1. GET_TABLE_NAME
  2. GET_COL_NAMES
  3. GET_DATA
One general method for writing code like this is to decide how you can tell when each phase or state of the data capture "ends". It is often a bit easier if the line that signals the "end of a record or phase" is distinct from the line that signals the "start of a new record or phase". In your test case, the first 2 tables comply with this, but the third table does not, because there is no '------' banner right before the column names (which would signal the end of the Table Name capture). I suspect that this was a typo, but I allowed for it anyway (see below).

So anyway, the thinking goes: if we are in Phase 1, 2, or 3 and the line that we just read means that the current Phase has not ended, then we process the current line for the current Phase. Otherwise, the current Phase ends, any "clean-up" is done, and the overall state transitions to the next Phase.

Below, the state transitions are 1->2->3->1->2->3->1, etc.

I wrote the code and it worked for the first 2 tables, then I found out that something was odd about table 3. So I used a special technique in Perl to code an exception to the rule of what ends the "finding Table Name" phase.

In Perl, the "redo" statement restarts the body of the while (condition){...} loop without re-evaluating the condition. In this case, we see that the COL_NAMES phase has already started. So I just adjust the Phase or State to 'GET_COL_NAMES' and restart the loop without reading another line. There are of course other ways of accomplishing this same goal. This technique just happened to come to mind at the time.
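To make the technique concrete, here is a tiny standalone sketch. The two-state "header/body" format here is made up for illustration (it is not your table format); it just shows how redo reprocesses the same line after a state change:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-state parser: lines are either headers ("= title")
# or body text. If a body line shows up while we are still looking for
# a header, we switch states and "redo" so that this same line gets
# processed again as body text (no new line is read).
my @lines = ("= Chapter 1", "some text", "more text");
my $state = 'HEADER';
my @body;

LINE: while (defined(my $line = shift @lines)) {
    if ($state eq 'HEADER') {
        if ($line !~ /^=/) {    # body started early!
            $state = 'BODY';    # adjust the state, then restart the
            redo LINE;          # loop body without reading a new line
        }
        print "header: $line\n";
    }
    else {
        push @body, $line;
    }
}
print "body lines: ", scalar(@body), "\n";   # prints: body lines: 2
```

Because redo skips the "shift @lines" in the loop condition, the line that triggered the state change is not lost.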

I didn't worry about tweaking the splits or regexes. Often this just doesn't matter, as disk I/O is usually the slowest part.

The main thing I wanted to show in this post was a method to section the code into easily identifiable states or phases. Some details of how each 'state' is handled could be done differently, but that is not my main point.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @results = ();   # this is Array of Array,
                    # [$table_name, [@data]]
                    # row[0] of @data contains the column names
################
my $state;
my $name;
my @data;
my @col_names;

sub start_new_table_entry
{
    $state = 'GET_TABLE_NAME';
    $name  = "";
    @data      = ();
    @col_names = ();
}

sub finish_current_table
{
    if ($state ne 'GET_TABLE_NAME')
    {
        unshift @data, [@col_names];
        push @results, [$name, [@data]];
        $state = 'GET_TABLE_NAME';
    }
}

start_new_table_entry();

REDO_LINE: while (my $line = <DATA>)
{
    $line =~ s/^\s*//;   # delete leading spaces
    $line =~ s/\s*$//;   # delete trailing spaces
                         # (this includes line endings)

    if ($state eq 'GET_TABLE_NAME')     #### TABLE NAME ###
    {
        if ($line =~ /^\|/)  # premature start of column name state! Whoa!
        {
            # special case of malformed table without
            # a starting banner of --- or _ _ _
            # we are already in the column name state!
            $state = 'GET_COL_NAMES';
            redo REDO_LINE;
        }
        elsif ($line !~ /^[-_]/)  # keep going - normal case
        {
            $name = $line if $line ne "";  # get last non blank line before table
            $name =~ s/\s*\:\s*\d+$//;     # cleans up the name (if any)
        }
        else
        {
            $state = 'GET_COL_NAMES';
        }
    }
    elsif ($state eq 'GET_COL_NAMES')   #### COLUMN NAMES ###
    {
        if ($line !~ /(^\|[-_])|(^[-])/)  # keep going
        {
            $line =~ s/^\|\s*//;
            my @col_name_raw = split /\|/, $line;
            my $col = 0;
            foreach my $this_col (@col_name_raw)
            {
                $this_col =~ s/\s*$//;
                $this_col =~ s/^\s*//;
                $col_names[$col] //= "";
                $this_col = " $this_col" if ($col_names[$col] ne "");
                $col_names[$col++] .= "$this_col";
            }
        }
        else
        {
            $state = "GET_DATA";
        }
    }
    elsif ($state eq 'GET_DATA')        #### DATA ROWS ###
    {
        if ($line =~ /^\|/)  # keep going
        {
            $line =~ s/^\|\s*//;
            my @this_data = split /\|/, $line;
            @this_data = map { s/\s*$//; s/^\s*//; $_ } @this_data;
            push @data, [@this_data];
        }
        else
        {
            finish_current_table();
            start_new_table_entry();
        }
    }
}
finish_current_table();  # in case of malformed end of table

# dump results in "pseudo" CSV format
# also consider looking at:
# print Dumper \@results;

foreach my $tableref (@results)
{
    my ($name, $dataref) = @$tableref;
    print "TABLE: '$name'\n";
    my $row0 = shift @$dataref;
    print "COLUMNS: ", join(",", @$row0), "\n";
    foreach my $row (@$dataref)
    {
        print join(",", @$row), "\n";
    }
    print "\n";
}

=PRINTED OUTPUT
TABLE: 'place and year data'
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106

TABLE: 'work and language'
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs

TABLE: 'Position log'
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
=cut

__DATA__
asdfasdf some trash in the file...
$$#@more trash
place and year data: 67
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name   | age  | place | year |
|_ _|_ _ _ _|_ _ _ | _ _ _ | _ _  |
|1  | sue    |33    | NY    | 2015 |
|2  | mark   |28    | cal   | 2106 |

some more trash here
123947982374
work and language :65
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name | languages | proficiency | time taken|
|_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
|1  | eliz | English   | good        | 24 hrs    |
|2  | susan| Spanish   | good        | 13 hrs    |
|3  | danny| Italian   | decent      | 21 hrs    |

Position log
|  |      |Pos    |value  |   |bulk|lot| prev| newest|
|# |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
-----------------------------------------------------------
| 0|     1| 302832| -11.88|  1|   0|Pri|   16|      0|
| 1|     9| 302836|  11.88|  9|   0|Pri|   10|      0|
| 2|     1| 302832| -11.88|  5|   3|Pri|   14|      4|
| 3|     3| 302833|  11.88|  1|   0|sec|   12|      0|
| 4|     6| 302837| -11.88|  1|   0|Pri|   16|      3|

Replies are listed 'Best First'.
Re^2: Parsing .txt into arrays
by Fshah (Initiate) on May 26, 2017 at 05:53 UTC
    thank you marshall, I really appreciate your response. My problem now is to extract the same type of tables, let's say the Position log table here, as all the Position log tables have the same format. I want to know the start and end line numbers of these tables; how can I do that? Mind you, I have a huge text file, on the order of GBs. I want to extract similar tables and export them so that I can code them table-specific (also, a table doesn't contain just the table itself; it has some details, here place and time, and I want to store these as well).

    approach: say every Position log table has extension Fp379 and all pages start with the year. I want to use these keywords (Fp379 for the Position log page and 2017 for all pages) to separate out the required tables.

    Likewise, similar pages start with the same extension.
    2017 Position log :Fp379 place: cal time: 23:01:45
    |  |      |Pos    |value  |   |bulk|lot| prev| newest|
    |# |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
    -----------------------------------------------------------
    | 0|     1| 302832| -11.88|  1|   0|Pri|   16|      0|
    | 1|     9| 302836|  11.88|  9|   0|Pri|   10|      0|
    | 2|     1| 302832| -11.88|  5|   3|Pri|   14|      4|
    | 3|     3| 302833|  11.88|  1|   0|sec|   12|      0|
    | 4|     6| 302837| -11.88|  1|   0|Pri|   16|      3|

      Fshah:

      The $. variable contains the line number of the last line read from the most recently accessed filehandle. So you can simply store that value each time you start a new array, and add the starting line number to your table of information. You can read more about the $. variable at perlvar.
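      A minimal sketch of $. in action (the /^TABLE/ header pattern and sample text here are made up for illustration; an in-memory filehandle stands in for the real data file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record the input line number at which each table header appears.
my $text = "junk\nTABLE one\nrow\nTABLE two\n";
open my $fh, '<', \$text or die "open: $!";

my @table_starts;
while (my $line = <$fh>) {
    # $. holds the line number of the last read on the last filehandle
    push @table_starts, $. if $line =~ /^TABLE/;
}
print "tables start at lines: @table_starts\n";   # prints: tables start at lines: 2 4
```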

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      Hi Fshah,
      Ok, for these extra requirements, I modified the GET_TABLE_NAME state to allow for multi-line names instead of just keeping the last non-blank line before the table starts. Keeping track of the line numbers from the original file sounds weird, but I added that info to the $name record using $., the current line number of the input file handle.

      I would recommend just letting the code parse out each table that it encounters. In the finish_current_table() subroutine, decide whether or not you actually want to keep the current table. I just hard-coded a regex of /2017.*?Fp379/, but of course this could be made more flexible. Note that to "keep" a table, I added it to a @results data structure, which I "dumped" right before the program ends. I presume that in the "real code", instead of adding to the @results structure, some export() function would be called to put the table into a DB or make a discrete file in some sort of CSV format? I did not generate strictly conformant CSV (multi-word strings should be quoted).
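      A minimal sketch of that keep/discard decision, pulled out of the parser as a standalone fragment (the maybe_keep_table() helper and the sample names are made up for illustration; the regex is the one hard-coded above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical end-of-table hook: keep a parsed table only when its
# captured name matches both the page year and the table extension.
my @results;

sub maybe_keep_table {
    my ($name, $dataref) = @_;
    push @results, [$name, $dataref] if $name =~ /2017.*?Fp379/;
}

maybe_keep_table('2017 Position log :Fp379 place: cal', [ [0, 1, 302832] ]);
maybe_keep_table('2016 Other log :Fp100',               [ [9, 9, 999999] ]);

print scalar(@results), " table(s) kept\n";   # prints: 1 table(s) kept
```

In the real parser, the same test would live inside finish_current_table(), or you could call an export() function there instead of pushing to @results.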

      From the size of the input file you are describing, it sounds to me like putting these tables into a SQL DB is the right way to go. The Perl DBI is fantastic.
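      For the DB route, here is a minimal DBI sketch (it assumes the DBD::SQLite module is installed; the table layout is just a made-up fragment of the Position log columns, not a full schema):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # needs DBD::SQLite installed for this connect string

# Load a couple of parsed rows into an in-memory SQLite table.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do("CREATE TABLE position_log (id INTEGER, locker INTEGER, pos INTEGER)");

my $sth = $dbh->prepare("INSERT INTO position_log VALUES (?, ?, ?)");
$sth->execute(@$_) for ([0, 1, 302832], [1, 9, 302836]);

my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM position_log");
print "rows loaded: $count\n";
```

With the tables in a DB, "find all Position log tables with extension X" becomes a simple query instead of a re-parse of a multi-GB file.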

      Code:

        thank you Marshall, the code you sent was of great use to me, but the table gets parsed line by line (row-wise), whereas I want arrays of columns so that it will be easy to compare similar columns. I also have a header for the table that I want to store. How can I make this possible? E.g.:

        1) In the table here, I want an array "Locker" which should contain all the values in that column.

        2) Also, in the given table, as you can see, there are blanks, which mean the value is the same as the one previously present in that column. Is it possible to repeat the previous value for the blanks? There is also a header which contains the time etc.

        3) As you can see, there are 11 rows here; I want an array which has the time repeated 11 times (the number of rows), and similarly for sequence and range.

        4) I want to use the keyword 1349F.63 here to find the similar tables (there are other tables with the heading "position log table" but with a different extension).

        5) From the first line, I want to extract the 4th value, i.e. in this case 1349F.63.

        6) I see you are using the last line before the table starts; say I want to look at the 13th line before the table to decide which particular table I want to store (and also store those 13 header lines in the format mentioned above).

        7) I don't want to print all the tables; I want to print only the tables which have the keyword, say "1349F.63" in this case, i.e. print all Position log tables corresponding to that extension.

        Position log table 1349F.63 time 10:23:66 sequence = 39 range = 6678
        |  |      |Pos    |value  |   |bulk|lot| prev| newest|
        |# |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
        -----------------------------------------------------------
        | 0|     1| 302832| -11.88|  1|   0|Pri|   16|      0|
        | 5|      |       |       |   |    |   |     |       |
        | 6|      |       |       |   |    |   |     |       |
        | 7|      |       |       |   |    |   |     |       |
        | 1|     9| 302836|  11.88|  9|   0|Pri|   10|      0|
        | 2|     1| 302832| -11.88|  5|   3|Pri|   14|      4|
        | 5|      |       |       |   |    |   |     |       |
        | 6|      |       |       |   |    |   |     |       |
        | 7|      |       |       |   |    |   |     |       |
        | 3|     3| 302833|  11.88|  1|   0|sec|   12|      0|
        | 4|     6| 302837| -11.88|  1|   0|Pri|   16|      3|
        thanks for the help