Each table has 3 parts: 1) the name of the table, 2) the column definitions of the table, and 3) the data for each row in the table. All 3 of your example tables have these same 3 parts.
The code below cycles through 3 states, one for each part: GET_TABLE_NAME, GET_COL_NAMES, and GET_DATA.
So anyway, the thinking goes: if we are in Phase 1, 2, or 3 and the line that we just read does not end the current Phase, then we process that line for the current Phase. Otherwise, the current Phase ends, any "clean-up" is done, and the overall state transitions to the next Phase.
Below, the state transitions are 1->2->3->1->2->3->1, etc.
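As a minimal sketch of just that cycle, separate from any parsing: the lookup table `%next_state` and the helper `trace_states` are names made up for this illustration and are not part of the full script.

```perl
use strict;
use warnings;

# Each state names its successor; the last one wraps back to the first.
# (%next_state is a hypothetical lookup table, not used in the full script.)
my %next_state = (
    GET_TABLE_NAME => 'GET_COL_NAMES',
    GET_COL_NAMES  => 'GET_DATA',
    GET_DATA       => 'GET_TABLE_NAME',
);

# Hypothetical helper: list the states visited over $steps transitions.
sub trace_states {
    my ($steps) = @_;
    my $state = 'GET_TABLE_NAME';
    my @trace = ($state);
    for (1 .. $steps) {
        $state = $next_state{$state};
        push @trace, $state;
    }
    return @trace;
}

print join(' -> ', trace_states(6)), "\n";
```

In the real parser a transition only happens when a line signals the end of the current phase; here every step transitions, just to show the order.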
I wrote the code and it worked for the first 2 tables, then I found out that something was odd about table 3: it has no banner line of --- or _ _ _ above its column names. So I used a special technique in Perl to code an exception to the rule of what ends the "finding Table Name" phase.
In Perl the "redo" statement restarts the while (condition){...} loop without re-evaluating the condition. In this case, a line starting with | while we are still looking for the table name means that the COL_NAMES phase has already started. So I just adjust the Phase or State to be 'GET_COL_NAMES' and restart the loop without reading another line. There are of course other ways of accomplishing this same goal. This technique just happened to surface at the moment.
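Here is a small stand-alone sketch of that "redo" behavior. The two states 'HEADER' and 'BODY' and the helper `classify_lines` are invented for this demo; it reads from a list instead of a file handle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a log of which state handled each line.
sub classify_lines {
    my @lines = @_;
    my $state = 'HEADER';
    my @log;
    LINE: while (defined(my $line = shift @lines)) {
        if ($state eq 'HEADER') {
            if ($line =~ /^\|/) {
                # This line already belongs to the next phase: switch state
                # and re-run the loop body on the SAME line. The while
                # condition is not re-evaluated, so nothing is shifted off.
                $state = 'BODY';
                redo LINE;
            }
            push @log, "HEADER: $line";
        }
        else {
            push @log, "BODY: $line";
        }
    }
    return @log;
}

print "$_\n" for classify_lines('title', '| a | b |', '| 1 | 2 |');
# prints:
# HEADER: title
# BODY: | a | b |
# BODY: | 1 | 2 |
```

Without the redo, the first "|" line would be consumed by the state change and never reach the BODY branch.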
I didn't worry about tweaking the splits or regexes. Often this just doesn't matter, as disk I/O is usually the slowest part.
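If you did want to tidy the splitting, one possible variant (a sketch only, not what the full script does) is to let the split pattern eat the padding around each bar. The helper name `split_row` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "| a | b |" row into trimmed cells.
sub split_row {
    my ($line) = @_;
    $line =~ s/^\s*\|\s*//;    # drop the leading bar and its padding
    $line =~ s/\s*\|\s*$//;    # drop the trailing bar and its padding
    # The -1 limit keeps empty trailing cells instead of discarding them.
    return split /\s*\|\s*/, $line, -1;
}

print join(",", split_row('| 0|     1|  302832|  -11.88|')), "\n";
# prints: 0,1,302832,-11.88
```

This trims every cell in one pass, so the separate per-cell s/// cleanup is not needed.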
The main thing I wanted to show in this post was a method to section the code into easily identifiable states or phases. Some details of how each 'state' is handled could be different, but that is not my main point.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @results=();  # this is Array of Array,
                 # [$table_name, [@data]]
                 # row[0] of @data contains the column names
################
my $state;
my $name;
my @data;
my @col_names;

sub start_new_table_entry
{
    $state = 'GET_TABLE_NAME';
    $name  = "";
    @data=();
    @col_names=();
}

sub finish_current_table
{
    if ($state ne 'GET_TABLE_NAME')
    {
        unshift @data,[@col_names];
        push @results,[$name,[@data]];
        $state = 'GET_TABLE_NAME';
    }
}

start_new_table_entry();

REDO_LINE:
while (my $line = <DATA>)
{
    $line =~ s/^\s*//;   # delete leading spaces
    $line =~ s/\s*$//;   # delete trailing spaces
                         # (this includes line endings)

    if ($state eq 'GET_TABLE_NAME')       #### TABLE NAME ###
    {
        if ($line =~ /^\|/)  # premature start of column name state! Whoa!
        {
            # special case of malformed table without
            # a starting banner of --- or _ _ _
            # we are already in the column name state!
            $state = 'GET_COL_NAMES';
            redo REDO_LINE;
        }
        elsif ($line !~ /^[-_]/)  #keep going - normal case
        {
            $name = $line if $line ne "";  # get last non blank line before table
            $name =~ s/\s*\:\s*\d+$//;     # cleans up the name (if any)
        }
        else
        {
            $state = 'GET_COL_NAMES';
        }
    }
    elsif ($state eq 'GET_COL_NAMES')     #### COLUMN NAMES ###
    {
        if ($line !~ /(^\|[-_])|(^[-])/ )  #keep going
        {
            $line =~ s/^\|\s*//;
            my @col_name_raw = split /\|/,$line;
            my $col=0;
            foreach my $this_col (@col_name_raw)
            {
                $this_col =~ s/\s*$//;
                $this_col =~ s/^\s*//;
                $col_names[$col] //= "";
                $this_col = " $this_col" if ($col_names[$col] ne "");
                $col_names[$col++] .= "$this_col";
            }
        }
        else
        {
            $state = "GET_DATA";
        }
    }
    elsif ($state eq 'GET_DATA')          #### DATA ROWS ###
    {
        if ( $line =~ /^\|/)  #keep going
        {
            $line =~ s/^\|\s*//;
            my @this_data = split /\|/,$line;
            @this_data = map {s/\s*$//;s/^\s*//;$_}@this_data;
            push @data,[@this_data];
        }
        else
        {
            finish_current_table();
            start_new_table_entry();
        }
    }
}
finish_current_table();  # in case of malformed end of table

# dump results in "pseudo" CSV format
# also consider looking at:
# print Dumper \@results;

foreach my $tableref (@results)
{
    my ($name,$dataref) = @$tableref;
    print "TABLE: '$name'\n";
    my $row0 = shift @$dataref;
    print "COLUMNS: ",join(",",@$row0),"\n";
    foreach my $row (@$dataref)
    {
        print join(",",@$row),"\n";
    }
    print "\n";
}

=PRINTED OUTPUT

TABLE: 'place and year data'
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106

TABLE: 'work and language'
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs

TABLE: 'Position log'
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3

=cut

__DATA__
asdfasdf
some trash in the file...
$$#@more trash
place and year data: 67
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name  | age | place | year |
|_ _|_ _ _ _|_ _ _ | _ _ _ | _ _ |
|1  | sue   |33   | NY    | 2015 |
|2  | mark  |28   | cal   | 2106 |
some more trash here
123947982374
work and language :65
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name | languages | proficiency | time taken|
|_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
|1  | eliz | English   | good        | 24 hrs    |
|2  | susan| Spanish   | good        | 13 hrs    |
|3  | danny| Italian   | decent      | 21 hrs    |

Position log
|  |      |Pos     |value   |   |bulk|lot| prev| newest|
|# |Locker|(dfg)   |(no)    |nul|val |Id | val |val    |
-----------------------------------------------------------
| 0|     1|  302832|  -11.88|  1|   0|Pri|   16|      0|
| 1|     9|  302836|   11.88|  9|   0|Pri|   10|      0|
| 2|     1|  302832|  -11.88|  5|   3|Pri|   14|      4|
| 3|     3|  302833|   11.88|  1|   0|sec|   12|      0|
| 4|     6|  302837|  -11.88|  1|   0|Pri|   16|      3|
In reply to Re: Parsing .txt into arrays
by Marshall
in thread Parsing .txt into arrays
by Fshah