Fshah has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have a large log file in .txt format which has several pages (each page has a different format, with different columns pertaining to different fields). I want to separate each column into an array so that it can be processed later.

place and year data: 67
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name  | age  | place | year |
|_ _|_ _ _ _|_ _ _ | _ _ _ | _ _  |
|1  | sue   |33    | NY    | 2015 |
|2  | mark  |28    | cal   | 2106 |

work and language :65
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|no.| name | languages | proficiency | time taken|
|_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
|1  | eliz | English   | good        | 24 hrs    |
|2  | susan| Spanish   | good        | 13 hrs    |
|3  | danny| Italian   | decent      | 21 hrs    |

Position log
|   |      |Pos    |value  |   |bulk|lot| prev| newest|
|#  |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
-----------------------------------------------------------
|  0|     1| 302832| -11.88|  1|  0|Pri|  16|  0|
|  1|     9| 302836|  11.88|  9|  0|Pri|  10|  0|
|  2|     1| 302832| -11.88|  5|  3|Pri|  14|  4|
|  3|     3| 302833|  11.88|  1|  0|sec|  12|  0|
|  4|     6| 302837| -11.88|  1|  0|Pri|  16|  3|

I want these columns in arrays, each named as in the table and containing the values shown in the table. Thank you.

Replies are listed 'Best First'.
Re: Parsing .txt into arrays
by hippo (Archbishop) on May 24, 2017 at 09:51 UTC

    There seem to be three tasks here:

    1. Split the file up into its constituent "pages" so that each set of data can be parsed and processed separately.
    2. Split each row up into its constituent "columns"
    3. Push each datapoint onto the correct array

    Which of these is giving you trouble? What have you tried? How did it fail? An SSCCE is always welcome.
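    A minimal sketch of those three steps (an illustration only, not Fshah's actual data handling; it assumes each page starts with a known heading line and that rows are '|'-delimited):

```perl
use strict;
use warnings;

# Step 1: collect lines under the page heading they belong to.
my %pages;
my $current = 'preamble';
while (my $line = <DATA>) {
    chomp $line;
    $current = $1
        if $line =~ /(place and year data|work and language|Position log)/;
    push @{ $pages{$current} }, $line;
}

# Steps 2 and 3: split the rows of one page into columns and file each
# datapoint under its column name.
my @names = ('no.', 'name', 'age', 'place', 'year');
my %columns;
for my $line (@{ $pages{'place and year data'} || [] }) {
    next unless $line =~ /^\|\s*\d/;      # keep only data rows
    my @cells = split /\|/, $line;
    shift @cells;                         # drop the empty field before the first '|'
    s/^\s+|\s+$//g for @cells;            # trim whitespace
    push @{ $columns{ $names[$_] } }, $cells[$_] for 0 .. $#cells;
}
print "@{ $columns{name} }\n";            # prints: sue mark
__DATA__
place and year data: 67
|no.| name  | age | place | year |
|1  | sue   |33   | NY    | 2015 |
|2  | mark  |28   | cal   | 2106 |
```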

      That's right, hippo.

      I'm still stuck on 1. First, I want to split this txt file up into pages (each page pertaining to a specific format). Similar pages share the same keywords, e.g. the work and language table always has the same format (same columns, although not the same number of rows). I want to search for these tables in the entire txt file using the keyword "work and languages" and populate each work and languages table into a different array. How can I go about this?

      In this entire description I've used "pages" and "tables" interchangeably.

        So let's focus on hippo's step 1 first, since it is the key to the next steps.

        In some ways this is easy, and in other ways it will be hard.

        If I were to do this based on what you have shown us, I would start with a cue you have already identified: one table begins with "work and language :65", another begins with "place and year data: 67", and yet another with "Position log".

        The code below takes that concept and uses what's sometimes called a state machine to separate the lines into table parts. I then kept going and parsed all the data into a hash of arrays of hashes. I realize it's not quite the output style you wanted, but it shows a lot of the techniques, and you could modify it to get what you want.

        use strict;
        use warnings;

        my $state = '';
        my %tables;
        my @titles;

        while (my $line = <DATA>) {
            chomp $line;
            if    (-1 != index($line, 'place and year data: 67')) { $state = 'place';    @titles = (); }
            elsif (-1 != index($line, 'work and language :65'))   { $state = 'work';     @titles = (); }
            elsif (-1 != index($line, 'Position log'))            { $state = 'position'; @titles = (); }
            elsif (-1 != index($line, '|')) {
                $line =~ s/^\s*//;   # take off leading spaces
                $line =~ s/\s*$//;   # take off trailing spaces
                $line =~ s/^\|//;    # take off leading bar
                my @thisset = split('\|', $line);
                for my $part (@thisset) {
                    $part =~ s/_//g;    # remove any underscores
                    $part =~ s/^\s*//;  # take off leading spaces
                    $part =~ s/\s*$//;  # take off trailing spaces
                }
                unless ($thisset[0] =~ m/^\d+$/) {   # if first not digits this is a title part
                    my $ix = 0;
                    for my $part (@thisset) {
                        if ($part ne '') {
                            unless (defined($titles[$ix])) { $titles[$ix] = $part; }
                            else                           { $titles[$ix] .= ' ' . $part; }
                        }
                        $ix++;
                    } # part
                } # not digits
                else {   # first is digits so this is data
                    my %hashpart;
                    my $ix = 0;
                    for my $part (@thisset) {
                        $hashpart{$titles[$ix]} = $part;
                        $ix++;
                    } # part
                    push @{ $tables{$state} }, \%hashpart;
                } # digits
            } # not sep
        } # line

        use Data::Dumper;
        print Dumper(\%tables);

        __DATA__
        place and year data: 67
         _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
        |no.| name  | age  | place | year |
        |_ _|_ _ _ _|_ _ _ | _ _ _ | _ _  |
        |1  | sue   |33    | NY    | 2015 |
        |2  | mark  |28    | cal   | 2106 |
        work and language :65
         _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
        |no.| name | languages | proficiency | time taken|
        |_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
        |1  | eliz | English   | good        | 24 hrs    |
        |2  | susan| Spanish   | good        | 13 hrs    |
        |3  | danny| Italian   | decent      | 21 hrs    |
        Position log
        |   |      |Pos    |value  |   |bulk|lot| prev| newest|
        |#  |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
        -----------------------------------------------------------
        |  0|     1| 302832| -11.88|  1|  0|Pri|  16|  0|
        |  1|     9| 302836|  11.88|  9|  0|Pri|  10|  0|
        |  2|     1| 302832| -11.88|  5|  3|Pri|  14|  4|
        |  3|     3| 302833|  11.88|  1|  0|sec|  12|  0|
        |  4|     6| 302837| -11.88|  1|  0|Pri|  16|  3|
        Result
        $VAR1 = {
          'work' => [
            { 'no.' => '1', 'name' => 'eliz',  'languages' => 'English', 'proficiency' => 'good',   'time taken' => '24 hrs' },
            { 'no.' => '2', 'name' => 'susan', 'languages' => 'Spanish', 'proficiency' => 'good',   'time taken' => '13 hrs' },
            { 'no.' => '3', 'name' => 'danny', 'languages' => 'Italian', 'proficiency' => 'decent', 'time taken' => '21 hrs' }
          ],
          'place' => [
            { 'no.' => '1', 'name' => 'sue',  'age' => '33', 'place' => 'NY',  'year' => '2015' },
            { 'no.' => '2', 'name' => 'mark', 'age' => '28', 'place' => 'cal', 'year' => '2106' }
          ],
          'position' => [
            { '#' => '0', 'Locker' => '1', 'Pos (dfg)' => '302832', 'value (no)' => '-11.88', 'nul' => '1', 'bulk val' => '0', 'lot Id' => 'Pri', 'prev val' => '16', 'newest val' => '0' },
            { '#' => '1', 'Locker' => '9', 'Pos (dfg)' => '302836', 'value (no)' => '11.88',  'nul' => '9', 'bulk val' => '0', 'lot Id' => 'Pri', 'prev val' => '10', 'newest val' => '0' },
            { '#' => '2', 'Locker' => '1', 'Pos (dfg)' => '302832', 'value (no)' => '-11.88', 'nul' => '5', 'bulk val' => '3', 'lot Id' => 'Pri', 'prev val' => '14', 'newest val' => '4' },
            { '#' => '3', 'Locker' => '3', 'Pos (dfg)' => '302833', 'value (no)' => '11.88',  'nul' => '1', 'bulk val' => '0', 'lot Id' => 'sec', 'prev val' => '12', 'newest val' => '0' },
            { '#' => '4', 'Locker' => '6', 'Pos (dfg)' => '302837', 'value (no)' => '-11.88', 'nul' => '1', 'bulk val' => '0', 'lot Id' => 'Pri', 'prev val' => '16', 'newest val' => '3' }
          ]
        };

Re: Parsing .txt into arrays
by Eily (Monsignor) on May 24, 2017 at 09:42 UTC

    Sample data would have been welcome. See How do I post a question effectively?.

    Either your columns are separated by some given delimiter (e.g. a comma), in which case you can use Text::CSV, or your columns are aligned at fixed positions, in which case unpack can do the trick (although there might be a module that does it better):

    use Data::Dump qw(pp);
    pp unpack "A5 A6 A2 A*", "Hi   Hello World";
    __DATA__
    ("Hi", "Hello", "Wo", "rld")
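    For the delimiter case, a minimal Text::CSV sketch (an illustration only, not Eily's code; the filename log.csv is made up, and the module must be installed):

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'log.csv' or die "Cannot open log.csv: $!";
while (my $row = $csv->getline($fh)) {
    # $row is an arrayref of the column values on this line
    print "second column: $row->[1]\n";
}
close $fh;
```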

Re: Parsing .txt into arrays
by Marshall (Canon) on May 25, 2017 at 08:53 UTC
    I will show you a somewhat different coding of a state machine parser for your example file, which contains 3 tables.

    Each table has 3 parts: 1) the name of the table, 2) the column definitions of the table, and 3) the data for each row in the table. All 3 of your example tables have these same 3 parts.

    The code below cycles through 3 states:

    1. GET_TABLE_NAME
    2. GET_COL_NAMES
    3. GET_DATA
    One general method for writing code like this is to decide how you tell when each phase or state of the data capture "ends". It is often a bit easier if the line that signals the "end of a record or phase" is distinct from the line that signals the "start of a new record or phase". In your test case, the first two tables comply with this, but the third table does not, because there is no '------' banner right before the column names (which would signal the end of the table-name capture). I suspect that this was a typo, but I allowed for it anyway (see below).

    So anyway, the thinking goes: if we are in Phase 1, 2, or 3 and the line that we just read means that the current Phase has not ended, then we process the current line for the current Phase. Otherwise, the current Phase ends, any "clean-up" is done, and the overall state transitions to the next Phase.

    Below, the state transitions are 1->2->3->1->2->3->1, etc.

    I wrote the code and it worked for the first 2 tables, then I found out that something was odd about table 3. So I used a special technique in Perl to code an exception to the rule of what ends the "finding Table Name" phase.

    In Perl, the "redo" statement restarts the while (condition) {...} loop body without re-evaluating the condition. In this case, we see that the COL_NAMES phase has already started, so I just set the state to 'GET_COL_NAMES' and restart the loop without reading another line. There are of course other ways of accomplishing the same goal; this technique just happened to surface at the moment.
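    A stripped-down illustration of that redo technique (not the full parser; the data is made up):

```perl
use strict;
use warnings;

my $state = 'GET_TABLE_NAME';
LINE:
while (my $line = <DATA>) {
    if ($state eq 'GET_TABLE_NAME' && $line =~ /^\|/) {
        # A column-name row arrived with no banner line before it:
        # change state and re-process this same line without reading ahead.
        $state = 'GET_COL_NAMES';
        redo LINE;
    }
    print "[$state] $line";
}
__DATA__
|no.| name |
|1  | sue  |
```

    Both lines are printed in the GET_COL_NAMES state, because the first line was re-processed after the state change instead of being consumed by GET_TABLE_NAME.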

    I didn't worry about tweaking the splits or regexes. Often this just doesn't matter, as disk I/O is usually the slowest part.

    The main thing I wanted to show in this post is a method to section the code into easily identifiable states or phases. Some details of how each state is handled could be done differently, but that is not my main point.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my @results = ();   # this is Array of Array,
                        # [$table_name, [@data]]
                        # row[0] of @data contains the column names
    ################
    my $state;
    my $name;
    my @data;
    my @col_names;

    sub start_new_table_entry
    {
        $state = 'GET_TABLE_NAME';
        $name  = "";
        @data      = ();
        @col_names = ();
    }

    sub finish_current_table
    {
        if ($state ne 'GET_TABLE_NAME')
        {
            unshift @data, [@col_names];
            push @results, [$name, [@data]];
            $state = 'GET_TABLE_NAME';
        }
    }

    start_new_table_entry();
    REDO_LINE:
    while (my $line = <DATA>)
    {
        $line =~ s/^\s*//;   # delete leading spaces
        $line =~ s/\s*$//;   # delete trailing spaces
                             # (this includes line endings)
        if ($state eq 'GET_TABLE_NAME')    #### TABLE NAME ###
        {
            if ($line =~ /^\|/)   # premature start of column name state! Whoa!
            {
                # special case of malformed table without
                # a starting banner of --- or _ _ _
                # we are already in the column name state!
                $state = 'GET_COL_NAMES';
                redo REDO_LINE;
            }
            elsif ($line !~ /^[-_]/)   # keep going - normal case
            {
                $name = $line if $line ne "";   # get last non blank line before table
                $name =~ s/\s*\:\s*\d+$//;      # cleans up the name (if any)
            }
            else
            {
                $state = 'GET_COL_NAMES';
            }
        }
        elsif ($state eq 'GET_COL_NAMES')    #### COLUMN NAMES ###
        {
            if ($line !~ /(^\|[-_])|(^[-])/)   # keep going
            {
                $line =~ s/^\|\s*//;
                my @col_name_raw = split /\|/, $line;
                my $col = 0;
                foreach my $this_col (@col_name_raw)
                {
                    $this_col =~ s/\s*$//;
                    $this_col =~ s/^\s*//;
                    $col_names[$col] //= "";
                    $this_col = " $this_col" if ($col_names[$col] ne "");
                    $col_names[$col++] .= "$this_col";
                }
            }
            else
            {
                $state = "GET_DATA";
            }
        }
        elsif ($state eq 'GET_DATA')    #### DATA ROWS ###
        {
            if ($line =~ /^\|/)   # keep going
            {
                $line =~ s/^\|\s*//;
                my @this_data = split /\|/, $line;
                @this_data = map { s/\s*$//; s/^\s*//; $_ } @this_data;
                push @data, [@this_data];
            }
            else
            {
                finish_current_table();
                start_new_table_entry();
            }
        }
    }
    finish_current_table();   # in case of malformed end of table

    # dump results in "pseudo" CSV format
    # also consider looking at:
    # print Dumper \@results;
    foreach my $tableref (@results)
    {
        my ($name, $dataref) = @$tableref;
        print "TABLE: '$name'\n";
        my $row0 = shift @$dataref;
        print "COLUMNS: ", join(",", @$row0), "\n";
        foreach my $row (@$dataref)
        {
            print join(",", @$row), "\n";
        }
        print "\n";
    }

    =PRINTED OUTPUT
    TABLE: 'place and year data'
    COLUMNS: no.,name,age,place,year
    1,sue,33,NY,2015
    2,mark,28,cal,2106

    TABLE: 'work and language'
    COLUMNS: no.,name,languages,proficiency,time taken
    1,eliz,English,good,24 hrs
    2,susan,Spanish,good,13 hrs
    3,danny,Italian,decent,21 hrs

    TABLE: 'Position log'
    COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
    0,1,302832,-11.88,1,0,Pri,16,0
    1,9,302836,11.88,9,0,Pri,10,0
    2,1,302832,-11.88,5,3,Pri,14,4
    3,3,302833,11.88,1,0,sec,12,0
    4,6,302837,-11.88,1,0,Pri,16,3
    =cut

    __DATA__
    asdfasdf some trash in the file...
    $$#@more trash
    place and year data: 67
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    |no.| name  | age  | place | year |
    |_ _|_ _ _ _|_ _ _ | _ _ _ | _ _  |
    |1  | sue   |33    | NY    | 2015 |
    |2  | mark  |28    | cal   | 2106 |
    some more trash here
    123947982374
    work and language :65
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    |no.| name | languages | proficiency | time taken|
    |_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
    |1  | eliz | English   | good        | 24 hrs    |
    |2  | susan| Spanish   | good        | 13 hrs    |
    |3  | danny| Italian   | decent      | 21 hrs    |
    Position log
    |   |      |Pos    |value  |   |bulk|lot| prev| newest|
    |#  |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
    -----------------------------------------------------------
    |  0|     1| 302832| -11.88|  1|  0|Pri|  16|  0|
    |  1|     9| 302836|  11.88|  9|  0|Pri|  10|  0|
    |  2|     1| 302832| -11.88|  5|  3|Pri|  14|  4|
    |  3|     3| 302833|  11.88|  1|  0|sec|  12|  0|
    |  4|     6| 302837| -11.88|  1|  0|Pri|  16|  3|
      Thank you Marshall, I really appreciate your response. My problem now is to extract all tables of the same type, say the Position log tables here. As all the Position log tables have the same format, I want to know the start and end line numbers of these tables; how can I do that? Mind you, I have a huge text file, on the order of GBs. I want to extract similar tables and export them so that I can write table-specific code. (Also, a page doesn't contain just a table; it has some details as well, here place and time, and I want to store these too.)

      Approach: say every Position log table has the extension Fp379 and all pages start with the year. I want to use these keywords (Fp379 for the Position log page and 2017 for all pages) to separate the required tables.

      Likewise, similar pages start with the same extension:
      2017 Position log :Fp379
      place: cal
      time: 23:01:45
      |   |      |Pos    |value  |   |bulk|lot| prev| newest|
      |#  |Locker|(dfg)  |(no)   |nul|val |Id | val |val    |
      -----------------------------------------------------------
      |  0|     1| 302832| -11.88|  1|  0|Pri|  16|  0|
      |  1|     9| 302836|  11.88|  9|  0|Pri|  10|  0|
      |  2|     1| 302832| -11.88|  5|  3|Pri|  14|  4|
      |  3|     3| 302833|  11.88|  1|  0|sec|  12|  0|
      |  4|     6| 302837| -11.88|  1|  0|Pri|  16|  3|

        Fshah:

        The $. variable contains the current line number of the most recently accessed filehandle. So you can simply store its value each time you start a new array, and add the starting line number to your table of information. You can read more about the $. variable in perlvar.
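        A tiny sketch of that idea (the data and the match pattern are made up):

```perl
use strict;
use warnings;

my @table_starts;   # line numbers where a Position log table begins
while (my $line = <DATA>) {
    # $. holds the current line number of the filehandle just read
    push @table_starts, $. if $line =~ /Position log/;
}
print "starts at line(s): @table_starts\n";   # prints: starts at line(s): 2
__DATA__
some trash
2017 Position log :Fp379
| 0| 1| 302832| -11.88|
```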

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        Hi Fshah,
        OK, for these extra requirements, I modified the GET_NAME state to allow for multi-line names instead of just keeping the last non-blank line before the table starts. Keeping track of the line numbers from the original file sounds weird, but I added that info to the $name record using $., the current line number of the current filehandle.

        I would recommend just letting the code parse out each table that it encounters, and then, in the finish_current_table() subroutine, deciding whether or not you actually want to keep the current table. I just hard-coded a regex, /2017.*?Fp379/, but of course this could be made more flexible. Note that to "keep" a table, I added it to a @results data structure, which I "dumped" right before the program ends. I would presume that in the real code, instead of adding to the @results structure, some export() function would be called to put the table into a DB or to make a discrete file in some sort of CSV format? I did not generate strictly conformant CSV (multi-word strings should be quoted).

        From the size of the input file you are describing, it sounds to me like putting these tables into a SQL DB is the right way to go. The Perl DBI is fantastic.
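        A minimal sketch of that route with DBI and SQLite (the database file, table, and column names are all made up; DBD::SQLite must be installed):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=tables.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS position_log (
        pos     INTEGER,
        locker  INTEGER,
        pos_dfg INTEGER,
        value   REAL
    )
});

# In the real code, execute() would be called once per parsed data row.
my $sth = $dbh->prepare(
    'INSERT INTO position_log (pos, locker, pos_dfg, value) VALUES (?, ?, ?, ?)');
$sth->execute(0, 1, 302832, -11.88);
$dbh->disconnect;
```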

        Code:

Re: Parsing .txt into arrays
by karthiknix (Sexton) on May 24, 2017 at 10:44 UTC

    - First collect the data of one page and insert it into an array, with the newline character as the delimiter. Likewise, you will have a number of arrays equal to the number of pages in the log file.

    - Use the map function to traverse each array, and use pattern matching to split off a column value and push it into an array via the regex special variables.

     map { if (/(.*)<Column Delimiter>(.*)/ig) { <your code to assign each value into an array> } } @arr

    - Name your arrays for each column based on its values, so you can identify them later in the code. Now you have the data sorted into per-column arrays.
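    A concrete version of that map sketch, assuming '|' as the column delimiter (the row data and array names are made up):

```perl
use strict;
use warnings;

my @arr = ('|1 | sue  |33 |', '|2 | mark |28 |');
my (@no, @name, @age);
map {
    if (/\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|/) {
        push @no,   $1;   # captured by the regex special variables
        push @name, $2;
        push @age,  $3;
    }
} @arr;
print "@name\n";   # prints: sue mark
```

    (A for loop would be more idiomatic than map in void context, but this mirrors the suggestion above.)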

Re: Parsing .txt into arrays
by thanos1983 (Parson) on May 24, 2017 at 11:24 UTC

    Hello Fshah

    Welcome to the monastery gates and into the beautiful world of Perl.

    Given your sample of data, this looks like DB data. Are you sure you want to export it into text files?

    Through a very simple script you can export only the required fields and data from your database. Is this something that you would like?

    It is very common that all of us try to do something that is at times more complicated than it needs to be. Try to describe in simple words how your data gets populated into your txt files; is it through a database? If that is the case, then you should not be doing something like this. :D

    Just describe what you are trying to do. Monks here are very friendly and very, very experienced; they/we will try to assist as much as possible.

    Looking forward to your update.

    Hope this helps.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
      hello thanos1983,

      This is sample data that I made up based on the format of the original data. The original data is simply a txt file from an unknown source, so all I have is this txt file. I want to populate my arrays, with the name being the heading appearing in the table and the contents being the column pertaining to that heading. E.g. for the work and language table, I want an array (with the name "languages") and its contents as

      $languages[1] = "english";
      $languages[2] = "spanish";
      $languages[3] = "italian";
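      One way to build such an array from parsed rows (a sketch with hard-coded stand-ins for whatever the parser extracted; the 1-based indexing follows the example above):

```perl
use strict;
use warnings;

my @rows = (
    { 'no.' => 1, name => 'eliz',  languages => 'English' },
    { 'no.' => 2, name => 'susan', languages => 'Spanish' },
    { 'no.' => 3, name => 'danny', languages => 'Italian' },
);
my @languages;
$languages[ $_->{'no.'} ] = lc $_->{languages} for @rows;
print "$languages[1] $languages[2] $languages[3]\n";   # english spanish italian
```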

Re: Parsing .txt into arrays
by Anonymous Monk on May 24, 2017 at 17:58 UTC
    Someone is paying you to do this?
      No!! my personal interest.