in reply to Re^6: Parsing .txt into arrays
in thread Parsing .txt into arrays

Take another look at the code I wrote at Re^3: Parsing .txt into arrays. The variable "$name" is a string variable that contains all of the table "name" lines. If you only want some of the lines in that variable, a Perl regex (regular expression) is probably your solution. See: http://www.perlmonks.org/index.pl/Tutorials#Pattern-Matching-Regular-Expressions-and-Parsing
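A minimal sketch of that idea, assuming $name holds the title lines joined by newlines. The helper name select_name_lines and the sample value are made up for illustration and are not part of the code at Re^3:

```perl
use strict;
use warnings;

# Hypothetical helper: return only those lines of a multi-line string
# that match the given pattern.
sub select_name_lines {
    my ($name, $pattern) = @_;
    return grep { /$pattern/ } split /\n/, $name;
}

# Assumed sample value for $name: several title lines in one string.
my $name = join "\n",
    'Position log',
    'table 1349F.63',
    'time 10:23:66',
    'sequence = 39',
    'range = 6678';

# Keep only the "key = value" lines.
my @wanted = select_name_lines($name, qr/=/);
print "$_\n" for @wanted;
```

Any pattern can be passed in the same way, for example qr/^time/ to keep only the time line.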

Re^8: Parsing .txt into arrays
by Fshah (Initiate) on Jun 12, 2017 at 04:37 UTC
    Hi Marshall, I was hoping not to ask for your help unless it was absolutely necessary. I figured out how to transpose and replicate values for empty columns, but when I put the code you modified for me to the test, I ran into a few practical issues with data like this:
    2017 Position log :Fp379 place: cal time: 23:01:45
    |  |      |Pos   |value |   |bulk|lot| prev| newest|
    |# |Locker|(dfg) |(no)  |nul|val |Id | val |val    |
    -----------------------------------------------------------
    | 0| 1| 302832| -11.88| 1| 0|Pri| 16| 0|
    | 1| 9| 302836| 11.88| 9| 0|Pri| 10| 0|
    | 2| 1| 302832| -11.88| 5| 3|Pri| 14| 4|
    | 3| 3| 302833| 11.88| 1| 0|sec| 12| 0|
    | 4| 6| 302837| -11.88| 1| 0|Pri| 16| 3|
    language data: time= 24hrs
    |no.| name | languages | proficiency | time taken|
    |_ _| _ _ _| _ _ _ _ _ |_ _ _ _ _ _ _| _ _ _ _ _ |
    |1 | eliz | English | good | 24 hrs |
    |2 | susan| Spanish | good | 13 hrs |
    |3 | danny| Italian | decent | 21 hrs |
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    |no.| name | age | place | year |
    |_ _|_ _ _ _|_ _ _ | _ _ _ | _ _ |
    |1 | sue |33 | NY | 2015 |
    |2 | mark |28 | cal | 2106 |
    2017 Review log :Gt149 place: NY time: 13:31:15
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    |no.| name |level | dist | year |
    |_ _|_ _ _ _|_ _ _ | _ _ _ | _ _ |
    |1 | sue |96 | Gl | 2015 |
    |2 | mark |67 | Yt | 2106 |
    The problem I have is that there isn't just one table under a header; there are multiple tables (of different sizes) under each header, as shown. By observation I found that a page (header + tables) ends only when the next line starts with the year (2017). The Position log here has 3 tables, each with its own header, and it ends only when we see the next "2017" (Review log). PS: a page contains at most 4 tables.
      Ok, you are transforming one data representation into another, a very common task for Perl. In your data above, I see that there are 2 collections of data, "2017 Position log :Fp379" which contains 3 tables and "2017 Review log :Gt149" which contains one table. What format does the data need to be in for whatever program consumes the output of your program? In other words, what happens to your output? Where does it go and what does that thing that it goes to do with it?

      There are many ways to express the idea that N tables belong to a single "record". Ultimately what you generate will need to be parsed and understood by something else. Can you explain more?

      Update: I think that the subroutine finish_current_table(), which decides which tables to "keep", would need to be modified, perhaps with a state flag variable that indicates that we are within some 2017 year record. You keep talking about "pages". If you mean that these "pages" are separated by a form feed (\f), that could simplify the parsing: we could read an entire page at a time, then decide whether or not to keep the tables on that page. I personally don't like code or formats that depend upon "invisible" characters like \t or \f, but this might help to simplify the code. I am unsure. In any event, your code appears to be intended to transform a human-understandable thing into a computer-understandable thing. More detail about what this "computer-understandable thing" is would be appropriate.
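The state-flag idea could look roughly like this. It is only a sketch: group_pages is a hypothetical helper, not a change to finish_current_table(), and it assumes that every record header is a line beginning with a four-digit year while no table title or data row begins that way.

```perl
use strict;
use warnings;

# Group input lines into records ("pages").
# Assumption: a record header is any line starting with a 4-digit year,
# e.g. "2017 Position log :Fp379 ...". Lines before the first header
# are ignored.
sub group_pages {
    my @lines = @_;
    my @pages;
    my $current;                      # state: the record we are inside, if any
    for my $line (@lines) {
        if ($line =~ /^\d{4}\s/) {    # a new record starts here
            $current = { header => $line, body => [] };
            push @pages, $current;
        }
        elsif ($current) {            # table title or row of the open record
            push @{ $current->{body} }, $line;
        }
    }
    return @pages;
}

my @pages = group_pages(
    '2017 Position log :Fp379 place: cal time: 23:01:45',
    '| 0| 1| 302832| -11.88| 1| 0|Pri| 16| 0|',
    'language data: time= 24hrs',
    '|1 | eliz | English | good | 24 hrs |',
    '2017 Review log :Gt149 place: NY time: 13:31:15',
    '|1 | sue |96 | Gl | 2015 |',
);
printf "%s: %d body lines\n", $_->{header}, scalar @{ $_->{body} } for @pages;
```

Once the lines are grouped this way, the keep-or-discard decision can be made once per record by looking at its header.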

      Update: with your extra example DATA:

        Hi Marshall, thanks for the reply. Yes, I need exactly what your code produces, but with a few changes. Before that, let me explain what I'm doing, what output I expect, and what I want to do with it. I have a text file in the format I've uploaded (although with dummy data). The data comes from a real-time experiment, and I want to analyze the file using plots. What I want Perl to do is remove the delimiters, extract a particular type of page, and export it to a text file, which I'll then parse in another language (MATLAB) to extract some outputs.

        Back to the modification I need: the code prints all the tables (awesome), but I need only a specific page and, if possible, each table in a different array (explanation follows).

        TABLE: place and year data: 67
        Record_Start: 3
        Record_End: 10
        COLUMNS: no.,name,age,place,year
        1,sue,33,NY,2015
        2,mark,28,cal,2106

        TABLE: work and language :65
        Record_Start: 12
        Record_End: 19
        COLUMNS: no.,name,languages,proficiency,time taken
        1,eliz,English,good,24 hrs
        2,susan,Spanish,good,13 hrs
        3,danny,Italian,decent,21 hrs

        TABLE: Position log
        Record_Start: 20
        Record_End: 30
        COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
        0,1,302832,-11.88,1,0,Pri,16,0
        1,9,302836,11.88,9,0,Pri,10,0
        2,1,302832,-11.88,5,3,Pri,14,4
        3,3,302833,11.88,1,0,sec,12,0
        4,6,302837,-11.88,1,0,Pri,16,3

        TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
        Record_Start: 31
        Record_End: 44
        COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
        0,1,302832,-11.88,1,0,Pri,16,0
        1,9,302836,11.88,9,0,Pri,10,0
        2,1,302832,-11.88,5,3,Pri,14,4
        3,3,302833,11.88,1,0,sec,12,0
        4,6,302837,-11.88,1,0,Pri,16,3

        TABLE: Position log table 1349F.63 time 10:23:66 sequence = 39 range = 6678
        Record_Start: 46
        Record_End: 67
        COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
        0,1,302832,-11.88,1,0,Pri,16,0
        5,,,,,,,,
        6,,,,,,,,
        7,,,,,,,,
        1,9,302836,11.88,9,0,Pri,10,0
        2,1,302832,-11.88,5,3,Pri,14,4
        5,,,,,,,,
        6,,,,,,,,
        7,,,,,,,,
        3,3,302833,11.88,1,0,sec,12,0
        4,6,302837,-11.88,1,0,Pri,16,3

        TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
        Record_Start: 69
        Record_End: 82
        COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
        0,1,302832,-11.88,1,0,Pri,16,0
        1,9,302836,11.88,9,0,Pri,10,0
        2,1,302832,-11.88,5,3,Pri,14,4
        3,3,302833,11.88,1,0,sec,12,0
        4,6,302837,-11.88,1,0,Pri,16,3

        TABLE: language data: time= 24hrs
        Record_Start: 83
        Record_End: 90
        COLUMNS: no.,name,languages,proficiency,time taken
        1,eliz,English,good,24 hrs
        2,susan,Spanish,good,13 hrs
        3,danny,Italian,decent,21 hrs

        TABLE:
        Record_Start: 91
        Record_End: 95
        COLUMNS: no.,name,age,place,year
        1,sue,33,NY,2015
        2,mark,28,cal,2106

        TABLE: 2017 Review log :Gt149 place: NY time: 13:31:15
        Record_Start: 96
        Record_End: 104
        COLUMNS: no.,name,level,dist,year
        1,sue,96,Gl,2015
        2,mark,67,Yt,2106

        TABLE:
        Record_Start: 105
        Record_End: 111
        COLUMNS: no.,name,age,place,year
        1,sue,33,NY,2015
        2,mark,28,cal,2106

        TABLE: work and language :65
        Record_Start: 113
        Record_End: 119
        COLUMNS: no.,name,languages,proficiency,time taken
        1,eliz,English,good,24 hrs
        2,susan,Spanish,good,13 hrs
        3,danny,Italian,decent,21 hrs

        Above is what your code produces, but what I need is only the following, given the keyword Fp379:

        TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
        Record_Start: 69
        Record_End: 82
        COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
        0,1,302832,-11.88,1,0,Pri,16,0
        1,9,302836,11.88,9,0,Pri,10,0
        2,1,302832,-11.88,5,3,Pri,14,4
        3,3,302833,11.88,1,0,sec,12,0
        4,6,302837,-11.88,1,0,Pri,16,3

        TABLE: language data: time= 24hrs
        Record_Start: 83
        Record_End: 90
        COLUMNS: no.,name,languages,proficiency,time taken
        1,eliz,English,good,24 hrs
        2,susan,Spanish,good,13 hrs
        3,danny,Italian,decent,21 hrs

        TABLE:
        Record_Start: 91
        Record_End: 95
        COLUMNS: no.,name,age,place,year
        1,sue,33,NY,2015
        2,mark,28,cal,2106

        Moreover, it would make my parsing easier if each of these tables went into its own array, since I'm also converting the header into another field in the array, e.g. a column named "record start" in which all rows of a particular table have the same value.
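One possible shape for "one array per table, with the header repeated as an extra field" is sketched below. The helper names parse_row and build_table are hypothetical, and the sketch assumes that data rows look like the pipe-delimited rows in the sample data earlier in the thread.

```perl
use strict;
use warnings;

# Split one "| a | b |" row into trimmed cell values.
# Empty cells are kept as empty strings.
sub parse_row {
    my ($line) = @_;
    $line =~ s/^\s*\|//;              # drop the leading bar
    $line =~ s/\|\s*$//;              # drop the trailing bar
    my @cells = split /\|/, $line, -1;
    s/^\s+|\s+$//g for @cells;        # trim whitespace around each cell
    return @cells;
}

# Build one table as an array of rows; the table title is appended to
# every row as an extra, constant field.
sub build_table {
    my ($title, @row_lines) = @_;
    return [ map { [ parse_row($_), $title ] } @row_lines ];
}

# Each table gets its own array reference.
my $languages = build_table(
    'language data: time= 24hrs',
    '|1 | eliz | English | good | 24 hrs |',
    '|2 | susan| Spanish | good | 13 hrs |',
);
print join(',', @$_), "\n" for @$languages;
```

The same pattern would work for other per-table values such as the start and end record numbers: append them to each row inside build_table.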

        Sadly, there isn't any form feed separating the pages. There is a double blank line between pages, but the same is true between some tables too, so I think the end of a page can only be found by seeing the year on the next line (the start of the next page).

        One more important thing: the program takes a long time to produce output given the size of the file (almost 1.5 GB). Wouldn't it speed things up if we read only the required pages? It is taking me more than 5 minutes to parse a single extension. I would be very grateful if you could suggest a way to reduce the time.

        All I want is to look for a keyword and start printing all the headers and tables (as you are doing now) until we find the year at the start of the next line.
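A rough sketch of that filter, reading one line at a time so that memory use stays flat on a large file. The sub name print_pages_matching is hypothetical, and the sketch assumes every page begins with a line starting with a four-digit year. It copies the raw lines of the matching pages; the table conversion would then only need to run on those lines. Note that a plain text file still has to be read from start to end, so any saving comes from not parsing the unwanted pages rather than from skipping them on disk.

```perl
use strict;
use warnings;

# Copy to $out only the pages whose header line contains $keyword.
# Assumption: a page header is a line beginning with a 4-digit year.
sub print_pages_matching {
    my ($in, $out, $keyword) = @_;
    my $printing = 0;                                  # inside a wanted page?
    while (my $line = <$in>) {
        if ($line =~ /^\d{4}\s/) {                     # page header line
            $printing = index($line, $keyword) >= 0;   # literal substring test
        }
        print {$out} $line if $printing;
    }
    return;
}

# Demo on in-memory filehandles; a real run would open the log file
# for reading and the result file for writing instead.
my $sample = join "\n",
    '2017 Position log :Fp379 place: cal time: 23:01:45',
    '| 0| 1| 302832| -11.88| 1| 0|Pri| 16| 0|',
    'language data: time= 24hrs',
    '|1 | eliz | English | good | 24 hrs |',
    '2017 Review log :Gt149 place: NY time: 13:31:15',
    '|1 | sue |96 | Gl | 2015 |',
    '';
open my $in, '<', \$sample or die "cannot read sample: $!";
my $result = '';
open my $out, '>', \$result or die "cannot write result: $!";
print_pages_matching($in, $out, 'Fp379');
close $out;
print $result;
```

With the keyword Fp379 this prints the Position log header and the lines under it and stops printing at the Review log header.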