in reply to Re^9: Parsing .txt into arrays
in thread Parsing .txt into arrays

Hi Marshall, thanks for the reply. Yes, I need exactly what your code produces, but with a few changes. Before that, let me explain what I'm doing and what output I'm expecting. I have a text file in the format I've uploaded (although with dummy data); the data comes from a real-time experiment, and I want to analyze it using plots. What I want Perl to do is remove the delimiters, extract particular types of pages, and export them to a text file, which I'll then parse with another language (MATLAB) to extract some outputs.

Back to the modification I need: the code prints all the tables (awesome), but I need only a specific page, and if possible I'd like each table in a different array (explanation follows).

TABLE: place and year data: 67
Record_Start: 3
Record_End: 10
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106
TABLE: work and language :65
Record_Start: 12
Record_End: 19
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs
TABLE: Position log
Record_Start: 20
Record_End: 30
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
Record_Start: 31
Record_End: 44
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
TABLE: Position log table 1349F.63 time 10:23:66 sequence = 39 range = 6678
Record_Start: 46
Record_End: 67
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
5,,,,,,,,
6,,,,,,,,
7,,,,,,,,
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
5,,,,,,,,
6,,,,,,,,
7,,,,,,,,
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
Record_Start: 69
Record_End: 82
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
TABLE: language data: time= 24hrs
Record_Start: 83
Record_End: 90
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs
TABLE:
Record_Start: 91
Record_End: 95
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106
TABLE: 2017 Review log
:Gt149 place: NY time: 13:31:15
Record_Start: 96
Record_End: 104
COLUMNS: no.,name,level,dist,year
1,sue,96,Gl,2015
2,mark,67,Yt,2106
TABLE:
Record_Start: 105
Record_End: 111
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106
TABLE: work and language :65
Record_Start: 113
Record_End: 119
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs

Above is what your code produces, but given the keyword Fp379, all I need is the output below:

TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
Record_Start: 69
Record_End: 82
COLUMNS: #,Locker,Pos (dfg),value (no),nul,bulk val,lot Id,prev val,newest val
0,1,302832,-11.88,1,0,Pri,16,0
1,9,302836,11.88,9,0,Pri,10,0
2,1,302832,-11.88,5,3,Pri,14,4
3,3,302833,11.88,1,0,sec,12,0
4,6,302837,-11.88,1,0,Pri,16,3
TABLE: language data: time= 24hrs
Record_Start: 83
Record_End: 90
COLUMNS: no.,name,languages,proficiency,time taken
1,eliz,English,good,24 hrs
2,susan,Spanish,good,13 hrs
3,danny,Italian,decent,21 hrs
TABLE:
Record_Start: 91
Record_End: 95
COLUMNS: no.,name,age,place,year
1,sue,33,NY,2015
2,mark,28,cal,2106

Moreover, it would make my parsing easier if each of these tables went into a different array, since I'm converting the header values into additional fields of the array, e.g. a column named "Record_Start" under which every row of a given table has the same value.
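The layout described above could be sketched like this. The data shape is purely hypothetical (one array per table name, header values repeated on every row); the real parser may pass its pieces around differently:

```perl
use strict;
use warnings;

# Hypothetical shape only: one array per table, keyed by table name,
# with the header values (Record_Start, Record_End) repeated as extra
# fields on every row so MATLAB can read them as ordinary columns.
my %tables;    # table name => ref to array of row hashes

sub add_row {
    my ($name, $start, $end, $cols, $csv) = @_;
    my %row;
    @row{ split /,/, $cols } = split /,/, $csv;      # pair columns with values
    @row{ 'Record_Start', 'Record_End' } = ( $start, $end );
    push @{ $tables{$name} }, \%row;
}
```

Each row hash then carries both the CSV fields and the per-table header fields, which is what makes the downstream MATLAB parse uniform.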

Sadly, there isn't any form feed separating the pages. There is a double blank line between pages, but the same separator also appears between some tables, so I think the end of a page can only be detected by seeing the year on the next line (the start of the next page).
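A guess at that boundary test in code: a run of two or more blank lines followed by a line that starts with a four-digit year. The exact pattern depends on the real file format, so treat this as a sketch only:

```perl
use strict;
use warnings;

# Sketch: return the indices of lines that look like page starts,
# i.e. a line opening with a four-digit year that follows at least
# two blank lines. Pattern is an assumption about the real data.
sub page_starts {
    my @lines = @_;
    my @starts;
    my $blank_run = 0;
    for my $i ( 0 .. $#lines ) {
        if ( $lines[$i] =~ /^\s*$/ ) { $blank_run++; next; }
        push @starts, $i
            if $blank_run >= 2 && $lines[$i] =~ /^\s*(19|20)\d\d\b/;
        $blank_run = 0;
    }
    return @starts;
}
```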

One more important thing: the program takes a long time to produce output, considering the size of the file (almost 1.5 GB). Wouldn't it speed things up if we read only the required pages? It's taking me more than 5 minutes to parse a single extension, so I'd desperately request that you suggest a way to reduce the time.

All I want is to look for a keyword and start printing all the headers + tables (as you are doing now) until we find a year name on the next line.

Re^11: Parsing .txt into arrays
by Marshall (Canon) on Jun 14, 2017 at 20:56 UTC
    Hi Fshah,

    Well, unfortunately, your first Perl project is quite a big one. I also gather from your questions that you have very little prior programming experience, which I'm sure makes this all the more difficult for you.

    The purpose of PerlMonks is for you to learn Perl. I'm not seeing any attempts at your own code along with your questions.

    I wrote a parser to get you started because it uses several techniques that are way beyond "Class 101, homework #1". I figured that you didn't have a chance at even getting started without substantial help. So I got you started with a big jump. Note that huck also contributed some code for you.

    At this point, I would expect you to be spending a considerable number of hours, even full days, trying to understand how the parser works. Learning a new language, especially a large language like Perl, is difficult even for experienced programmers. Complex data structures are an advanced topic, but one that you need to learn more about. Consider Perl Data Structures, and Data Types and Variables, in the Monks' tutorials section.

    My code parses each and every table it encounters. In Version2, a decision is made in finish_current_table() about whether or not to keep the table that has just been parsed as part of @results. finish_current_table() could be modified to print the results right away instead of saving them to @results. My code concentrated the "dump the results" step into a section of code at the end of the input file, but it doesn't have to be that way, and usually it isn't done that way. As a general principle, don't save things that you can dump/print/save to file/get rid of right away. I was thinking of a "multi-table record" when I wrote the code, but didn't put in the decision logic for that. There are many ways to do this.
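    The "print right away instead of saving" point could look like this. The sub name follows the discussion above, but the table layout (a hash ref with name and rows) is an assumption for illustration, not the parser's real structure:

```perl
use strict;
use warnings;

# Sketch of "don't save what you can print right away". The table
# layout (hash ref with name and rows keys) is assumed here.
our @results;    # only needed for the accumulate variant

sub finish_current_table {
    my ($table_ref) = @_;
    # Accumulating grows without bound on a 1.5 GB input:
    #   push @results, $table_ref;
    # Streaming keeps memory flat:
    print "TABLE: $table_ref->{name}\n";
    print "$_\n" for @{ $table_ref->{rows} };
}
```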

    You can have a variable that starts outputting tables when "Fp379" is seen in the first line of the name, and stops after a record is seen with a "blank" name:

        TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
        Record_Start: 69
        ....
        TABLE: language data: time= 24hrs
        Record_Start: 83
        Record_End: 90
        ....
        TABLE: Record_Start: 91     #<<<<- wrong/misleading
        Record_End: 95
        ....

    Modify the print of $name to have a "\n" if it is blank, so that you get:

        TABLE: 2017 Position log :Fp379 place: cal time: 23:01:45
        Record_Start: 69
        ....
        TABLE: language data: time= 24hrs
        Record_Start: 83
        Record_End: 90
        ....
        TABLE:
        Record_Start: 91
        Record_End: 95
        ....
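    That start/stop flag could be sketched as follows. It assumes the parser hands each finished table over as a name plus its output lines; the sub and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Sketch of the keyword-to-blank-name window. Assumes each finished
# table arrives as ($name, @rows); names here are illustrative only.
my $printing = 0;    # off until the keyword is seen

sub select_table {
    my ( $name, @rows ) = @_;
    $printing = 1 if $name =~ /Fp379/;                  # keyword opens the window
    my $keep = $printing;
    $printing = 0 if $printing && $name =~ /^\s*$/;     # blank name closes it
    return $keep ? ( "TABLE: $name", @rows ) : ();
}
```

The blank-name table itself is still returned before the window closes, matching the three-table excerpt the poster asked for.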
    As far as performance goes, if you modify the code such that @results doesn't become "huge" or don't even use @results, the code should run much faster than 5 minutes, even with 1.5 GB input. The speed should be about the rate at which 1.5 GB of lines can be read from the disk. The processing done on each line is very minimal. If you are saving all results to @results, then this could take a "long time" due to virtual memory concerns and disk swapping.

    I don't know why 5 minutes is a problem, even if that is what it takes to process every single table in this huge file. As a laugh, I remember one client who complained about a program that took 6 hours to run on my laptop. They were pretty upset about this amount of time (albeit much less on their server). I asked: how many times per year do you run this program? Answer: 4. Has the program ever "made a documented mistake"; are there any outstanding bug reports? Answer: no. You can imagine how the rest of the discussion went... The execution time just didn't matter.

    In this case, I suspect 5 minutes or even 6 hours to process the entire file is just fine. The idea should be to modify the decision making about the tables so that "once is enough". If you need 1,000 table records, do it all at once in one program run instead of running the program 1,000 times. Some more sophisticated decision making in finish_current_table() is needed.

Re^11: Parsing .txt into arrays
by marto (Cardinal) on Jun 14, 2017 at 11:57 UTC