tsk1979 has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge text file which contains around 50,000 records. If each record were limited to one line it would be great, but each record is actually spread across multiple lines.

The data format is

    Data Value : (some number here)
    Data Group : (some string here)
    Data RAW   : (some text, optionally followed by \ if the line gets too long)

A sample may be
    Data_Value -3455
    Data_Group Clock453
    Data_raw -from point_q/point2 \
        -of point_a/abc/Q/D \
        .
        .
        -to endpoint/abc/CK

So you get the gist. The Data_raw part can actually be hundreds of lines long, so it's a really huge file to process. Thankfully, however, I don't need to store all the data.

All I need to do is, for each record, create a hash entry along these lines:

    Hash.Data_value = <something>
    Hash.Data_Group = <something>
    Hash.Data_line  = <line number of the Data_raw field>

So I create this huge hash, and then I do various stuff using the line number as the key.
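
Roughly, the structure I have in mind looks like this (illustrative only; the fields are keyed by the line number of the Data_raw field):

    # one entry per record, keyed by the line number of its Data_raw field
    %indata = (
        3 => { Data_Value => -3455, Data_Group => 'Clock453' },
        # ... around 50,000 more entries ...
    );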

For example, this record file is put through a processor, which lists some data points as insignificant.
So the output file is
    Line no : <line number of the Data_raw field of the insignificant data>
    Data_processed_output -from blah blah
        . \
        . \
        -to blah blah
The post processing is easy and I have this worked out. All I need is to create some metrics.
For example, iterate with the group as the key, total the data values per group, and show the user a table like this:

    Data Group | Total Value | Maximum Value | Minimum Value

Then I can create a hash from the processed file, simply ignore the line-number keys which were marked as insignificant, and then create the above table.
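
A rough sketch of the metrics pass I have in mind (hash and field names are placeholders; %indata is the structure above and %insignificant holds the line numbers flagged by the processor):

    my %by_group;
    for my $lineno (keys %indata) {
        next if $insignificant{$lineno};    # drop the insignificant records
        my ($grp, $val) = @{ $indata{$lineno} }{qw(Data_Group Data_Value)};
        my $s = $by_group{$grp} ||= { total => 0 };
        $s->{total} += $val;
        $s->{max} = $val if !defined $s->{max} || $val > $s->{max};
        $s->{min} = $val if !defined $s->{min} || $val < $s->{min};
    }

    printf "%-12s %12s %12s %12s\n", 'Data Group', 'Total', 'Max', 'Min';
    printf "%-12s %12d %12d %12d\n", $_, @{ $by_group{$_} }{qw(total max min)}
        for sort keys %by_group;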

I have the second processing part worked out, but reading in the huge file with multi-line records is what is holding me up. I want to read it in fast.

The read-in procedure is simple: I start reading the file and ignore all lines till I hit the line which says "Data_Value", and store that value in a hash. Then I go to the next line and store its value in the hash. Then I go to the next line and store its line number in the hash, and then I ignore all subsequent lines till I see Data_Value again.
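
In rough pseudocode (this is not my actual script, just the shape of the loop, assuming INDATA is the open filehandle):

    my %indata;
    my ($value, $group);
    while (my $line = <INDATA>) {
        if    ($line =~ /^Data_Value\s+(\S+)/) { $value = $1 }
        elsif ($line =~ /^Data_Group\s+(\S+)/) { $group = $1 }
        elsif ($line =~ /^Data_raw/) {
            # key the record by the line number of the Data_raw field
            $indata{$.} = { Data_Value => $value, Data_Group => $group };
        }
        # all other lines (the continuation lines) are ignored
    }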

Please suggest a fast way!

Replies are listed 'Best First'.
Re: Best way to read and process a text file with multiline records
by ELISHEVA (Prior) on May 21, 2009 at 07:20 UTC

    I usually keep state and check for the end-of-line continuation character - something like this (pseudocode):

    my $line = '';
    while (<INPUT>) {
        # build line
        chomp;
        if (/^(.*)\\$/) {
            $line .= $1;    # or maybe $line .= "$1 "; ???
            next;
        }
        $line .= $_;

        # ... do line processing here ...
        if    ($line =~ /^Data_Value/) { ... }
        elsif ($line =~ /^Data_Group/) { ... }
        elsif ($line =~ /^Data_Raw/)   { ... }

        # reset state
        $line = '';
    }

    The advantage of this approach is that it keeps the mechanics of reading in the data separate from the logic of processing the data. In the processing section you can get as fancy as you need without worrying about how it relates to the file's method of handling continuation lines.
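
    For instance (an untested sketch of mine, continuing the pseudocode above), the line-processing branch could be pushed into a sub of its own, so the read loop stays untouched no matter how fancy the processing gets:

    sub process_line {
        my ($line, $lineno) = @_;
        if    ($line =~ /^Data_Value/) { ... }
        elsif ($line =~ /^Data_Group/) { ... }
        elsif ($line =~ /^Data_Raw/)   { ... }
    }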

    Best, beth

      This is a nice technique.
Re: Best way to read and process a text file with multiline records
by Utilitarian (Vicar) on May 21, 2009 at 07:19 UTC
    Not sure if this is what you're asking, but something like the following would create the data structure I think you're looking for fairly quickly. Warning: totally untested, modify 'til it works ;)
    # Assumes $DATA is an already-open filehandle for the record file.
    my $record_line = 2;        # 2 = "waiting for the next Data_Value"
    my @records;
    my $index = -1;             # first Data_Value bumps this to 0

    while (<$DATA>) {
        if ($record_line != 2) {
            if (/^Data_raw/i) {
                # only the line number of the Data_raw field is kept
                $records[$index]->{Data_Line} = $.;
                $record_line = 2;
            }
            elsif (/^Data_Group/i) {
                $record_line = 1;
                my ($title, @value) = split;
                $records[$index]->{Data_Group} = join ' ', @value;
            }
        }
        elsif (/^Data_Value/i) {
            $index++;
            my ($title, @value) = split;
            $records[$index]->{Data_Value} = join ' ', @value;
            $record_line = 0;
        }
    }
      This seems very quick! I will work on this to create my hash!
      Since the line continuation character only appears in Data_raw, and the actual data is not important (only the line number matters), I do not need to bother with the "\" character!
Re: Best way to read and process a text file with multiline records
by targetsmart (Curate) on May 21, 2009 at 07:07 UTC
    Please suggest a fast way!
    Good that you showed your algorithm, but please also show some code you actually wrote that needs improvement!

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
      Below I present some code which I have currently finalized.
      The way of reading multiline records using $prevline and $prevline2 is not really original; I saw a colleague of mine doing this.

      It looks very very simple and works, and I am currently writing the complete script with processing and all.
      my ($prevline, $prevline2) = ('', '');   # the previous two lines read
      my %indata;

      while (<INDATA>) {
          chomp;
          my $linenum = $.;
          if (/^Data_raw/i) {
              # $prevline2 holds the Data_Value line, $prevline the Data_Group line;
              # keep only the last whitespace-separated field of each
              $indata{INFILE_RAW}{$linenum}{VALUE} = (split /\s+/, $prevline2)[-1];
              $indata{INFILE_RAW}{$linenum}{GROUP} = (split /\s+/, $prevline)[-1];
          }
          $prevline2 = $prevline;
          $prevline  = $_;
      }

      The main reason I do it like this is that I am inputting multiple files here. The governing key for all those files will be the line number. In the end the user will be presented with statistics like this:
      Group name | Total Value
      ABC        |         500
      DEF        |         800
      .
      .

      Now another file I will input will just contain the line numbers of invalid data items. So the user will be shown another table which is the same as above, but with the invalids removed. Another table like the above will be shown for the "invalids" file. In all this, the key is the line number which identifies what to subtract - the line number of the Data_raw field, to be precise.
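
      What I plan to do for that part (just a sketch; the file name and the exact "Line no :" format are my assumptions) is read the invalid line numbers into a hash and skip those keys when totalling:

      my %invalid;
      open my $inv, '<', 'invalid_lines.txt' or die "invalid_lines.txt: $!";
      while (<$inv>) {
          $invalid{$1} = 1 if /^Line no\s*:\s*(\d+)/;
      }
      close $inv;

      # later, while building the per-group totals:
      for my $lineno (keys %{ $indata{INFILE_RAW} }) {
          next if $invalid{$lineno};
          # ... accumulate total/max/min as before ...
      }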
Re: Best way to read and process a text file with multiline records
by Bloodnok (Vicar) on May 21, 2009 at 13:25 UTC
    This seems to fit the bill ... unless I'm missing something:
    use warnings;
    use strict;
    use Data::Dumper;

    my (%result, $subhash);

    while (<DATA>) {
        if (/^Data_Value/ ... /^Data_raw/) {
            my @value = split;
            my $value = $value[0] = '';
            $value = "@value";
            if (/^Data_Value/) {
                $subhash = $value;
                $result{$subhash} = {};
            }
            $result{$subhash}->{Data_Group} = $value if /^Data_Group/;
            $result{$subhash}->{Data_raw}   = $.     if /^Data_raw/;
        }
    }

    print Dumper \%result;

    __DATA__
    Data_Value -999
    Data_Group Clock453
    Data_raw -from point_q/point2 \
      -of point_a/abc/Q/D \
      .
      .
      -to endpoint/abc/CK
    Data_Value -123
    Data_Group Clock453
    Data_raw -from point_q/point2 \
      -of point_a/abc/Q/D \
      .
      .
      -to endpoint/abc/CK
    Data_Value -666
    Data_Group Clock453
    Data_raw -from point_q/point2 \
      -of point_a/abc/Q/D \
      .
      .
      -to endpoint/abc/CK
    user@unforgiven:~$ perl tst.pl
    $VAR1 = {
              ' -123' => {
                           'Data_Group' => ' Clock453',
                           'Data_raw' => '10'
                         },
              ' -999' => {
                           'Data_Group' => ' Clock453',
                           'Data_raw' => '3'
                         },
              ' -666' => {
                           'Data_Group' => ' Clock453',
                           'Data_raw' => '17'
                         }
            };
    A user level that continues to overstate my experience :-))
      Hmm, the use of the ellipsis (...) operator - this seems to be quite interesting. I have never used this to parse multiple records.
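
      If I understand it right, in scalar context the ... (or ..) range operator works as a flip-flop: it turns true when the left-hand pattern matches and stays true until the right-hand pattern matches, so every line from Data_Value through Data_raw falls inside the if block. A toy example of the idea (my own sketch, not from the node above):

      while (<DATA>) {
          print if /^BEGIN/ ... /^END/;   # prints only the BEGIN..END blocks
      }
      __DATA__
      noise
      BEGIN
      inside 1
      END
      more noise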

      With Perl I learn something new every day - and to think that I have been programming for the past 2 years! Thanks

Re: Best way to read and process a text file with multiline records
by CountZero (Bishop) on May 21, 2009 at 14:00 UTC
    How fast/slow is your program? Does it take hours, minutes or seconds to process the file? Is it a time critical application?

    These are questions you should ask yourself before starting to prematurely optimise your program.
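
    If you just want a number, one quick way (my own sketch) is to wrap the read loop with the core Benchmark module and see whether it is even worth tuning:

    use Benchmark;

    my $t0 = Benchmark->new;
    # ... read and parse the big file here ...
    my $t1 = Benchmark->new;
    print 'Read/parse took: ', timestr( timediff($t1, $t0) ), "\n";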

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Best way to read and process a text file with multiline records
by Anonymous Monk on May 21, 2009 at 07:12 UTC
    What you described sounds very fast.