I have a huge text file containing around 50,000 records.
It would be great if each record were limited to one line, but each record is actually spread over multiple lines.
The data format is
Data Value : (some number here)
Data Group : (Some string here)
Data RAW : (some text, optionally followed by \ if the line gets too long)
A sample may be
Data_Value -3455
Data_Group Clock453
Data_raw -from point_q/point2 \
-of point_a/abc/Q/D \
.
.
-to endpoint/abc/CK
So you get the gist. The Data_raw field can be hundreds of lines long, so it's a really huge file to process. Thankfully, I don't need to store all the data.
All I need to do for each record is create a hash entry that holds
Hash.Data_value=<something>
hash.Data_Group=<something>
Hash.Data_line=<line number of the Data_raw field>
So I create this huge hash, and then I do various stuff using the line number as the key.
For example, this record file is put through a processor, which lists some data points as insignificant.
So the output file is
Line no : <line number of data_raw field of insignificant data>
Data_processed_output -from blah blah
. \
. \
-to blah blah
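To make the output format concrete, here is a minimal sketch of collecting the insignificant line numbers. It is written in Python rather than Perl, with a set standing in for a hash of keys. The function name `insignificant_lines` is made up for illustration, and it assumes every marker line looks exactly like "Line no : <number>" as shown above.

```python
import re

# Assumption: marker lines start with "Line no", then a colon, then digits.
LINE_NO = re.compile(r"^Line no\s*:\s*(\d+)")

def insignificant_lines(lines):
    """Return the set of Data_raw line numbers flagged by the processor."""
    found = set()
    for line in lines:
        match = LINE_NO.match(line)
        if match:
            found.add(int(match.group(1)))
    return found
```

Any iterable of lines works, including an open file handle, so the output file never has to be loaded whole.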
The post processing is easy and I have this worked out. All I need is to create some metrics.
For example, iterate with the group as the key, total the data values for each group, and show the result to the user in a table like this:
Data Group | Total Value | Maximum Value | Minimum Value
Then I can create a hash from the processed file, simply ignore the line-number keys that were marked as insignificant, and then create the above table.
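The per-group totals could be computed roughly like this. It is a sketch in Python rather than Perl, with dicts standing in for hashes. The function name `group_metrics` and the record shape (line number mapped to a dict with "value" and "group" keys) are assumptions made for illustration, not an existing format.

```python
def group_metrics(records, insignificant=()):
    """Total, maximum and minimum value per group, skipping flagged lines.

    records: {raw_line_number: {"value": number, "group": str}} (assumed shape)
    insignificant: iterable of line numbers to leave out
    """
    skip = set(insignificant)
    table = {}
    for lineno, rec in records.items():
        if lineno in skip:
            continue
        value = rec["value"]
        row = table.get(rec["group"])
        if row is None:
            # First value seen for this group seeds all three columns.
            table[rec["group"]] = {"total": value, "max": value, "min": value}
        else:
            row["total"] += value
            row["max"] = max(row["max"], value)
            row["min"] = min(row["min"], value)
    return table
```

Each record is visited once, so the cost grows linearly with the number of records, not with the size of the raw file.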
I have the second processing part worked out, but reading in the huge file with multi-line records is what has me stuck.
I want to read it in fast.
The read-in procedure is simple: I start reading the file and ignore all lines until I hit the line that says
"Data_value", and store the value in the hash.
Then I go to the next line and store its value (the group) in the hash.
Then I go to the next line and store its line number in the hash, and then I ignore all subsequent lines until I see Data_value again.
Please suggest a fast way!
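For reference, the procedure described above could be sketched as a single pass that never holds more than one line in memory. This is Python rather than Perl, with a dict standing in for the hash. The function name `index_records` is made up, and it assumes the field names are spelled exactly as in the sample (Data_Value, Data_Group, Data_raw) and always appear in that order.

```python
def index_records(lines):
    """Map each Data_raw line number to its record's value and group."""
    records = {}
    value = group = None
    for lineno, line in enumerate(lines, start=1):
        # Cheap prefix tests; continuation lines of Data_raw fail all three
        # and are skipped without any further work.
        if line.startswith("Data_Value"):
            value = float(line.split(None, 1)[1])
            group = None
        elif line.startswith("Data_Group"):
            group = line.split(None, 1)[1].strip()
        elif line.startswith("Data_raw") and value is not None:
            records[lineno] = {"value": value, "group": group}
            value = group = None  # ignore everything until the next record
    return records
```

Passing an open file handle, for example `with open(path) as fh: records = index_records(fh)` where `path` is a placeholder, streams the file line by line, so the long Data_raw bodies are read but never stored.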