Re: Reading HUGE file multiple times

Hi,

I am dealing just about every day with somewhat similar problems on huge data files, and I am fairly confident that it should be possible to read the file only once (or at most twice), but you don't give enough information about the structure of the file.

Is my understanding correct that you first have a bunch of identifier lines (1000+), and then your data lines? And the identifier lines somehow give the rules as to what to do with the data lines? Or do you have one identifier line giving information about what to do with the next data line or the next few data lines?

Please tell us more about the identifiers: do they say on which data line numbers to do something? Or which field to extract in the data line?

In all cases, I believe that it should most probably be possible to read your file sequentially only once, record what you have in the identifier line and use that for processing the data lines coming afterwards. But I can't say more on how to do it without a better idea of your data format or, even better, a simplified sample of your file content together with some explanation on how to use the identifiers to analyze the data lines.
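
To illustrate the single-pass idea, here is a minimal sketch. It assumes, purely for illustration, that every identifier line starts with `>` and is followed by exactly one data line; the sub name `process_records` and the sample data are made up, and the real file may well be laid out differently.

```perl
use strict;
use warnings;

# Hypothetical helper: read the file once, pairing each identifier line
# with the data line that follows it, and hand both to a callback.
sub process_records {
    my ($fh, $callback) = @_;
    while (defined(my $id = <$fh>)) {
        chomp $id;
        next unless $id =~ s/^>//;    # assumption: identifier lines start with '>'
        my $data = <$fh>;
        last unless defined $data;    # identifier without a data line: stop
        chomp $data;
        $callback->($id, $data);
    }
    return;
}

# Made-up sample, read through an in-memory filehandle.
my $sample = ">ID_1\nACGTACGT\n>ID_2\nTTGACC\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my %length_of;
process_records($fh, sub {
    my ($id, $data) = @_;
    $length_of{$id} = length $data;   # stand-in for the real processing
});
close $fh;

print "$_: $length_of{$_}\n" for sort keys %length_of;
```

The callback keeps the reading loop separate from whatever has to be done with each record, so the file is still read only once even if the processing changes.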

Re^2: Reading HUGE file multiple times by Anonymous Monk on Apr 28, 2013 at 12:42 UTC
Hi there,

Thanks for the tips. My data looks something like this:

    >ID
    Data (a verrry long string of varying length on a single line)
    >ID again
    Data again

Indexing might be a good idea. Maybe I could read only the IDs (skipping the next line) and then, when accessing the data, just add 1 to the index? I need to extract the entries twice, in different subroutines, and each time the subroutine specifies what to do with them. I don't know whether it is a good idea to store it all in a hash. I only need a fragment of the data in the first read and the whole data entry in the second. I don't have the IDs in advance; the subroutine specifies which one I need and what to do with it.

I've tried `$Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof();` but it takes a very long time. I wonder if there is a better way to do it, since this would be a bottleneck.
Re^3: Reading HUGE file multiple times by BrowserUk (Patriarch) on Apr 28, 2013 at 13:30 UTC
"I've tried `$Library_Index{<$Library>} = tell(ARGV), scalar <$Library> until eof();` but it takes very long time to do."

How long?

"I wonder if there is a better way to do it since this would be a bottleneck."

If you are re-using the file many times, you could construct the index once and write it to a separate file. It should be substantially quicker to read the index from a file than to construct it from scratch each time.
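
One way to keep the index in a separate file, sketched here with the core `Storable` module. The hash contents and the temporary file are made up for the example; a real script would use a fixed path next to the data file and rebuild the index whenever the data file changes.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Made-up index: ID => byte offset of its data line.
my %library_index = (ID_1 => 6, ID_2 => 21);

# A temporary file stands in for the real index file in this sketch.
my ($tmp_fh, $index_file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write the index once...
store(\%library_index, $index_file)
    or die "Cannot store index in $index_file";

# ...and on later runs load it instead of rescanning the data file.
my $loaded = retrieve($index_file);

print "$_ => $loaded->{$_}\n" for sort keys %$loaded;
```

`retrieve` returns a reference to a new hash, so lookups look like `$loaded->{$id}`. The stored offsets are only valid as long as the data file itself is unchanged.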
Re^4: Reading HUGE file multiple times by Anonymous Monk on Apr 28, 2013 at 13:36 UTC
It's over 10 min, so I killed the process. I think the reason is that it was using the data line as the hash key, and a data line can have 300,000 characters. I changed it so that it only indexes the ID, and I will just add 1 to the index when I need the data. Now it's done in a couple of seconds. Thanks for the tip about the index.
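
A rough sketch of that approach, using byte offsets from `tell` rather than line numbers so that `seek` can jump straight to a record. The sub names and sample data are hypothetical, and the code assumes one `>ID` line followed by one data line, as described earlier in the thread.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: map each ID to the byte offset of its data line.
# Only the short ID is used as a hash key; the long data line is skipped.
sub build_index {
    my ($fh) = @_;
    my %offset_of;
    while (defined(my $id = <$fh>)) {
        $id =~ s/^>//;                    # assumption: drop the leading '>'
        $id =~ s/\s+\z//;                 # drop the newline and trailing spaces
        $offset_of{$id} = tell($fh);      # position right after the ID line
        defined(my $skipped = <$fh>) or last;   # read past the data line
    }
    return \%offset_of;
}

# Return the whole data line for an ID, or undef if the ID is unknown.
sub fetch_data {
    my ($fh, $index, $id) = @_;
    my $pos = $index->{$id};
    return unless defined $pos;
    seek($fh, $pos, SEEK_SET) or die "seek failed: $!";
    my $data = <$fh>;
    chomp $data if defined $data;
    return $data;
}

# Return $length bytes of the data line, starting $start bytes into it.
# The caller is responsible for staying within the line.
sub fetch_fragment {
    my ($fh, $index, $id, $start, $length) = @_;
    my $pos = $index->{$id};
    return unless defined $pos;
    seek($fh, $pos + $start, SEEK_SET) or die "seek failed: $!";
    my $got = read($fh, my $buffer, $length);
    return unless $got;
    return $buffer;
}

# Made-up sample, read through an in-memory filehandle.
my $sample = ">ID_1\nACGTACGT\n>ID_2\nTTGACC\n";
open my $library, '<', \$sample or die "Cannot open in-memory file: $!";

my $index = build_index($library);
print "whole entry: ", fetch_data($library, $index, 'ID_2'), "\n";
print "fragment:    ", fetch_fragment($library, $index, 'ID_1', 2, 3), "\n";
```

The offsets are counted in bytes, so the fragment arithmetic is only straightforward when every character of the data occupies one byte.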
Re^5: Reading HUGE file multiple times by BrowserUk (Patriarch) on Apr 28, 2013 at 13:45 UTC
Re^6: Reading HUGE file multiple times by Anonymous Monk on Apr 28, 2013 at 14:16 UTC
Re^4: Reading HUGE file multiple times by Anonymous Monk on Apr 28, 2013 at 14:06 UTC
I tried the new code and it works really fast. The problem is that something is wrong with tell: every value is -1. From Dumper: `$VAR32564 = -1; $VAR32565 = '>ENST00400413799 ';`
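
A possible explanation for the -1 values, assuming the index is still built with the `tell(ARGV)` call from the one-liner quoted earlier in the thread: `tell` returns -1 on error, and `ARGV` is the handle behind `<>`, not `$Library`. If nothing has been read through `<>`, `ARGV` is not open, so every call fails. The small demonstration below uses made-up sample data and an in-memory filehandle.

```perl
use strict;
use warnings;

# Made-up sample, read through an in-memory filehandle.
my $sample = ">ID_1\nACGTACGT\n";
open my $library, '<', \$sample or die "Cannot open in-memory file: $!";

my $id = <$library>;    # read the ID line, as the indexing loop does

# ARGV has not been opened in this script, so tell() fails on it.
# The warning it would normally emit is silenced for the demonstration.
my $wrong = do { no warnings 'unopened'; tell(ARGV) };

# Asking the handle that is actually being read gives the real position.
my $right = tell($library);

print "tell(ARGV)     = $wrong\n";
print "tell(\$library) = $right\n";
```

The same reasoning applies to `eof()` with empty parentheses, which also refers to the files named on the command line; `eof($Library)` tests the handle that is actually being read.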
Re^2: Reading HUGE file multiple times by Anonymous Monk on Apr 28, 2013 at 12:45 UTC
Forgot to mention: the ID is only used to identify the data. So basically I know I need ID_1, and I extract the corresponding data set. Afterwards the code has different variables that control how to modify it, which are not related to the ID.