in reply to Process large text data in array

However, after reading all the data into memory, I need to process it from beginning to end once to get the wanted data lines based on the given criteria,

Why read it all in -- i.e. read every line, allocate space for every line, extend the array to accommodate every line -- if you only need to process the array once?

In other words, why not:

while( <TEMPFILE> ) { processLine( $_ ) }
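
For example, a minimal sketch (processLine here is hypothetical; substitute whatever your real per-line criteria are):

    my @wanted;    # lines that pass the criteria

    sub processLine {
        my( $line ) = @_;
        push @wanted, $line if $line =~ /active/;   # assumed criterion
    }

    while( <TEMPFILE> ) { processLine( $_ ) }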

Also, you are throwing away performance in the way you pass data back from your subroutines. E.g.:

return (%hashed);

That builds a hash in hash_array(); the return statement then converts it to a list on the stack. Then, back at the call site:

my (%trec) = &hash_array(@arr);

You convert that list from the stack back into another hash. Then, you immediately return that hash to the caller of line2rec(), converting it to another list on the stack:

my (%trec) = &hash_array(@arr); return (%trec); }

And then back at that call site, you convert that list back into yet another hash:

my (%trec) = &line2rec($line);

And all of that in order to test if the line contains the string 'active':

if ($trec{'active'})

The whole process can be reduced to something like this (the regex will probably need tweaking to select the appropriate field):

my @data; while( <TEMPFILE> ) { /active/ and push @data, $_; }

It'll be *much* faster.



Re^2: Process large text data in array
by hankcoder (Scribe) on Mar 10, 2015 at 15:11 UTC

    BrowserUk, thanks for pointing that out. The process is not only checking for the "active" value; there is more checking involved, this is only a sample. I built the code into subs so it is easier for me to refer to and debug in the future.

    I prefer to use a separate sub call to get the file content instead of using

    while( <TEMPFILE> ) { processLine( $_ ) }

    in every part of the code where I need to retrieve the file content. I'm taking note of your advice and will do more testing on all of it. Thanks.

      Swapping back and forth (and back and forth again) between the hash and a list is still inefficient.

      Use a hash reference instead, so it won't have to make multiple copies of your hash contents.
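
      For example, a minimal sketch (the sub and field names here are illustrative, not taken from the actual code in the thread):

          sub line2rec {
              my( $line ) = @_;
              my %rec;
              # assumption: '|'-separated fields named id, name, active
              @rec{ qw( id name active ) } = split /\|/, $line;
              return \%rec;    # returns one scalar (a reference); nothing is copied
          }

          my $trec = line2rec( $line );    # $trec holds a hash reference
          if( $trec->{active} ) { ... }    # dereference with the arrow operator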

      I share the opinion that it is quite unnecessary to read a 38MB disk file into virtual memory in order to process it. In particular, when that file becomes, say, “10 times larger than it now is,” your current approach might begin to fail. It’s just as easy to pass a file-handle around, and to let that be “your input,” as it is to handle a beefy array. Also consider, if necessary, defining a sub (perhaps a reference to an anonymous function) that can be used to filter each of the lines as they are read: the while loop simply goes to the next line when this function returns false.
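
      For instance, a sketch of the filehandle-plus-filter idea (all the names here are hypothetical):

          sub process_file {
              my( $fh, $filter ) = @_;
              my @wanted;
              while( my $line = <$fh> ) {
                  next unless $filter->( $line );   # skip lines the filter rejects
                  push @wanted, $line;
              }
              return \@wanted;                      # return a reference, not a copy
          }

          open my $fh, '<', 'data.txt' or die "open: $!";
          my $lines = process_file( $fh, sub { $_[0] =~ /active/ } );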

      We know that BrowserUK, in his daily $work, deals with enormous datasets in very high-performance situations. If he says what he just did about your situation, then, frankly, I would take it as a very well-informed directive to “do it that way.” :-)

        Thanks sundialsvc4 for the feedback. In my "untested" opinion, would it be a good approach to have a check in the file-reading sub: if the data size is greater than 30MB (for example), use the file-handle method; otherwise, if it is smaller, read it all into memory? I'm just assuming the process would be much faster using an array when there is less data, compared to using direct file handling just to read a few lines. Correct me if I am wrong. And yes, I am concerned that my approach would fail if the data grows several times larger.
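
        That check might look something like this sketch (the sub name, the filter argument and the 30MB threshold are all hypothetical; -s returns a file's size in bytes):

            sub read_data {
                my( $path, $filter ) = @_;
                open my $fh, '<', $path or die "open $path: $!";
                my @lines;
                if( -s $path > 30 * 1024 * 1024 ) {            # bigger than ~30MB: stream it
                    while( my $line = <$fh> ) {
                        push @lines, $line if $filter->( $line );
                    }
                }
                else {
                    @lines = grep { $filter->( $_ ) } <$fh>;   # smaller: slurp, then filter
                }
                return \@lines;
            }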

        I still consider myself new to using reference variables, so feel free to give me more suggestions on what I should look into. I will slowly re-code the older subs to use references, as suggested.

        Thanks again for your feedback, guys.