Re: Process large text data in array
by BrowserUk (Patriarch) on Mar 10, 2015 at 14:56 UTC
However, after read all data into memory, I need to process it from beginning to end once to get wanted data line based on given criteria,
Why read it all in -- ie. read every line, allocate space for every line, extend the array to accommodate every line -- if you only need to process the array once?
In other words, why not?:
while( <TEMPFILE> ) {
    processLine( $_ )
}
Also, you are throwing away performance with how you are passing data back from your subroutines. Eg:
    return (%hashed);
That builds a hash in hash_array(); the return statement converts it to a list on the stack. Then back at the call site:
    my (%trec) = &hash_array(@arr);
you convert that list from the stack back into another hash. Then you immediately return that hash to the caller of line2rec(), converting it to another list on the stack:
    my (%trec) = &hash_array(@arr);
    return (%trec);
}
And then back at that call site, you convert that list back into yet another hash:
    my (%trec) = &line2rec($line);
And all of that in order to test if the line contains the string 'active':
    if ($trec{'active'})
The whole process can be reduced to something like the following (the regex will probably need tweaking to select the appropriate field):
    my @data;
    while( <TEMPFILE> ) {
        /active/ and push @data, $_;
    }
It'll be *much* faster.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
BrowserUk, thanks for pointing it out. The process doesn't check only for the "active" value; there are more checks, this is just a sample. I built the code into subs so it is easier for me to refer to and debug in future.
I would still prefer to use a separate sub call to get the file content, instead of repeating
    while( <TEMPFILE> ) {
        processLine( $_ )
    }
in every part of the code where I retrieve the file content. I'm taking note of your advice and will do more tests on every piece of it. Thanks.
Swapping back and forth (and back and forth again) between the hash and a list is still inefficient.
Use a hash reference instead, so it won't have to make multiple copies of your hash contents.
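For example, a minimal sketch (reusing the line2rec name from the original code; the parsing here assumes '|' and '=' never occur inside values, which may need adjusting):
    sub line2rec {
        my ($line) = @_;
        my %rec = split /[|=]/, $line;   # assumes '|' and '=' never occur inside values
        return \%rec;                    # return a reference: the hash is never copied
    }

    my @data;
    while ( my $line = <TEMPFILE> ) {
        chomp $line;
        my $trec = line2rec($line);      # $trec is a hash reference
        push @data, $line if $trec->{active};
    }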
I share the opinion that it is quite unnecessary to read a 38MB disk-file into virtual memory in order to process it. In particular, if that file becomes, say, “10 times larger than it now is,” your current approach might begin to fail. It’s just as easy to pass a file-handle around, and to let that be “your input,” as it is to handle a beefy array. Also consider, if necessary, defining a sub (perhaps a reference to an anonymous function ...) that can be used to filter each of the lines as they are read: the while loop simply goes to the next line when this function returns false.
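For instance, a minimal sketch of that idea (the names process_file and $keep are hypothetical, not from the original code): the caller passes an open file-handle plus a filter code reference, and only the lines the filter accepts are kept.
    sub process_file {
        my ($fh, $keep) = @_;              # file-handle and filter code reference
        my @wanted;
        while ( my $line = <$fh> ) {
            chomp $line;
            next unless $keep->($line);    # skip lines the filter rejects
            push @wanted, $line;
        }
        return \@wanted;
    }

    open my $fh, '<', $file_name or die "Cannot open '$file_name': $!";
    my $active_lines = process_file( $fh, sub { $_[0] =~ /active/ } );
    close $fh;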
We know that BrowserUK, in his daily $work, deals with enormous datasets in very high-performance situations. If he says what he just did about your situation, then, frankly, I would take it as a very well-informed directive to “do it that way.” :-)
Re: Process large text data in array
by Corion (Patriarch) on Mar 10, 2015 at 14:38 UTC
I don't know if this makes things faster, but whenever I'm using split to split up data, I find it's usually easier to match what I want to keep:
foreach my $line (@arr) {
    if( $line =~ m/^([^=]+)=(.*)/ ) {
        my( $name, $value ) = ($1,$2);
        $hashed{ $name } = $value;
    } else {
        warn "Unhandled line '$line'";
    };
};
This approach also eliminates your substr gymnastics.
Re: Process large text data in array
by hdb (Monsignor) on Mar 10, 2015 at 14:42 UTC
Turning a line of the format id_1=value|id_2=value|id_3=value|..... into a hash can be vastly simplified:
my $line = "id_1=value|id_2=value|id_3=value";
my %hash = split /[|=]/, $line;
That breaks down if a value itself contains one of the delimiter characters, e.g.:
    foo=bar=baz|bar=bambam
@parts = split /=/, $line, 2;
will return at most two parts, split on the first (if any) equal sign.
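Combined with the split on '|', that gives (a sketch, assuming values may contain '=' but never '|'):
    my %hash = map { split /=/, $_, 2 } split /\|/, $line;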
In my data the values are encoded before storing, so a line looks like:
    id_1=[encoded value]|.....
hdb, your code is excellent!! The speed is reduced to only 21 sec to complete. My previous sub code was rather old, and the previous data format could contain more than one delimiter character. But all my current data has "safe characters" encoding applied before storing, so I guess it is safe to use your code for my purpose.
If it is not too much trouble, could you help me improve the reversal of line2rec? Or is that the simplest and fastest it can go?
#---------------------------------------------------#
# REC2LINE
#---------------------------------------------------#
sub rec2line
{
    my (%trec) = @_;
    my ($newline) = "";
    my ($line);
    foreach $line (keys %trec)
    {
        if ($newline ne "")
        {
            $newline .= "|";
        }
        $newline .= "$line=$trec{$line}";
    } # end foreach
    return ("$newline");
} # end sub
Thanks again.
$newline = join "|", map { "$_=$trec{$_}" } keys %trec;
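That one-liner can replace the whole loop; a sketch of rec2line reduced accordingly (same interface as the original sub):
    sub rec2line {
        my (%trec) = @_;
        return join "|", map { "$_=$trec{$_}" } keys %trec;
    }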
Re: Process large text data in array
by Laurent_R (Canon) on Mar 10, 2015 at 18:55 UTC
at least if possible to reduce the speed of processing large text data
Reducing the speed of your processing of large data is very easy (and does not need any of the counterproductive advice given to you so far by other monks): just add calls to the sleep function. For example (untested code example, because I do not have your data):
my (@new_dat) = ();
foreach my $line (@loaded_data)   #-- loop thru all data
{
    chomp($line);
    my (%trec) = &line2rec($line);
    sleep 1;
    if ($trec{'active'})
    {
        push(@new_dat, $line);
    }
}
Serious benchmarking would be needed, but this is likely to reduce the speed by a factor of about 10,000. If this is not enough of an improvement, just change the parameter passed to the sleep builtin to a larger value.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 10, 2015 at 15:01 UTC
Sorry, I posted this into a separate reply; the content is the same as my reply to hdb above.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 12:10 UTC
** Fastest approach tested so far **
As suggested by BrowserUk, I have done a test using the suggested file-reading method. The results are absolutely encouraging: from the previous reading + processing time of about 21 sec, it dropped to just 15 sec or less, even after the data grew from 300k to 400k lines.
my (@dat) = ();
open (DATF, "<$file_name");
while( <DATF> ) {
    my ($line) = $_;
    chomp($line);
    my (%trec) = &line2rec($line);
    # just do some filtering here
    if ($trec{'active'})
    {
    }
    # just testing to move every data line into array
    push (@dat, $line);
}
close(DATF);
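If the timings need comparing more rigorously than by wall clock, the core Benchmark module can do it; a sketch (read_all and read_stream are hypothetical wrappers around the two approaches being compared):
    use Benchmark qw(cmpthese);

    cmpthese( 10, {
        array_first => sub { read_all($file_name) },      # slurp into an array, then process
        while_loop  => sub { read_stream($file_name) },   # process while reading
    } );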
You could generalise the filtering by keeping an array of filter subs and keeping a line as soon as one of them matches:
my @dat = ();
my @filters;
push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

open my $DATF, '<', $file_name or die "Cannot open '$file_name': $!";
while ( my $line = <$DATF> ) {
    chomp $line;
    foreach my $filter (@filters) {
        $filter->($line) or next;      # keep the line if any filter accepts it
        push @dat, $line;
        last;
    }
}
close($DATF);
An alternative is this:
use threads;
use Thread::Queue;
use constant MAXTHREADS => 2;

my $workQueue = Thread::Queue->new();
my $outQueue  = Thread::Queue->new();
my @threads   = map { threads->new( \&worker ) } 1..MAXTHREADS;

open my $DATF, '<', $file_name or die "Cannot open '$file_name': $!";
while ( <$DATF> ) {
    $workQueue->enqueue($_);
}
close $DATF;
$workQueue->end();                     # workers see undef once the queue is drained

$_->join for @threads;
$outQueue->end();

my @dat;
while ( defined( my $line = $outQueue->dequeue() ) ) {
    push @dat, $line;
}

sub worker {
    my @filters;
    push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };
    while ( defined( my $line = $workQueue->dequeue() ) ) {
        chomp $line;
        foreach my $filter (@filters) {
            $filter->($line) or next;
            $outQueue->enqueue($line);
            last;
        }
    }
}
The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you. This currently requires you to read the entire file into memory first; however, pushing the read process into a separate thread resolves that issue, and pushing the outqueue processing into a separate thread also helps reduce the memory footprint (assuming you're doing something like writing the data into a filtered output file).
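A sketch of what moving the reader into its own thread might look like (this would replace the read loop above; to actually cap memory, the work queue would also need a size limit, which is not shown):
    my $reader = threads->new( sub {
        open my $fh, '<', $file_name or die "Cannot open '$file_name': $!";
        while ( my $line = <$fh> ) {
            $workQueue->enqueue($line);
        }
        close $fh;
        $workQueue->end();             # tell the workers that input is finished
    } );
    # ... the main thread is now free to drain $outQueue while reading continues ...
    $reader->join;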
The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you.
Sorry, but have you actually run and timed that code?
Because it will, unfortunately, run anything from 5 to 50 times slower than the single threaded version on any build of Perl, or OS, I am familiar with.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 13:17 UTC
My major concern now is: what will happen if the user interrupts the process inside the while loop while the file is still open?
open (DATF, "<$file_name");
while( <DATF> ) {
    #-- do whatever here
    #-- user may interrupt before finishing the while loop
}
close(DATF);
Should I be worried about this? Currently I'm only using this while-loop method for input (reading). As for writing data, the data would likely get corrupted.
Any suggestions on this? Thanks.
My major concern now is what will happen when user interrupted the process in the while..loop when file still open?
The same thing as would happen if he interrupted the program while you were filling the array in your OP code.
That is: the file will be closed and the program will exit without producing any output. As you are only reading the file, no data will be harmed.
Of more concern is what happens if you are producing output from within the while loop. Then, if the user interrupts, the output file can contain only partial data.
To address the latter concern -- and prevent any worries about the former -- install an interrupt handler near the top of your program (or in a BEGIN block):
$SIG{ INT } = sub{};
...
while( ... ) {
...
}
That will prevent the user interrupting with ^C. You can do a similar thing for most other signals that the user might use to interrupt.
Search for "%SIG" in perlvar.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Oh wow, this is something new for me to look into.
What if the interrupt is a dropped connection, a closed browser, or the user stopping the page load? My programs mainly run through a web browser; there is no command-line execution involved here. Can the interrupt handler you suggested capture this, or are they all the same?
My theory is that if it is possible to capture such an interrupt with a custom INT handler function, then I should be able to do some cleanup in that function. E.g.:
sub INT_handler {
    # check for any unfinished jobs
    # close all files
    exit(0);
}
$SIG{'INT'} = 'INT_handler';
The code above is my own untested modification, put together from a Google search.