Re: Process large text data in array
by BrowserUk (Patriarch) on Mar 10, 2015 at 14:56 UTC
However, after read all data into memory, I need to process it from beginning to end once to get wanted data line based on given criteria,
Why read it all in -- ie. read every line, allocate space for every line, extend the array to accommodate every line -- if you only need to process the array once?
In other words, why not?:
while( <TEMPFILE> ) {
    processLine( $_ )
}
Also, you are throwing away performance with how you are passing data back from your subroutines. Eg:
    return (%hashed);
That builds a hash in hash_array(); the return statement converts it to a list on the stack. Then back at the call site:
    my (%trec) = &hash_array(@arr);
you convert that list from the stack back into another hash. Then you immediately return that hash to the caller of line2rec(), converting it to another list on the stack:
    my (%trec) = &hash_array(@arr);
    return (%trec);
}
And then back at that call site, you convert that list back into yet another hash:
    my (%trec) = &line2rec($line);
And all of that in order to test if the line contains the string 'active':
    if ($trec{'active'})
The whole process can be reduced to something like the following (the regex will probably need tweaking to select the appropriate field):
    my @data;
    while( <TEMPFILE> ) {
        /active/ and push @data, $_;
    }
It'll be *much* faster.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
BrowserUk, thanks for pointing it out. The process doesn't check only for the "active" value; there are more checks, this is just a sample. I built the code into subs so it is easier for me to refer to and debug in future.
I would still prefer to use a separate sub call to get the file content, instead of repeating
    while( <TEMPFILE> ) {
        processLine( $_ )
    }
in every part of the code where I retrieve the file content. I'm taking note of your advice and will do more tests on every piece of it. Thanks.
Swapping back and forth (and back and forth again) between the hash and a list is still inefficient.
Use a hash reference instead, so it won't have to make multiple copies of your hash contents.
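For example, a minimal sketch (reusing the line2rec name from the original code; the parsing here assumes '|' and '=' never occur inside values, which may need adjusting):
    sub line2rec {
        my ($line) = @_;
        my %rec = split /[|=]/, $line;   # assumes '|' and '=' never occur inside values
        return \%rec;                    # return a reference: the hash is never copied
    }

    my @data;
    while ( my $line = <TEMPFILE> ) {
        chomp $line;
        my $trec = line2rec($line);      # $trec is a hash reference
        push @data, $line if $trec->{active};
    }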
I share the opinion that it is quite unnecessary to read a 38MB disk-file into virtual memory in order to process it. In particular, if that file becomes, say, “10 times larger than it now is,” your current approach might begin to fail. It’s just as easy to pass a file-handle around, and to let that be “your input,” as it is to handle a beefy array. Also consider, if necessary, defining a sub (perhaps a reference to an anonymous function ...) that can be used to filter each of the lines as they are read: the while loop simply goes to the next line when this function returns false.
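For instance, a minimal sketch of that idea (the names process_file and $keep are hypothetical, not from the original code): the caller passes an open file-handle plus a filter code reference, and only the lines the filter accepts are kept.
    sub process_file {
        my ($fh, $keep) = @_;              # file-handle and filter code reference
        my @wanted;
        while ( my $line = <$fh> ) {
            chomp $line;
            next unless $keep->($line);    # skip lines the filter rejects
            push @wanted, $line;
        }
        return \@wanted;
    }

    open my $fh, '<', $file_name or die "Cannot open '$file_name': $!";
    my $active_lines = process_file( $fh, sub { $_[0] =~ /active/ } );
    close $fh;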
We know that BrowserUK, in his daily $work, deals with enormous datasets in very high-performance situations. If he says what he just did about your situation, then, frankly, I would take it as a very well-informed directive to “do it that way.” :-)
Re: Process large text data in array
by Corion (Patriarch) on Mar 10, 2015 at 14:38 UTC
I don't know if this makes things faster, but whenever I'm using split to split up data, I find it's usually easier to match what I want to keep:
foreach my $line (@arr) {
    if( $line =~ m/^([^=]+)=(.*)/ ) {
        my( $name, $value ) = ($1,$2);
        $hashed{ $name } = $value;
    } else {
        warn "Unhandled line '$line'";
    };
};
This approach also eliminates your substr gymnastics.
Re: Process large text data in array
by hdb (Monsignor) on Mar 10, 2015 at 14:42 UTC
Turning a line of the format id_1=value|id_2=value|id_3=value|..... into a hash can be vastly simplified:
my $line = "id_1=value|id_2=value|id_3=value";
my %hash = split /[|=]/, $line;
That breaks down if a value itself contains one of the delimiter characters, e.g.:
    foo=bar=baz|bar=bambam
@parts = split /=/, $line, 2;
will return at most two parts, split on the first (if any) equal sign.
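Combined with the split on '|', that gives (a sketch, assuming values may contain '=' but never '|'):
    my %hash = map { split /=/, $_, 2 } split /\|/, $line;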
In my data the values are encoded before storing, so a line looks like:
    id_1=[encoded value]|.....
hdb, your code is excellent!! The speed is reduced to only 21 sec to complete. My previous sub code was rather old, and the previous data format could contain more than one delimiter character. But all my current data has "safe characters" encoding applied before storing, so I guess it is safe to use your code for my purpose.
If it is not too much trouble, could you help me improve the reversal of line2rec? Or is that the simplest and fastest it can go?
#---------------------------------------------------#
# REC2LINE
#---------------------------------------------------#
sub rec2line
{
    my (%trec) = @_;
    my ($newline) = "";
    my ($line);
    foreach $line (keys %trec)
    {
        if ($newline ne "")
        {
            $newline .= "|";
        }
        $newline .= "$line=$trec{$line}";
    } # end foreach
    return ("$newline");
} # end sub
Thanks again.
$newline = join "|", map { "$_=$trec{$_}" } keys %trec;
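That one-liner can replace the whole loop; a sketch of rec2line reduced accordingly (same interface as the original sub):
    sub rec2line {
        my (%trec) = @_;
        return join "|", map { "$_=$trec{$_}" } keys %trec;
    }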
Re: Process large text data in array
by Laurent_R (Canon) on Mar 10, 2015 at 18:55 UTC
at least if possible to reduce the speed of processing large text data
Reducing the speed of your processing of large data is very easy (and does not need any of the counterproductive advice given to you so far by other monks): just add calls to the sleep function. For example (untested code example, because I do not have your data):
my (@new_dat) = ();
foreach my $line (@loaded_data)   #-- loop thru all data
{
    chomp($line);
    my (%trec) = &line2rec($line);
    sleep 1;
    if ($trec{'active'})
    {
        push(@new_dat, $line);
    }
}
Serious benchmarking would be needed, but this is likely to reduce the speed by a factor of about 10,000. If this is not enough of an improvement, just change the parameter passed to the sleep builtin to a larger value.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 10, 2015 at 15:01 UTC
Sorry, I posted this into a separate reply; the content is the same as my reply to hdb above.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 12:10 UTC
** Fastest approach tested so far **
As suggested by BrowserUk, I have done a test using the suggested file-reading method. The results are absolutely encouraging: from the previous reading + processing time of about 21 sec, it dropped to just 15 sec or less, even after the data grew from 300k to 400k lines.
my (@dat) = ();
open (DATF, "<$file_name");
while( <DATF> ) {
    my ($line) = $_;
    chomp($line);
    my (%trec) = &line2rec($line);
    # just do some filtering here
    if ($trec{'active'})
    {
    }
    # just testing to move every data line into array
    push (@dat, $line);
}
close(DATF);
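If the timings need comparing more rigorously than by wall clock, the core Benchmark module can do it; a sketch (read_all and read_stream are hypothetical wrappers around the two approaches being compared):
    use Benchmark qw(cmpthese);

    cmpthese( 10, {
        array_first => sub { read_all($file_name) },      # slurp into an array, then process
        while_loop  => sub { read_stream($file_name) },   # process while reading
    } );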
You could generalise the filtering by keeping an array of filter subs and keeping a line as soon as one of them matches:
my @dat = ();
my @filters;
push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

open my $DATF, '<', $file_name or die "Cannot open '$file_name': $!";
while ( my $line = <$DATF> ) {
    chomp $line;
    foreach my $filter (@filters) {
        $filter->($line) or next;      # keep the line if any filter accepts it
        push @dat, $line;
        last;
    }
}
close($DATF);
An alternative is this:
use threads;
use Thread::Queue;
use constant MAXTHREADS => 2;

my $workQueue = Thread::Queue->new();
my $outQueue  = Thread::Queue->new();
my @threads   = map { threads->new( \&worker ) } 1..MAXTHREADS;

open my $DATF, '<', $file_name or die "Cannot open '$file_name': $!";
while ( <$DATF> ) {
    $workQueue->enqueue($_);
}
close $DATF;
$workQueue->end();                     # workers see undef once the queue is drained

$_->join for @threads;
$outQueue->end();

my @dat;
while ( defined( my $line = $outQueue->dequeue() ) ) {
    push @dat, $line;
}

sub worker {
    my @filters;
    push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };
    while ( defined( my $line = $workQueue->dequeue() ) ) {
        chomp $line;
        foreach my $filter (@filters) {
            $filter->($line) or next;
            $outQueue->enqueue($line);
            last;
        }
    }
}
The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you. This currently requires you to read the entire file into memory first; however, pushing the read process into a separate thread resolves that issue, and pushing the outqueue processing into a separate thread also helps reduce the memory footprint (assuming you're doing something like writing the data into a filtered output file).
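A sketch of what moving the reader into its own thread might look like (this would replace the read loop above; to actually cap memory, the work queue would also need a size limit, which is not shown):
    my $reader = threads->new( sub {
        open my $fh, '<', $file_name or die "Cannot open '$file_name': $!";
        while ( my $line = <$fh> ) {
            $workQueue->enqueue($line);
        }
        close $fh;
        $workQueue->end();             # tell the workers that input is finished
    } );
    # ... the main thread is now free to drain $outQueue while reading continues ...
    $reader->join;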
The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you.
Sorry, but have you actually run and timed that code?
Because it will, unfortunately, run anything from 5 to 50 times slower than the single threaded version on any build of Perl, or OS, I am familiar with.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Process large text data in array
by hankcoder (Scribe) on Mar 11, 2015 at 13:17 UTC
My major concern now is: what will happen if the user interrupts the process inside the while loop while the file is still open?
open (DATF, "<$file_name");
while( <DATF> ) {
    #-- do whatever here
    #-- user may interrupt before finishing the while loop
}
close(DATF);
Should I be worried about this? Currently I'm only using this while-loop method for input (reading). As for writing data, the data would likely get corrupted.
Any suggestions on this? Thanks.
My major concern now is what will happen when user interrupted the process in the while..loop when file still open?
The same thing as would happen if he interrupted the program while you were filling the array in your OP code.
That is: the file will be closed and the program will exit without producing any output. As you are only reading the file, no data will be harmed.
Of more concern is what happens if you are producing output from within the while loop. Then, if the user interrupts, the output file can contain only partial data.
To address the latter concern -- and prevent any worries about the former -- install an interrupt handler near the top of your program (or in a BEGIN block):
$SIG{ INT } = sub{};
...
while( ... ) {
...
}
That will prevent the user interrupting with ^C. You can do a similar thing for most other signals that the user might use to interrupt.
Search for "%SIG" in perlvar.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Oh wow, this is something new for me to look into.
What if the interrupt is a dropped connection, a closed browser, or the user stopping the page load? My programs mainly run through a web browser; there is no command-line execution involved here. Can the interrupt handler you suggested capture this, or are they all the same?
My theory is that if it is possible to capture such an interrupt with a custom INT handler function, then I should be able to do some cleanup in that function. E.g.:
sub INT_handler {
    # check for any unfinished jobs
    # close all files
    exit(0);
}
$SIG{'INT'} = 'INT_handler';
The code above is my own untested modification, put together from a Google search.