Hi, I'm trying to find a better solution, or at least to reduce the time it takes to process large text data that is read from a file into an array. I will keep everything as simple as possible and post the related code here.
The current test is on a local Windows XP Pro machine running ActivePerl. The live system will be on a Unix/Linux environment.
The text file is line-oriented, with lines of variable length. The current test file contains about 300k lines, roughly 38MB. A sample of the line format:
id_1=value|id_2=value|id_3=value|.....
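For illustration, a concrete line could look like this (field names and values are made up, not from the real data):

userid=1001|name=John|active=1|score=42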
I tested 2 ways of retrieving the file content; both work very fast, in about 4 seconds.
#-------------------------------------------------------#
#-------------------------------------------------------#
sub get_filecontent {
    my @temp_data;
    open (TEMPFILE, "$_[0]");
    @temp_data = <TEMPFILE>;
    close (TEMPFILE);
    return( @temp_data );
}
#-------------------------------------------------------#
#-------------------------------------------------------#
sub get_fileRef {
    my ($fname, $ref_dat) = @_;
    open (TEMPFILE, "$fname");
    @{$ref_dat} = <TEMPFILE>;
    close (TEMPFILE);
}
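A variant I am also considering is filtering while reading, instead of slurping the whole file into an array first; that would avoid holding all 38MB in memory. A minimal sketch, reusing the line2rec sub and the 'active' check shown further below:

#-------------------------------------------------------#
#-- sketch: filter while reading, no full in-memory copy
sub get_matching_lines {
    my ($fname) = @_;
    my (@matches) = ();
    open(my $fh, '<', $fname) or die "Cannot open $fname: $!";
    while (my $line = <$fh>) {
        chomp($line);
        my (%trec) = &line2rec($line);    #-- parse as we go
        push(@matches, $line) if ($trec{'active'});
    }
    close($fh);
    return (@matches);
}
#-------------------------------------------------------#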
However, after reading all the data into memory, I need to process it once from beginning to end to pick out the wanted lines based on given criteria and to count the total matches. With this processing added, the total time is about 37 seconds.
I'm not sure if this speed is normal, but if it can be reduced, that would be really great.
The code used to process the array is here:
#-- filter
my (@new_dat) = ();
foreach my $line (@loaded_data)    #-- loop thru all data
{
    chomp($line);
    my (%trec) = &line2rec($line);
    if ($trec{'active'}) {
        push(@new_dat, $line);
    }
}
(@loaded_data) = (@new_dat);    #-- overwrite
(@new_dat) = ();
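One idea to speed this loop up, assuming the flag really appears literally as active=1 in the raw line (I would have to verify that against the data): a cheap regex pre-check to skip the full parse for lines that cannot match.

#-- sketch: same filter with a cheap regex pre-check
#-- ASSUMES the flag is stored literally as "active=1" in the raw line
my (@new_dat) = ();
foreach my $line (@loaded_data)
{
    next unless $line =~ /(?:^|\|)active=1(?:\||$)/;    #-- cheap reject before parsing
    chomp($line);
    my (%trec) = &line2rec($line);    #-- full parse only for likely matches
    if ($trec{'active'}) {
        push(@new_dat, $line);
    }
}
(@loaded_data) = (@new_dat);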
Subroutine code for converting a line to a record (line2rec):
#---------------------------------------------------#
# LINE2REC
#---------------------------------------------------#
# convert a line into a record by separator |
sub line2rec {
    my ($line) = @_;
    my (@arr) = split( /\|/, "$line" );
    my (%trec) = &hash_array(@arr);
    return (%trec);
}
#---------------------------------------------------#
#---------------------------------------------------#
sub hash_array {
    my (@arr) = @_;
    my ($line, $name, $value, $len_name);
    my (@parts) = ();
    my (%hashed) = ();
    foreach $line (@arr) {
        chomp($line);
        if ($line =~ /=/) {
            (@parts) = ();
            (@parts) = split( /\=/, $line );    #-- just in case got more than one = separator
            $name = "$parts[0]";                #-- use first element as name
            $len_name = length($name) + 1;
            $value = substr( "$line", $len_name, length("$line") - $len_name );
            #-- !! cannot use join; if last char is separator then it will disappear after split
            $hashed{$name} = $value;
        }
    }
    return (%hashed);
}
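One simplification I am considering for the '=' handling: split with a LIMIT of 2 splits only on the first '=', so the value keeps any further '=' characters, and a trailing separator survives too, without the substr workaround. A sketch of the same record logic:

#---------------------------------------------------#
#-- sketch: same record logic using split with LIMIT 2
sub line2rec_alt {
    my ($line) = @_;
    chomp($line);
    my (%rec) = ();
    foreach my $field (split( /\|/, $line )) {
        next unless ($field =~ /=/);
        my ($name, $value) = split( /=/, $field, 2 );    #-- LIMIT 2: value keeps any extra '='
        $rec{$name} = $value;
    }
    return (%rec);
}
#---------------------------------------------------#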
If I comment out the call to &line2rec($line), the time drops to about 10 seconds. So I guess this sub's code can be further improved.
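To measure whether any rewrite actually helps, I could compare variants with the core Benchmark module; a minimal sketch, using the made-up sample line and the line2rec_alt sketch from above:

use Benchmark qw(cmpthese);

my $sample = 'userid=1001|name=John|active=1|score=42';    #-- made-up test line

cmpthese( -3, {    #-- run each variant for about 3 CPU seconds
    original => sub { my (%r) = &line2rec($sample); },
    alt      => sub { my (%r) = &line2rec_alt($sample); },
} );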
Any suggestions are much appreciated. Thanks.