Hi, I'm trying to find a better solution, or at least to reduce the time it takes to process large text data that has been read from a file into an array. I'll keep everything as simple as possible and post the relevant code here.

The current test is on a local machine running Windows XP Pro with ActivePerl. The live system will be on a Unix/Linux environment.

The text file is line-based, with variable-length lines. The current test file contains about 300k lines, roughly 38MB. A sample line looks like:

id_1=value|id_2=value|id_3=value|.....

I tested two ways of reading the file content; both are very fast, taking only about 4 seconds.

#-------------------------------------------------------#
sub get_filecontent {
    my @temp_data;
    open (TEMPFILE, "$_[0]");
    @temp_data = <TEMPFILE>;
    close (TEMPFILE);
    return( @temp_data );
}

#-------------------------------------------------------#
sub get_fileRef {
    my ($fname, $ref_dat) = @_;
    open (TEMPFILE, "$fname");
    @{$ref_dat} = <TEMPFILE>;
    close (TEMPFILE);
}
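As a side note, the same slurp can be written with a lexical filehandle, the three-argument form of open, and error checking, which makes silent failures (such as a missing file) easier to spot. This is only a minimal sketch; the sub name is illustrative, not part of the original code:

#-- sketch: same slurp-by-reference, but with a lexical filehandle,
#-- three-argument open, and error reporting
sub get_fileRef_checked {
    my ($fname, $ref_dat) = @_;
    open( my $fh, '<', $fname ) or die "Cannot open $fname: $!";
    @{$ref_dat} = <$fh>;
    close($fh);
}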

However, after reading all the data into memory, I need to scan it once from beginning to end to pick out the lines matching given criteria and count the total matches. With this processing added, the total time is about 37 seconds.

I'm not sure whether this speed is normal, but if it can be reduced, that would be really great.

The code used to process the array is here:

#-- filter
my (@new_dat) = ();
foreach my $line (@loaded_data)    #-- loop thru all data
{
    chomp($line);
    my (%trec) = &line2rec($line);
    if ($trec{'active'}) {
        push(@new_dat, $line);
    }
}
(@loaded_data) = (@new_dat);    #-- overwrite
(@new_dat) = ();
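The same filter can also be written with grep, which removes the need for the temporary @new_dat array. A minimal sketch, assuming the lines only need to be chomped once up front:

#-- sketch: equivalent filter using grep; chomp the whole array first
chomp(@loaded_data);
@loaded_data = grep {
    my %trec = &line2rec($_);
    $trec{'active'};
} @loaded_data;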

The subroutine code for line2rec:

#---------------------------------------------------#
# LINE2REC
#---------------------------------------------------#
# convert a line into a record by separator |
sub line2rec {
    my ($line) = @_;
    my (@arr)  = split( /\|/, "$line" );
    my (%trec) = &hash_array(@arr);
    return (%trec);
}

#---------------------------------------------------#
sub hash_array {
    my (@arr) = @_;
    my ($line, $name, $value, $len_name);
    my (@parts)  = ();
    my (%hashed) = ();
    foreach $line (@arr) {
        chomp($line);
        if ($line =~ /=/) {
            (@parts) = ();
            (@parts) = split( /\=/, $line );    #-- just in case got more than one = separator
            $name     = "$parts[0]";            #-- use first element as name
            $len_name = length($name) + 1;
            $value = substr( "$line", $len_name, length("$line") - $len_name );
            #-- !! cannot use join; if last char is separator it will disappear after split
            $hashed{$name} = $value;
        }
    }
    return (%hashed);
}
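For comparison, hash_array can be simplified by splitting each field only on the first '=' using split with a limit of 2; the limit also preserves an empty trailing value when the last character is the separator, which is what the substr arithmetic above works around. A sketch only, not the original code:

#-- sketch: split each field on the first '=' only; a limit of 2 keeps
#-- later '=' characters in the value and preserves an empty value
sub hash_array {
    my (@arr) = @_;
    my (%hashed);
    foreach my $field (@arr) {
        chomp($field);
        next unless $field =~ /=/;
        my ($name, $value) = split( /=/, $field, 2 );
        $hashed{$name} = $value;
    }
    return (%hashed);
}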

If I comment out the call to &line2rec($line), the time drops to 10 seconds, so I guess this sub can be further improved.
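One possible direction, if the filtering criteria only involve the 'active' field, is to test that field with a single regex per line instead of building a full hash for every line. A sketch, assuming the field literally appears in each line as "active=value":

#-- sketch: keep a line if its active field has a true value,
#-- without converting the whole line into a hash
my @new_dat;
foreach my $line (@loaded_data) {
    chomp($line);
    if ( $line =~ /(?:^|\|)active=([^|]*)/ and $1 ) {
        push(@new_dat, $line);
    }
}
@loaded_data = @new_dat;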

Any suggestions are much appreciated. Thanks.

