baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

hi monks,

I need help with this one.

I would like to speed up reading from a file. I know I can do that by first splitting the file into several smaller files and then forking the reading process. Is there any other, more elegant way to do this?

Why am I trying to do this? My program takes a line from the file and processes it. The processing is what slows the procedure down, and because I'm while-looping through the file, my idea was to split the file and then fork the whole procedure for every piece of it. Putting the file in memory is not an option (the file is too big for my PC).

So what I'm asking for is a different point of view on this problem of mine, a different idea...

Thanks

pseudocode:

    open(original_file);
    $counter = 0;
    while (original_file) {
        $counter++;
    }
    my $piece = $counter / 4;    # let's say the file has an equal number of lines per piece

    # second pass: copy the lines into four part files
    my $count_for_piece = 0;
    open(file_part);
    while (original_file) {
        if ($count_for_piece == $piece) {
            close file_part_handle;
            open(file_part_new);
            $count_for_piece = 0;
        }
        print into file_part_handle;
        $count_for_piece++;
    }

    my @ch;
    for (1 .. 4) {
        my $pid = fork();
        if (!defined $pid) {     # fork returns undef on failure
            die error;
        }
        elsif ($pid) {           # parent: remember the child
            push(@ch, $pid);
        }
        else {                   # child
            #read from file 1 and do some processing
            exit;
        }
    }
    foreach (@ch) {
        waitpid($_, 0);
    }

This is just an example of what I do to speed up my work!

Update:

#read from file 1 and do some processing

I really didn't benchmark that, but what happens here is this: the line is read, a regex identifies a number in it, and that number is looked up in an in-memory hash table. Then, according to a correlated value from that table, a quick statistical correction (FDR) is calculated for that value. So basically, what I had in mind when trying to speed things up was to divide the calculation and the regex identification across several cores (the CPUs are at 100% when I do my parallelization as described above). I'll do some benchmarking later and post the results.

Replies are listed 'Best First'.
Re: fork IO
by Corion (Patriarch) on Jun 08, 2009 at 08:51 UTC

    Personally, I would make the program take a byte offset, not a line number. That way, you can skip scanning the file for line numbers to get an "optimal" distribution. Instead, each copy seeks to its start offset, moves to the beginning of the next line, and stops once the file position has passed its stop offset. I'd use a modified runN to do the parallelization - it will simply spawn four copies of your program and give each one its start and stop parameters:

    # line processing program
    my ($file, $start, $stop) = @ARGV;
    $start ||= 0;
    $stop  ||= -s $file;

    open my $fh, '<', $file or die "Cannot open '$file': $!";
    seek $fh, $start, 0;
    # now position at the first line after $start
    # if we're not starting at the beginning of a file
    if ($start) { <$fh>; }

    # $position is the offset at which the next line begins
    my $position = tell $fh;
    while ($position <= $stop and defined(my $line = <$fh>)) {
        $position += length $line;
        ...
    }

    You could also use tell every trip around the main loop, but I found tell to be slowing down some of my IO loops. On the other hand, if IO is slowing you down, fork and all the CPUs in the world won't make that faster.
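    A minimal sketch of how the parent side of this might look, assuming the workers are forked from one script instead of being spawned through runN. The names byte_ranges, lines_in_range and count_lines_parallel are made up for this illustration, and the "processing" is reduced to counting lines so that the result is easy to check. Adjacent ranges share an endpoint: a line belongs to the range in which it begins, so each line is handled exactly once.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Split [0, $size] into $workers contiguous byte ranges that share endpoints.
# Assumption: the file holds at least $workers bytes.
sub byte_ranges {
    my ($size, $workers) = @_;
    my $chunk = int($size / $workers);
    my @ranges;
    for my $i (0 .. $workers - 1) {
        my $start = $i * $chunk;
        my $stop  = $i == $workers - 1 ? $size : $start + $chunk;
        push @ranges, [$start, $stop];
    }
    return @ranges;
}

# Return the lines that begin inside the range. A worker with a non-zero
# $start first discards the line that the previous range finishes.
sub lines_in_range {
    my ($file, $start, $stop) = @_;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    seek($fh, $start, 0) or die "Cannot seek in '$file': $!";
    if ($start) { my $partial = <$fh>; }
    my $position = tell $fh;    # offset at which the next line begins
    my @lines;
    while ($position <= $stop and defined(my $line = <$fh>)) {
        $position += length $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# Fork one child per range; each child reports its line count over a pipe.
sub count_lines_parallel {
    my ($file, $workers) = @_;
    my $size = -s $file;
    my @children;
    for my $range (byte_ranges($size, $workers)) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                      # child
            close $reader;
            my @lines = lines_in_range($file, @$range);
            print {$writer} scalar(@lines), "\n";
            close $writer;
            exit 0;
        }
        close $writer;                        # parent keeps only the read end
        push @children, [$pid, $reader];
    }
    my $total = 0;
    for my $child (@children) {
        my ($pid, $reader) = @$child;
        my $count = <$reader>;
        die "worker $pid reported nothing" unless defined $count;
        chomp $count;
        $total += $count;
        waitpid $pid, 0;
    }
    return $total;
}

$| = 1;    # unbuffered STDOUT, so children never repeat buffered output
my ($tmp, $file) = tempfile(UNLINK => 0);
print {$tmp} "record $_ " . ('x' x ($_ % 7)) . "\n" for 1 .. 1000;
close $tmp or die "Cannot close '$file': $!";
print "workers processed ", count_lines_parallel($file, 4), " lines\n";
unlink $file;
```

    In a real program the child would do the regex and hash lookup on each line instead of just counting, and would write its results to its own output file or back through the pipe.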

Re: fork IO
by jethro (Monsignor) on Jun 08, 2009 at 08:45 UTC

    It seems the processing of the lines is independent of each other. If the output is as well, you might give each fork a different starting line (0 to 3) and let them read the same file but only process every fourth line. No need to split the source file then.
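    A rough sketch of that idea, assuming four workers and a stand-in process_line (both names and the worker count are made up for this illustration). Every worker still reads the whole file, which only pays off when the per-line processing, not the reading, is the expensive part.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real per-line work.
sub process_line {
    my ($line) = @_;
    chomp $line;
    return "seen: $line";
}

# Worker $id of $workers handles lines $id, $id + $workers, ... (0-based).
sub process_every_nth {
    my ($file, $id, $workers) = @_;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    my @results;
    while (my $line = <$fh>) {
        # $. is the 1-based number of the line just read
        next unless ($. - 1) % $workers == $id;
        push @results, process_line($line);
    }
    close $fh;
    return @results;
}

# Fork one child per worker id and wait for all of them.
sub run_workers {
    my ($file, $workers) = @_;
    my @pids;
    for my $id (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child
            my @results = process_every_nth($file, $id, $workers);
            print "worker $id handled ", scalar(@results), " lines\n";
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;
}

$| = 1;    # unbuffered STDOUT, so children never repeat buffered output
my ($tmp, $file) = tempfile(UNLINK => 0);
print {$tmp} "row $_\n" for 0 .. 9;
close $tmp or die "Cannot close '$file': $!";
run_workers($file, 4);
unlink $file;
```

    The order in which the workers print is not deterministic, so if the output order matters each worker should write to its own file.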

Re: fork IO
by cdarke (Prior) on Jun 08, 2009 at 08:53 UTC

    If the lines are fixed length then you can determine the offset position of each chunk assigned to each of your worker processes fairly easily. Then use seek to position the file pointer (after the open and before you read).

    If the lines are not fixed length then you need to find the position of each chunk by scanning the file (use tell). I suggest you only do this once in the primary (first) program and pass the offset and length to the children.

    On Windows you might be better off using overlapped IO, but that requires the Win32 API.
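    Both cases from the first two paragraphs can be sketched roughly as follows. The function names and the hash layout (offset, length) are assumptions made for this illustration; the files are opened with the :raw layer so that tell and length count the same bytes on every platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Fixed-length records: chunk offsets follow from arithmetic alone.
sub fixed_chunks {
    my ($size, $record_len, $chunks) = @_;
    my $records   = int($size / $record_len);
    my $per_chunk = int(($records + $chunks - 1) / $chunks);    # ceiling
    my @out;
    for (my $first = 0; $first < $records; $first += $per_chunk) {
        my $left = $records - $first;
        my $n    = $per_chunk < $left ? $per_chunk : $left;
        push @out, { offset => $first * $record_len, length => $n * $record_len };
    }
    return @out;
}

# Variable-length lines: scan once in the parent, cutting a chunk at the
# first line boundary at or past the target chunk size.
sub scanned_chunks {
    my ($file, $chunks) = @_;
    my $size   = -s $file;
    my $target = $size / $chunks;
    open my $fh, '<:raw', $file or die "Cannot open '$file': $!";
    my @out;
    my $start = 0;
    while (defined(my $line = <$fh>)) {
        my $pos = tell $fh;    # offset just past the line that was read
        if ($pos - $start >= $target && @out < $chunks - 1) {
            push @out, { offset => $start, length => $pos - $start };
            $start = $pos;
        }
    }
    close $fh;
    push @out, { offset => $start, length => $size - $start } if $size > $start;
    return @out;
}

# What a child would do with the offset and length it was handed.
sub read_chunk {
    my ($file, $chunk) = @_;
    open my $fh, '<:raw', $file or die "Cannot open '$file': $!";
    seek($fh, $chunk->{offset}, 0) or die "Cannot seek in '$file': $!";
    my $remaining = $chunk->{length};
    my @lines;
    while ($remaining > 0 and defined(my $line = <$fh>)) {
        $remaining -= length $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

my ($tmp, $file) = tempfile(UNLINK => 0);
print {$tmp} "id=$_ " . ('x' x ($_ % 5)) . "\n" for 1 .. 20;
close $tmp or die "Cannot close '$file': $!";
printf "offset %4d  length %4d\n", @{$_}{qw(offset length)}
    for scanned_chunks($file, 4);
unlink $file;
```

    Because every chunk starts on a line boundary, the children need no extra logic for partial lines; the price is one full sequential read of the file in the parent before any worker starts.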
Re: fork IO
by BrowserUk (Patriarch) on Jun 08, 2009 at 13:19 UTC