julio_514 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'd like to know if doing the following is possible. If it is, some insight into how to proceed or where to look would be highly appreciated.

Basically, I have one 100 GB file of DNA sequences, with each sequence entry held on exactly 4 lines. Is it possible to launch multiple child processes (on a multicore machine) that simultaneously access different parts of the file, perform some calculation, and write to an output? I don't know how to specify which part of the file each process should work on.

I know this kind of work is feasible by splitting the file with, say, the split command and writing a Parallel::ForkManager loop, but I would like to save time by avoiding writing the new split files to disk.
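
For reference, the split-file version I have in mind looks roughly like this (a sketch only: the chunk names and the per-record calculation are placeholders for my real task):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @chunks = glob 'chunk.*';                  ## pieces written by e.g. `split -l 4000000`
    my $pm     = Parallel::ForkManager->new( 4 ); ## one worker per core

    for my $chunk ( @chunks ) {
        $pm->start and next;                      ## parent: spawn a child, move on
        open my $fh, '<', $chunk or die "$chunk: $!";
        while( !eof $fh ) {
            my @rec = map scalar <$fh>, 1 .. 4;   ## one 4-line sequence entry
            ## ... the calculation and output for this record would go here ...
        }
        close $fh;
        $pm->finish;                              ## child exits
    }
    $pm->wait_all_children;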

Thanks! Julio

Re: multiple processes to access one file
by BrowserUk (Patriarch) on Aug 21, 2012 at 07:55 UTC

    Here is a threaded version. It should be readily adaptable to forking if that is your thing:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;

    ## threadsafe output routines.
    $|++; ## Doesn't work without this!
    my $semStdout :shared;
    sub tprint{ lock $semStdout; print @_; }
    my $semStderr :shared;
    sub twarn{ lock $semStderr; print STDERR @_; }

    sub findNextRecStart {
        ## filehandle, calculated start byte, thread id (for tracing)
        my( $fh, $start, $tid ) = @_;
    #    twarn "[$tid] $start";

        ## seek to the start byte -1; just in case the calculated posn hits bang on
        seek $fh, $start-1, 0;

        ## Read a buffer full; we'd be really unlucky to not find a record start in 4k,
        ## but you could increase this to say 64k if it fails.
        read( $fh, my $buffer, 4096 );

        ## Search for a full record that doesn't have @/+ as the first char in the 2nd line
        $buffer =~ m[\n(\@)(?:[^@+][^\n]+\n){2}\+] or die "Couldn't locate record start";

        ## Remember the offset into the buffer where we found it.
        my $startFound = $-[1];

        ## Now count the lines between the start of the buffer and that point.
        my $previousLines = substr( $buffer, 0, $startFound ) =~ tr[\n][\n];

        ## And calculate our way back to the first full record after the calculated start posn.
        my $skipLines = ( $previousLines - 1 ) % 4 + 1;
    #    twarn "[$tid] $skipLines";

        ## Seek back to that calculated start posn.
        seek $fh, $start, 0;

        ## Then skip forward the calculated number of lines.
        scalar <$fh> for 1 .. $skipLines;
    #    twarn "[$tid] ", tell $fh;
        return;
    }

    sub worker {
        my $tid = threads->tid;

        ## the name of the file, the byte offsets for the thread
        ## to start and end processing at
        my( $file, $start, $end ) = @_;

        open my $FASTQ, '<', $file or die $!;

        ## If a non-zero start posn, find the start of the next full record.
        findNextRecStart( $FASTQ, $start, $tid ) if $start;

        ## process records until the end of this thread's section.
        while( tell( $FASTQ ) < $end ) {
            my @lines = map scalar( <$FASTQ> ), 1 .. 4;
            chomp @lines;

            ## process this record ( in @lines[ 0 .. 3 ] ) here...
            tprint "[$tid] $lines[0]";
        }
    }

    ## Grab the size of the file
    my $size = -s $ARGV[0] or die "$! : $ARGV[ 0 ]";

    ## Calculate each thread's start posn
    my $quarter = int( $size / 4 );
    my @starts = map $quarter * $_, 0 .. 3;
    push @starts, $size;

    ## Start 4 threads and wait for them to end.
    $_->join for map{
        async( \&worker, $ARGV[ 0 ], @starts[ $_, $_+1 ] )
    } 0 .. 3;
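
    If you prefer forking, here is a rough (untested) sketch of the same division of labour: each child opens its own handle on the file and works its own byte range, and inside worker() you'd use $$ rather than threads->tid for an id:

    ## Untested fork sketch: same @starts byte ranges, one child per range.
    my @pids;
    for my $i ( 0 .. 3 ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if( $pid == 0 ) {                             ## child: process one range, then exit
            worker( $ARGV[ 0 ], @starts[ $i, $i+1 ] );
            exit 0;
        }
        push @pids, $pid;                             ## parent: remember the child
    }
    waitpid $_, 0 for @pids;                          ## reap all children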

    And a test. Generate a fastq file with a mix of good and dodgy records (every second record is clean!):

    perl -E"say qq[\@record ${\($_-1)}\n\@pqrs\n+record ${\($_-1)}\n+pqrs\n\@record $_\nabcd\n+record $_\nefgh] for map{ $_*2-1 } 1 .. 25" > test.fastq

    C:\test>head test.fastq
    @record 0
    @pqrs
    +record 0
    +pqrs
    @record 1
    abcd
    +record 1
    efgh
    @record 2
    @pqrs

    And process that file using 4 threads, outputting just the first line of each record:

    C:\test>988536 test.fastq
    [1] @record 0
    [1] @record 1
    [1] @record 2
    [1] @record 3
    [1] @record 4
    [2] @record 13
    [2] @record 14
    [1] @record 5
    [1] @record 6
    [1] @record 7
    [1] @record 8
    [2] @record 15
    [2] @record 16
    [1] @record 9
    [4] @record 38
    [3] @record 26
    [2] @record 17
    [1] @record 10
    [4] @record 39
    [3] @record 27
    [2] @record 18
    [1] @record 11
    [4] @record 40
    [3] @record 28
    [2] @record 19
    [1] @record 12
    [4] @record 41
    [3] @record 29
    [2] @record 20
    [4] @record 42
    [3] @record 30
    [2] @record 21
    [4] @record 43
    [3] @record 31
    [2] @record 22
    [4] @record 44
    [3] @record 32
    [2] @record 23
    [4] @record 45
    [3] @record 33
    [2] @record 24
    [4] @record 46
    [3] @record 34
    [2] @record 25
    [4] @record 47
    [3] @record 35
    [4] @record 48
    [4] @record 49
    [3] @record 36
    [3] @record 37

    If you check the record numbers, all 50 records (0 through 49) are processed correctly.

    Now run the program on a 400 MB fastq file:

    [ 8:52:25.06] C:\test>988536 sample.fastq | wc -l
    2500000

    [ 8:52:57.92] C:\test>

    2.5 million records in 32 seconds.



      Wow Thanks mate!! I'll give it a try and let you know how it goes. J
Re: multiple processes to access one file
by BrowserUk (Patriarch) on Aug 21, 2012 at 01:22 UTC
    DNA sequences, with each sequence entry held on exactly 4 lines.

    Are these FastQ files?



      yes they are.

        Having 4 processes (or as I would use: threads) reading from different parts of the same file is not a problem.

        The problems come entirely from the hideous format design of FastQ files.

        As Wikipedia puts it:

        it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

        Dividing the file size into 4 and having each thread/process seek into the file to a different position is simple and fast.

        The problem is how to then skip forward from the calculated start position to locate the start of the nearest (next) 4-line record.

        The format specifies that the first character of the 1st line of each record is '@'; and the first character of the 3rd line is '+'; but dumbly, these marker characters can also appear as part of the quality information in the 2nd and 4th lines -- including as the first character of each of those lines.
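
        For example (a made-up record; quality strings encode scores as printable ASCII, so '@' and '+' are perfectly legal quality characters):

        @SEQ_1     <- line 1: identifier; first char is '@' by spec
        GATTACA    <- line 2: bases
        +          <- line 3: separator; first char is '+' by spec
        @IIIIII    <- line 4: quality string, which may ALSO begin with '@'

        A scan for a line beginning with '@' can therefore land on line 4 of one record just as easily as on line 1 of the next.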

        That makes leaping into the file and finding the start of a record surprisingly difficult, with the only simple alternative being to read forward in groups of 4 lines from the start of the file, which kinda defeats the purpose.

        Theoretically, reading forward until you have 4 consecutive lines where the 1st & 3rd start with '@' & '+' respectively, and the other two do not, should (I think) establish a datum point from which each thread/process can calculate an appropriate starting point to advance from. I'll get back to you once I've convinced myself of that.
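
        A rough, untested sketch of that idea (assuming $fh has already been seeked to the calculated byte position):

        ## Discard the (possibly partial) line we landed in, then slide a
        ## 4-line window forward until it matches the unambiguous record shape.
        scalar <$fh>;
        my @w;
        while( my $line = <$fh> ) {
            push @w, $line;
            shift @w while @w > 4;
            next unless @w == 4;
            last if $w[0] =~ /^\@/      ## 1st line starts with '@'
                and $w[1] !~ /^[\@+]/   ## 2nd line starts with neither marker
                and $w[2] =~ /^\+/      ## 3rd line starts with '+'
                and $w[3] !~ /^[\@+]/;  ## and neither does the 4th
        }
        ## @w now holds one full record; every 4 lines from here starts another.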

