julio_514 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'd like to know if doing the following is possible. If it is, some insight into how to proceed or where to look would be highly appreciated.

Basically, I have one 100 GB file of DNA sequences, with each sequence entry held on exactly 4 lines. Is it possible to launch multiple child processes (on a multicore machine) that simultaneously access different parts of the file, perform some calculation, and write to an output? I don't know how to specify which part of the file each process should work on.

I know this kind of work is feasible by splitting the file with, say, the split command and writing a Parallel::ForkManager loop, but I would like to save time by avoiding writing the new split files to disk.
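
For reference, the split-file version I have in mind looks roughly like this (a sketch only: the chunk names and the per-record calculation are placeholders for my real task):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @chunks = glob 'chunk.*';                  ## pieces written by e.g. `split -l 4000000`
    my $pm     = Parallel::ForkManager->new( 4 ); ## one worker per core

    for my $chunk ( @chunks ) {
        $pm->start and next;                      ## parent: spawn a child, move on
        open my $fh, '<', $chunk or die "$chunk: $!";
        while( !eof $fh ) {
            my @rec = map scalar <$fh>, 1 .. 4;   ## one 4-line sequence entry
            ## ... the calculation and output for this record would go here ...
        }
        close $fh;
        $pm->finish;                              ## child exits
    }
    $pm->wait_all_children;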

Thanks! Julio

Re: multiple processes to access one file
by BrowserUk (Patriarch) on Aug 21, 2012 at 07:55 UTC

    Here is a threaded version. It should be readily adaptable to forking if that is your thing:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;

    ## threadsafe output routines.
    $|++; ## Doesn't work without this!
    my $semStdout :shared;
    sub tprint{ lock $semStdout; print @_; }
    my $semStderr :shared;
    sub twarn{ lock $semStderr; print STDERR @_; }

    sub findNextRecStart {
        ## filehandle, calculated start byte, thread id (for tracing)
        my( $fh, $start, $tid ) = @_;
    #    twarn "[$tid] $start";

        ## seek to the start byte -1; just in case the calculated posn hits bang on
        seek $fh, $start-1, 0;

        ## Read a buffer full; we'd be really unlucky to not find a record start in 4k,
        ## but you could increase this to say 64k if it fails.
        read( $fh, my $buffer, 4096 );

        ## Search for a full record that doesn't have @/+ as the first char in the 2nd line
        $buffer =~ m[\n(\@)(?:[^@+][^\n]+\n){2}\+] or die "Couldn't locate record start";

        ## Remember the offset into the buffer where we found it.
        my $startFound = $-[1];

        ## Now count the lines between the start of the buffer and that point.
        my $previousLines = substr( $buffer, 0, $startFound ) =~ tr[\n][\n];

        ## And calculate our way back to the first full record after the calculated start posn.
        my $skipLines = ( $previousLines - 1 ) % 4 + 1;
    #    twarn "[$tid] $skipLines";

        ## Seek back to that calculated start posn.
        seek $fh, $start, 0;

        ## Then skip forward the calculated number of lines.
        scalar <$fh> for 1 .. $skipLines;
    #    twarn "[$tid] ", tell $fh;
        return;
    }

    sub worker {
        my $tid = threads->tid;

        ## the name of the file, the byte offsets for the thread
        ## to start and end processing at
        my( $file, $start, $end ) = @_;

        open my $FASTQ, '<', $file or die $!;

        ## If a non-zero start posn, find the start of the next full record.
        findNextRecStart( $FASTQ, $start, $tid ) if $start;

        ## process records until the end of this thread's section.
        while( tell( $FASTQ ) < $end ) {
            my @lines = map scalar( <$FASTQ> ), 1 .. 4;
            chomp @lines;

            ## process this record ( in @lines[ 0 .. 3 ] ) here...
            tprint "[$tid] $lines[0]";
        }
    }

    ## Grab the size of the file
    my $size = -s $ARGV[0] or die "$! : $ARGV[ 0 ]";

    ## Calculate each thread's start posn
    my $quarter = int( $size / 4 );
    my @starts = map $quarter * $_, 0 .. 3;
    push @starts, $size;

    ## Start 4 threads and wait for them to end.
    $_->join for map{
        async( \&worker, $ARGV[ 0 ], @starts[ $_, $_+1 ] )
    } 0 .. 3;
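
    If you prefer forking, here is a rough (untested) sketch of the same division of labour: each child opens its own handle on the file and works its own byte range, and inside worker() you'd use $$ rather than threads->tid for an id:

    ## Untested fork sketch: same @starts byte ranges, one child per range.
    my @pids;
    for my $i ( 0 .. 3 ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if( $pid == 0 ) {                             ## child: process one range, then exit
            worker( $ARGV[ 0 ], @starts[ $i, $i+1 ] );
            exit 0;
        }
        push @pids, $pid;                             ## parent: remember the child
    }
    waitpid $_, 0 for @pids;                          ## reap all children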

    And a test. Generate a fastq file with a mix of good and dodgy records (every second record is clean!):

    perl -E"say qq[\@record ${\($_-1)}\n\@pqrs\n+record ${\($_-1)}\n+pqrs\n\@record $_\nabcd\n+record $_\nefgh] for map{ $_*2-1 } 1 .. 25" > test.fastq

    C:\test>head test.fastq
    @record 0
    @pqrs
    +record 0
    +pqrs
    @record 1
    abcd
    +record 1
    efgh
    @record 2
    @pqrs

    And process that file using 4 threads, outputting just the first line of each record:

    C:\test>988536 test.fastq
    [1] @record 0
    [1] @record 1
    [1] @record 2
    [1] @record 3
    [1] @record 4
    [2] @record 13
    [2] @record 14
    [1] @record 5
    [1] @record 6
    [1] @record 7
    [1] @record 8
    [2] @record 15
    [2] @record 16
    [1] @record 9
    [4] @record 38
    [3] @record 26
    [2] @record 17
    [1] @record 10
    [4] @record 39
    [3] @record 27
    [2] @record 18
    [1] @record 11
    [4] @record 40
    [3] @record 28
    [2] @record 19
    [1] @record 12
    [4] @record 41
    [3] @record 29
    [2] @record 20
    [4] @record 42
    [3] @record 30
    [2] @record 21
    [4] @record 43
    [3] @record 31
    [2] @record 22
    [4] @record 44
    [3] @record 32
    [2] @record 23
    [4] @record 45
    [3] @record 33
    [2] @record 24
    [4] @record 46
    [3] @record 34
    [2] @record 25
    [4] @record 47
    [3] @record 35
    [4] @record 48
    [4] @record 49
    [3] @record 36
    [3] @record 37

    If you check the record numbers, all 50 records (0 through 49) are processed correctly.

    Now run the program on a 400 MB fastq file:

    [ 8:52:25.06] C:\test>988536 sample.fastq | wc -l
    2500000

    [ 8:52:57.92] C:\test>

    2.5 million records in 32 seconds.



      Wow Thanks mate!! I'll give it a try and let you know how it goes. J
Re: multiple processes to access one file
by BrowserUk (Patriarch) on Aug 21, 2012 at 01:22 UTC
    DNA sequences, with each sequence entry held on exactly 4 lines.

    Are these FastQ files?



      yes they are.

        Having 4 processes (or as I would use: threads) reading from different parts of the same file is not a problem.

        The problems come entirely from the hideous format design of FastQ files.

        As Wikipedia puts it:

        it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

        Dividing the file size into 4 and having each thread/process seek into the file to a different position is simple and fast.

        The problem is how to then skip forward from the calculated start position to locate the start of the nearest (next) 4-line record.

        The format specifies that the first character of the 1st line of each record is '@'; and the first character of the 3rd line is '+'; but dumbly, these marker characters can also appear as part of the quality information in the 2nd and 4th lines -- including as the first character of each of those lines.
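
        For example (a made-up record; quality strings encode scores as printable ASCII, so '@' and '+' are perfectly legal quality characters):

        @SEQ_1     <- line 1: identifier; first char is '@' by spec
        GATTACA    <- line 2: bases
        +          <- line 3: separator; first char is '+' by spec
        @IIIIII    <- line 4: quality string, which may ALSO begin with '@'

        A scan for a line beginning with '@' can therefore land on line 4 of one record just as easily as on line 1 of the next.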

        That makes leaping into the file and finding the start of a record surprisingly difficult, with the only simple alternative being to read forward in groups of 4 lines from the start of the file, which kinda defeats the purpose.

        Theoretically, reading forward until you have 4 consecutive lines where the 1st & 3rd start with '@' & '+' respectively, and the other two do not, should (I think) establish a datum point from which each thread/process can calculate an appropriate starting point to advance from. I'll get back to you once I've convinced myself of that.
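
        A rough, untested sketch of that idea (assuming $fh has already been seeked to the calculated byte position):

        ## Discard the (possibly partial) line we landed in, then slide a
        ## 4-line window forward until it matches the unambiguous record shape.
        scalar <$fh>;
        my @w;
        while( my $line = <$fh> ) {
            push @w, $line;
            shift @w while @w > 4;
            next unless @w == 4;
            last if $w[0] =~ /^\@/      ## 1st line starts with '@'
                and $w[1] !~ /^[\@+]/   ## 2nd line starts with neither marker
                and $w[2] =~ /^\+/      ## 3rd line starts with '+'
                and $w[3] !~ /^[\@+]/;  ## and neither does the 4th
        }
        ## @w now holds one full record; every 4 lines from here starts another.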

