Having 4 processes (or as I would use: threads) reading from different parts of the same file is not a problem.
The problems come entirely from the hideous format design of FastQ files.
As Wikipedia puts it:
it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).
Dividing the file size by 4 and having each thread/process seek to a different position in the file is simple and fast.
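To make the "divide and seek" step concrete, here is a minimal sketch in Python (chosen only for illustration). The names `chunk_ranges` and `chunk_ranges_for_file` are hypothetical, and the ranges are nominal byte offsets only: they will usually land mid-line and mid-record.

```python
import os

def chunk_ranges(size, n_workers=4):
    """Split `size` bytes into n_workers contiguous (start, end) byte ranges.

    The last range absorbs any remainder from the integer division.
    """
    step = size // n_workers
    starts = [i * step for i in range(n_workers)]
    ends = starts[1:] + [size]
    return list(zip(starts, ends))

def chunk_ranges_for_file(path, n_workers=4):
    """Same as chunk_ranges, but takes the size from a file on disk."""
    return chunk_ranges(os.path.getsize(path), n_workers)
```

Each worker would open its own file handle and `seek()` to its `start`. Nothing here knows about record boundaries yet.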
The problem is how to then skip forward from the calculated start position to locate the start of the nearest (next) 4-line record.
The format specifies that the first character of the 1st line of each record is '@', and the first character of the 3rd line is '+'; but dumbly, these marker characters can also appear in the quality information on the 4th line, including as its first character.
That makes leaping into the file and finding the start of a record surprisingly difficult, with the only simple alternative being to read forward in groups of 4 lines from the start of the file, which kinda defeats the purpose.
Theoretically, reading forward until you have 4 consecutive lines where the 1st and 3rd start with '@' and '+' respectively, and the other two do not, should (I think) establish a datum point from which each thread/process can calculate an appropriate starting position and advance. I'll get back to you once I've convinced myself of that.
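A rough sketch of that test, in Python for illustration only. It is not a proven method, and it assumes strictly 4-line records (no wrapped sequence lines), sequence lines that never start with '@' or '+', and "\n" line endings. The names `is_aligned`, `find_record_start` and `SAMPLE` are made up for this example.

```python
import io

# Hypothetical sample: the quality line of the 2nd record starts with '@'.
SAMPLE = (b"@r1\nACGT\n+\nIIII\n"
          b"@r2\nGGCC\n+\n@II+\n"
          b"@r3\nTTAA\n+\nIIII\n")

def is_aligned(lines):
    """True if 4 consecutive lines pass the test: the 1st starts with '@',
    the 3rd with '+', and the 2nd and 4th start with neither."""
    return (len(lines) == 4
            and lines[0].startswith(b"@")
            and lines[2].startswith(b"+")
            and not lines[1].startswith((b"@", b"+"))
            and not lines[3].startswith((b"@", b"+")))

def find_record_start(fh, pos):
    """Return the offset of the first 4-line window at or after `pos`
    that passes is_aligned, or None if none is found before EOF.

    `fh` must be a seekable binary file object.
    """
    if pos > 0:
        fh.seek(pos - 1)
        if fh.read(1) != b"\n":
            fh.readline()          # pos is mid-line: discard the partial line
    else:
        fh.seek(0)
    window = []                    # sliding window of (offset, line) pairs
    while True:
        offset = fh.tell()
        line = fh.readline()
        if not line:
            return None            # hit EOF without finding a datum point
        window.append((offset, line))
        if len(window) > 4:
            window.pop(0)
        if is_aligned([text for _, text in window]):
            return window[0][0]
```

Note that the test is deliberately conservative: a genuine record whose quality line starts with '@' or '+' is passed over rather than accepted. So each worker would have to read up to the next worker's resolved start (not its nominal chunk end), or the skipped records would be lost; the worker for the first chunk can simply start at offset 0.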