Having 4 processes (or, as I would use, threads) reading from different parts of the same file is not a problem.
The problems come entirely from the hideous format design of FastQ files.
As Wikipedia puts it:
"... it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string)."
Dividing the file size by 4 and having each thread/process seek to a different position in the file is simple and fast.
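A minimal sketch of that division, in Python purely for illustration. The function name chunk_offsets and its parameters are my own invention, not part of any library; it returns raw byte offsets, which will usually land mid-line and so still need adjusting to a record boundary.

```python
def chunk_offsets(size, parts=4):
    """Return `parts` evenly spaced byte offsets into a file of `size` bytes.

    These are raw seek targets only; each one will usually fall in the
    middle of a line, not at the start of a record.
    """
    return [size * i // parts for i in range(parts)]


# Each worker would open its own handle (binary mode) and seek to its offset.
print(chunk_offsets(1000))  # [0, 250, 500, 750]
```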
The problem is how to then skip forward from the calculated start position to locate the start of the nearest (next) 4-line record.
The format specifies that the first character of the 1st line of each record is '@', and the first character of the 3rd line is '+'; but dumbly, these marker characters can also appear in the quality string on the 4th line, including as its first character.
That makes leaping into the file and finding the start of a record surprisingly difficult, with the only simple alternative being to read forward in groups of 4 lines from the start of the file, which kinda defeats the purpose.
Theoretically, reading forward until you have 4 consecutive lines where the 1st & 3rd start with '@' & '+' respectively, and the other two do not, should (I think) establish a datum point from which an appropriate starting point can be calculated for each thread/process to advance from. I'll get back to you once I've convinced myself of that.
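Here is a rough sketch of that four-line test, again in Python for illustration only. It assumes unwrapped records of exactly 4 lines, a sequence line that never starts with '@' or '+', and a file opened in binary mode. The name find_record_start is hypothetical. Note that the test deliberately passes over valid records whose quality line happens to begin with a marker character, so the offset it returns is a safe datum, not necessarily the nearest record.

```python
import io


def find_record_start(f, pos):
    """Return the byte offset of the first record at or after `pos` that
    passes the four-line test, or None if no such record is found."""
    if pos > 0:
        # Step back one byte and discard through the next newline, so a
        # `pos` that already sits on a line boundary is not skipped.
        f.seek(pos - 1)
        f.readline()
    else:
        f.seek(0)
    window = []  # sliding window of up to four (offset, line) pairs
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            return None  # hit end of file without a match
        window.append((offset, line))
        if len(window) > 4:
            window.pop(0)
        if len(window) == 4:
            a, b, c, d = (ln for _, ln in window)
            if (a[:1] == b"@" and c[:1] == b"+"
                    and b[:1] not in (b"@", b"+")
                    and d[:1] not in (b"@", b"+")):
                return window[0][0]


# Made-up sample: the first record's quality line starts with '@'.
sample = io.BytesIO(b"@r1\nACGT\n+\n@III\n"
                    b"@r2\nTTGA\n+\nIIII\n")
print(find_record_start(sample, 5))  # 16, the offset of "@r2"
```

If this holds up, one way to use it would be for the first worker to start at offset 0 and every other worker to start at the datum found from its own seek position, with each worker stopping when it reaches the next worker's datum.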
In reply to Re^3: multiple processes to access one file
by BrowserUk
in thread multiple processes to access one file
by julio_514