Having 4 processes (or as I would use: threads) reading from different parts of the same file is not a problem.
The problems come entirely from the hideous format design of FastQ files.
As Wikipedia puts it:
it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).
Dividing the file size by 4 and having each thread/process seek to a different position in the file is simple and fast.
The problem is how to then skip forward from the calculated start position to locate the start of the nearest (next) 4-line record.
The format specifies that the first character of the 1st line of each record is '@', and the first character of the 3rd line is '+'; but dumbly, these marker characters can also appear as part of the quality information in the 4th line, including as the first character of that line.
That makes leaping into the file and finding the start of a record surprisingly difficult, with the only simple alternative being to read forward in groups of 4 lines from the start of the file, which kinda defeats the purpose.
Theoretically, reading forward until you have 4 consecutive lines where the 1st & 3rd start with '@' & '+' respectively, and the other two do not, should (I think) establish a datum point from which each thread/process can work out its own starting position and advance. I'll get back to you once I've convinced myself of that.
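For what it's worth, here is a rough sketch of that resynchronisation idea. It is a variation rather than exactly the test described above: it assumes strictly 4-line records (no wrapped sequence lines) and that sequence lines never begin with '@' or '+', so it only checks the first three lines of the window and puts no condition on the quality line. The names next_record_start and chunk_starts are made up for illustration.

```perl
use strict;
use warnings;

# Return the byte offset of the first record header found after $pos,
# or nothing if EOF is reached first. Assumes strict 4-line records.
sub next_record_start {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "seek failed: $!";
    scalar readline($fh) if $pos > 0;   # discard the (probably partial) line we landed in

    my @win;                            # sliding window of [offset, line] pairs
    while (1) {
        my $off  = tell($fh);
        my $line = <$fh>;
        return unless defined $line;    # hit EOF without finding a record
        push @win, [ $off, $line ];
        next if @win < 3;
        # header, then a line that cannot be a marker line, then the '+' line
        return $win[0][0]
            if  $win[0][1] =~ /^\@/
            and $win[1][1] !~ /^[\@+]/
            and $win[2][1] =~ /^\+/;
        shift @win;
    }
}

# Split a file into up to $n chunks, each beginning on a record boundary.
sub chunk_starts {
    my ($path, $n) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size   = -s $fh;
    my @starts = (0);
    for my $i (1 .. $n - 1) {
        my $s = next_record_start($fh, int($size * $i / $n));
        push @starts, $s if defined $s and $s > $starts[-1];
    }
    close $fh;
    return @starts;
}
```

Each thread/process would then work from its own start offset up to the next one in the list (or to EOF for the last). Because every boundary comes from the same rule, no record should be read twice or skipped, even when a seek happens to land exactly on a header.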
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
How about Tie::File? From what I understand, it does not load the whole file into memory, so could it be suitable? We have a list of all possible regular expressions matching a sequence header for the fastqs that we are working with. Say for a set of 4 lines, you would have:
@BLABLA:1:2:3:2
ACGTACGT...
+
DDDEEHFG...
So all headers are: ^@\S+:\d+:\d+:\d+:\d+\n
But headers are not of a fixed length, so how would you seek into a file if you need the start and end byte positions of each record?
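One possible way around the variable-length headers, sketched here under assumptions rather than tested on real data: make a single sequential pass and note tell() before and after every 4-line record, which gives the start and end byte positions without knowing any lengths in advance. The helper names index_records and read_record are hypothetical, and the sketch assumes Unix line endings and headers that all match the pattern above.

```perl
use strict;
use warnings;

# One sequential pass: record [start, end) byte offsets of every 4-line record.
sub index_records {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @index;
    while (1) {
        my $start  = tell($fh);
        my $header = <$fh>;
        last unless defined $header;
        die "Unexpected header at byte $start\n"
            unless $header =~ /^\@\S+:\d+:\d+:\d+:\d+\n/;
        for my $n (2 .. 4) {            # sequence, '+' line, quality
            my $line = <$fh>;
            die "Truncated record at byte $start\n" unless defined $line;
        }
        push @index, [ $start, tell($fh) ];
    }
    close $fh;
    return \@index;
}

# Fetch one record by seeking straight to its recorded offsets.
sub read_record {
    my ($fh, $span) = @_;
    my ($start, $end) = @$span;
    seek($fh, $start, 0) or die "seek failed: $!";
    my $got = read($fh, my $buf, $end - $start);
    die "short read at byte $start\n"
        unless defined $got and $got == $end - $start;
    return $buf;
}
```

The index costs one full read of the file, but after that any record, or any contiguous run of records, can be reached with a single seek.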