It seems that you have genome-wide methylation data
measured on Illumina BeadChips, and that your file was
exported as a GenomeStudio final report. Is that correct?
If so, I'd recommend considering the following
points:
- You may be better off starting not from the report
file but from the IDAT files (the original
files that went into the GenomeStudio project) and
processing them with the appropriate R packages
(methylumi, minfi, etc.) from the Bioconductor
repository.
- Of course this requires sufficient memory, so
the machine should have at least 8 GB of RAM available
(better 16 GB).
- Some packages (methylumi, for example) claim to be
able to load GenomeStudio reports directly, but I
can't vouch for that.
- If you still want to split the file using Perl,
the script should work for any number of columns per
individual. Avoid hard coding this number in the
script: you may discover that some information is
missing and have to regenerate the report with
different settings and a different number of columns
per subject.
- Unfortunately many available programs for data
processing (in particular some R functions in several
packages) do not respect locale (i18n) settings and
simply assume "." as the decimal separator, or they
handle the encoding inconsistently. To ensure
data integrity it seems best to change your language
settings to English and also to change every ","
in the data file to ".". So one of the first things to
do for each line after reading it in would be "tr/,/./;",
if "," is your regular decimal separator, as in the
data file you provided. (This is only safe as long as
"," is not also used as the field separator.)
- Provided enough file handles are available, it is
possible to generate all output files in one pass, just
as poj's solution assumes. On newer Windows platforms,
for example, this number is practically unlimited
(nominally about 16 million per process). If
that is not the case you'd have to do this in several
passes.
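
To make the points about column counts, decimal commas and
one-pass output a bit more concrete, here is a minimal sketch.
It is not poj's script and it makes assumptions about your
report that I cannot check: the file is tab-delimited, its
first line is the column header (any preamble has been
removed), column 1 is the probe ID, and every other column is
named "<sample>.<measure>". The sub name split_report and the
output naming "<sample>.txt" are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a tab-delimited report into one file per sample in a single pass.
# Assumed layout (adjust to your report): column 1 is the probe ID, all
# other columns are named "<sample>.<measure>".
sub split_report {
    my ($in_file, $out_dir) = @_;
    open my $in, '<', $in_file or die "Cannot read $in_file: $!";

    my $header = <$in>;
    defined $header or die "$in_file is empty";
    $header =~ s/\r?\n\z//;
    my @names = split /\t/, $header, -1;

    # Group column indices by sample name (text before the last dot)
    # instead of hard coding the number of columns per subject.
    my (%cols, %measures, @samples);
    for my $i (1 .. $#names) {
        my ($sample, $measure) = $names[$i] =~ /^(.+)\.([^.]+)$/
            or die "Unexpected column name '$names[$i]'";
        push @samples, $sample unless $cols{$sample};
        push @{ $cols{$sample} },     $i;
        push @{ $measures{$sample} }, $measure;
    }

    # One output handle per sample; this is the step that needs
    # enough file handles to be available.
    my %out;
    for my $sample (@samples) {
        (my $safe = $sample) =~ s/[^\w.-]/_/g;    # file-name-safe copy
        open $out{$sample}, '>', "$out_dir/$safe.txt"
            or die "Cannot write $out_dir/$safe.txt: $!";
        print { $out{$sample} }
            join("\t", $names[0], @{ $measures{$sample} }), "\n";
    }

    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;
        next unless length $line;
        $line =~ tr/,/./;    # decimal comma -> decimal point
        my @f = split /\t/, $line, -1;
        for my $sample (@samples) {
            # Probe ID plus this sample's columns; missing fields become ''.
            print { $out{$sample} }
                join("\t", map { $_ // '' } @f[0, @{ $cols{$sample} }]), "\n";
        }
    }

    for my $sample (@samples) {
        close $out{$sample} or die "Cannot close output for $sample: $!";
    }
    close $in;
    return @samples;
}

# Usage: perl split_report.pl FinalReport.txt outdir
split_report(@ARGV) if @ARGV == 2;
```

Because the columns are grouped by name, the same script keeps
working if you regenerate the report with more or fewer
columns per subject. If you do run out of file handles, the
list of samples can be processed in batches, re-reading the
input once per batch.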
Does poj's suggestion suit your needs?