I have a tab-separated, 2-column input file with clustering information, where column 1 contains the ID of the cluster representative and column 2 contains the ID of the cluster member.
Put differently, if there are 1000 elements clustered into 50 clusters, my input file will have 1000 lines, with the ID of the cluster member in column 2 and the ID of the cluster representative in column 1.
Therefore, the 1st line for each cluster will necessarily contain 2 identical columns, i.e. the cluster representative and the cluster member are identical.
If a cluster has more than one member, then in the next row(s) column 1 still contains the same cluster representative ID, but column 2 contains the ID of a different cluster member.
Please see example below:
Osat_a	Osat_a	# just one cluster member
Atha_b	Atha_b	# >1 cluster member, this & next line = 2 members
Atha_b	Mtru_c
Fves_d	Fves_d	# this & next 2 lines = 3 cluster members
Fves_d	Osat_e
Fves_d	Atha_f
Atha_g	Atha_g	# just 1 cluster member
Osat_h	Osat_h
Osat_h	Atha_i
Mtru_j	Mtru_j	# just 1 cluster member
The input file is very large (~20 GB), which is much more than my machine's RAM. I suppose one way to process such a large input is to break the file into pieces that can be held in RAM, right? The other way, which I am hoping to get help with here, is to process the input straight away while reading it from the file handle, without writing it to some large hash or array that crashes my machine! I usually save the input to a hash or array first, so processing while reading in lines would be new to me, hence this request for help.
The output I need to generate from this input should be as follows:
Osat_a
Atha_b, Mtru_c
Fves_d, Osat_e, Atha_f
Atha_g
Osat_h, Atha_i
Mtru_j
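For illustration, here is a minimal sketch of the "process while reading" idea, written in Python rather than as a definitive solution. It assumes that all lines of a cluster are contiguous in the file (as described above) and that the first two tab-separated fields are the representative and the member. The function name `stream_clusters` and the sample data are hypothetical. Only the members of the current cluster are held in memory at any time.

```python
import io


def stream_clusters(lines):
    """Yield one comma-separated output line per cluster.

    Assumes rows of the same cluster are adjacent, so only the
    current cluster's members need to be kept in memory.
    """
    current_rep = None
    members = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        rep, member = fields[0], fields[1]
        if rep != current_rep:
            # A new representative starts a new cluster: flush the old one.
            if members:
                yield ", ".join(members)
            current_rep = rep
            members = []
        members.append(member)
    # Flush the last cluster after the input is exhausted.
    if members:
        yield ", ".join(members)


# Small in-memory sample standing in for the real file (hypothetical data).
sample = "Osat_a\tOsat_a\nAtha_b\tAtha_b\nAtha_b\tMtru_c\n"
for out in stream_clusters(io.StringIO(sample)):
    print(out)
# prints:
# Osat_a
# Atha_b, Mtru_c
```

With a real file, the same function could be fed an open file handle, since iterating over a file handle reads one line at a time instead of loading the whole file.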
Thanks in advance for your algorithm advice.
In reply to Processing while reading in input by onlyIDleft