onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:
I have a tab-separated, two-column input file with clustering information: column 1 contains the ID of the cluster representative, and column 2 contains the ID of a cluster member.
Put differently, if there are 1000 elements clustered into 50 clusters, my input file will have 1000 lines, each with the ID of a cluster member in column 2 and the ID of its cluster representative in column 1.
Therefore, the first line for each cluster will necessarily contain 2 identical columns, i.e. the cluster representative and the cluster member are identical.
If a cluster has more than one member, then in the following row(s) column 1 still contains the same cluster representative ID, but column 2 contains the ID of a different cluster member.
Please see example below:
Osat_a	Osat_a   # just one cluster member
Atha_b	Atha_b   # >1 cluster member, this & next line = 2 members
Atha_b	Mtru_c
Fves_d	Fves_d   # this & next 2 lines = 3 cluster members
Fves_d	Osat_e
Fves_d	Atha_f
Atha_g	Atha_g   # just 1 cluster member
Osat_h	Osat_h
Osat_h	Atha_i
Mtru_j	Mtru_j   # just 1 cluster member
The input file is very large (~20GB), which is much more than my machine's RAM. I suppose one way to process such a large input is to break the file into pieces that can each be held in RAM, right? The other way, which I am hoping to get help with here, is to process the input straight away while reading it from the file handle, without writing to some large hash or array that crashes my machine! Usually I save the input to a hash or array, so processing while reading in lines would be new to me, hence this request for help.
The output I need to generate from this input should be as follows:
Osat_a
Atha_b, Mtru_c
Fves_d, Osat_e, Atha_f
Atha_g
Osat_h, Atha_i
Mtru_j
Thanks, in advance, for your algorithm advice
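
For illustration, here is a minimal sketch of the "process while reading" idea described above. It is not taken from any reply in this thread. It assumes that all rows of a cluster are contiguous (as in the example), that the real file is strictly tab-separated, and that the `#` annotations shown in the example are not present in the actual data. The sub name `collapse_clusters` is a hypothetical helper. Only the members of the cluster currently being read are held in memory, so the 20GB file size should not matter as long as no single cluster is enormous.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a two-column,
# tab-separated cluster file line by line and print one line per cluster.
# Assumption: rows belonging to the same cluster are adjacent in the input.
sub collapse_clusters {
    my ($in, $out) = @_;
    my $current;    # representative ID of the cluster being read
    my @members;    # member IDs of that one cluster only

    while (my $line = <$in>) {
        chomp $line;
        my ($rep, $member) = split /\t/, $line, 2;
        next unless defined $member;    # skip blank or malformed lines

        # A new representative means the previous cluster is complete,
        # so print it and release its memory before continuing.
        if (defined $current && $rep ne $current) {
            print {$out} join(', ', @members), "\n";
            @members = ();
        }
        $current = $rep;
        push @members, $member;
    }

    # Print the final cluster, which has no following line to trigger it.
    print {$out} join(', ', @members), "\n" if @members;
    return;
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    collapse_clusters($in, \*STDOUT);
    close $in;
}
```

Saved under a name of your choosing (say `collapse.pl`, also just an assumption), it could be run as `perl collapse.pl clusters.tsv > collapsed.txt`. If the clusters were not contiguous in the file, this approach would print the same representative's members on several lines, and the input would need to be sorted on column 1 first.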
Re: Processing while reading in input
by tybalt89 (Monsignor) on Sep 20, 2018 at 00:34 UTC
by onlyIDleft (Scribe) on Sep 20, 2018 at 02:13 UTC
by AnomalousMonk (Archbishop) on Sep 20, 2018 at 03:47 UTC

Re: Processing while reading in input
by AnomalousMonk (Archbishop) on Sep 20, 2018 at 00:29 UTC
by onlyIDleft (Scribe) on Sep 20, 2018 at 02:37 UTC

Re: Processing while reading in input
by AnomalousMonk (Archbishop) on Sep 20, 2018 at 06:16 UTC

Re: Processing while reading in input
by LanX (Saint) on Sep 20, 2018 at 00:10 UTC