note
Marshall
I do like this general approach; however, the OP is talking about a sizable file of 500 MB. Depending on the data, of course, your HoH (hash of hashes) structure could consume considerably more memory than the file's size on disk. <p>
<c>
{
head1 => { val1 => 2, val2 => 1, val3 => 1, val6 => 1 },
head2 => { val2 => 1, val4 => 2, val7 => 2 },
head3 => { val3 => 2, val5 => 3 },
}
</c>
I came up with a representation (at [https://www.perlmonks.org/?node_id=11140228 | this post]) in which each column value occurs only once as a hash key. The value stored under each key is an array of signed column numbers: a column is left out if the value never appears in it, listed as a positive number if the value appears exactly once there, and listed as a negative number if the value appears more than once there. <p>
We interpreted "unique" to mean different things.<br>
I see you think that means: "don't repeat yourself after having said something once".<br>
I thought it meant: "don't say anything at all if you would repeat yourself".<p>
My data structure:
<c>
{
val1 => [-1],    # val1 occurs more than once in col 1
val2 => [2, 1],  # val2 occurs once in col 2 and once in col 1
val3 => [-3, 1], # val3 occurs more than once in col 3,
                 # but only once in col 1
val4 => [-2],
val5 => [-3],
val6 => [1],
val7 => [-2],    # val7 occurs at least twice in col 2
}
</c>
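For illustration, here is a minimal sketch of how a structure like this could be built in a single pass over the rows. It is not necessarily the code in the linked post: the sample <c>@rows</c> and the name <c>%seen</c> are made up for this example, chosen so that the column contents match the counts in your HoH above. A real run would read the 500 MB file line by line instead of holding the rows in an array. <p>
<c>
use strict;
use warnings;

# Hypothetical sample data: 5 rows, 3 columns.
my @rows = (
    [qw(val1 val2 val3)],
    [qw(val1 val4 val3)],
    [qw(val2 val4 val5)],
    [qw(val3 val7 val5)],
    [qw(val6 val7 val5)],
);

# value => [ signed column numbers ]
#   +N : value seen exactly once in column N (so far)
#   -N : value seen more than once in column N
my %seen;
for my $row (@rows) {
    for my $i (0 .. $#$row) {
        my $col  = $i + 1;                       # columns are 1-based
        my $list = $seen{ $row->[$i] } //= [];   # needs Perl 5.10+

        # Index of this column's entry, if the value was seen there before.
        my ($slot) = grep { abs($list->[$_]) == $col } 0 .. $#$list;

        if (!defined $slot) {
            push @$list, $col;                   # first sighting in this column
        }
        elsif ($list->[$slot] > 0) {
            $list->[$slot] = -$col;              # second sighting: flip the sign
        }
        # Already negative: nothing more to record.
    }
}

for my $val (sort keys %seen) {
    print "$val => [", join(', ', @{ $seen{$val} }), "]\n";
}
</c>
With this sample data the loop reproduces the structure shown above, e.g. <c>val2 => [2, 1]</c> and <c>val3 => [-3, 1]</c>. Each distinct value is stored once as a key, and each array holds at most one small integer per column, which is where the memory saving over per-column counts would come from. <p>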
Of course, I could generate the same output as yours from my data structure, because it records the columns in which each value appeared more than once.
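<p>
As a rough sketch of that last point, the code below derives both readings of "unique" from the structure. It assumes the output in question is a per-column list of values; the names <c>%distinct</c> and <c>%singletons</c> are invented for this example. <p>
<c>
use strict;
use warnings;

# The structure from above: value => [ signed column numbers ].
my %seen = (
    val1 => [-1],
    val2 => [2, 1],
    val3 => [-3, 1],
    val4 => [-2],
    val5 => [-3],
    val6 => [1],
    val7 => [-2],
);

my (%distinct, %singletons);
for my $val (sort keys %seen) {
    for my $entry (@{ $seen{$val} }) {
        my $col = abs $entry;
        # Your reading: every value that occurs in the column, listed once.
        push @{ $distinct{$col} }, $val;
        # My reading: only values that occur exactly once in the column.
        push @{ $singletons{$col} }, $val if $entry > 0;
    }
}

for my $col (1 .. 3) {
    print "col $col distinct:   @{ $distinct{$col}   // [] }\n";
    print "col $col singletons: @{ $singletons{$col} // [] }\n";
}
</c>
For column 1 this lists val1, val2, val3 and val6 as distinct values, and val2, val3 and val6 as values that occur only once. What the structure cannot give back is the exact repeat count: it only records "more than once".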
11140211
11140230