Thanks @przemo, I'll check out Bit::Vector. We think there are only about 400,000 unique characteristic sets among the 8 million patents, so we'll probably end up working with that smaller set of unique characteristic sets instead, given the absurd amount of data this program would return if run over all the patents.