On the whole, I think we need more information. But here are some vague questions:
You have taken on an N^2 problem. Most of the work is probably going to be spent interating and comparing. As such, threading is not going to buy you that much. Is this is a multi-CPU box?
Is the list in memory in a raw or processed state? That is, does it need parsing/evaluating/decoding every time, or is that done once? What I'm trying to ask is: is the list in memory in an access-efficient structure? A little processing time up-front may make yield speedier access.
Perhaps you can eliminate half of the comparisons by somehow marking the data once you have done AB to prevent BA, as you suggested.
Silly question maybe, but is there a faster machine with more memory around to use? Or more CPUs?
Once your algorithm is good, PerlMonks will be able to improve the implementation of it significantly.