in reply to How would you extract *content* from websites?
Here's the node I was talking about:
Imploding URLs
The connection may not be readily apparent, but the problem is essentially the same, only on a much larger scale. You can probably speed up the comparisons by storing all the distinct words in an array and then replacing each word in a page with its subscript in that array. You should only need two bytes per word, as long as there are no more than 65,536 distinct words. You can also speed things up by doing detailed comparisons only between pages that haven't had their common material determined yet. If page A and page B have common material x, and page C also has all of x, then you can be pretty sure that C doesn't need to be checked any further. And you can speed things up by starting comparisons with the pages closest in location to the current page: those that differ only in query string first, then in page name, then by a single folder, and so on.
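Here's a rough sketch of two of those ideas, in Python: encoding words as array subscripts, and ranking candidate pages by how close their URLs are. The function names (`encode_pages`, `url_distance`), the whitespace tokenizing, and the exact distance scoring are my own assumptions for illustration, not anything from the Imploding URLs node.

```python
from array import array
from urllib.parse import urlsplit


def encode_pages(pages):
    """Replace every word with its subscript in a shared vocabulary list."""
    vocab = []      # subscript -> word
    index_of = {}   # word -> subscript
    encoded = []
    for text in pages:
        # "H" is an unsigned short: two bytes on common platforms, so this
        # raises OverflowError once there are more than 65,536 distinct words.
        ids = array("H")
        for word in text.split():  # naive whitespace tokenizing (assumption)
            if word not in index_of:
                index_of[word] = len(vocab)
                vocab.append(word)
            ids.append(index_of[word])
        encoded.append(ids)
    return vocab, encoded


def url_distance(a, b):
    """Rank closeness: 0 same URL, 1 query differs, 2 page name differs,
    3 or more when folders differ. Different hosts are infinitely far."""
    ua, ub = urlsplit(a), urlsplit(b)
    if ua.netloc != ub.netloc:
        return float("inf")
    if ua.path == ub.path:
        return 0 if ua.query == ub.query else 1
    dirs_a = ua.path.split("/")[:-1]  # folder components, page name dropped
    dirs_b = ub.path.split("/")[:-1]
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    # Folders to climb out of plus folders to descend into.
    hops = (len(dirs_a) - common) + (len(dirs_b) - common)
    return 2 + hops


def nearest_first(current, candidates):
    """Order candidate URLs so the closest ones get compared first."""
    return sorted(candidates, key=lambda url: url_distance(current, url))
```

Comparing two encoded pages is then a comparison of small integers instead of strings, and `nearest_first(current, candidates)` gives the order in which to try the detailed comparisons. The example.com URLs you would feed it are placeholders, of course.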