A simplification of the brute-force method would be to split each URL on \W and use a hash to count the frequencies of the resulting "words". That avoids generating the complete list of substrings for every URL, which is a huge list even for a single URL of significant length: a string of length n has n(n+1)/2 non-empty substrings.
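For illustration, here is a minimal sketch of the split-and-count idea in Python, with a Counter standing in for the hash. The function name count_tokens and the example.com URLs are made up for the example, and splitting on runs of \W (rather than a single \W) with empty strings dropped is my assumption about the intent.

```python
import re
from collections import Counter

def count_tokens(urls):
    """Count how often each \\w+ token appears across all URLs (hypothetical helper)."""
    counts = Counter()
    for url in urls:
        # Split on runs of non-word characters; drop the empty strings that
        # appear when a URL starts or ends with a non-word character.
        tokens = [t for t in re.split(r"\W+", url) if t]
        counts.update(tokens)
    return counts

# Made-up sample data, only to show the shape of the result.
urls = [
    "http://example.com/foo/bar?id=1",
    "http://example.com/foo/baz?id=2",
]
counts = count_tokens(urls)
for token, n in counts.most_common():
    print(n, token)
```

Note that \w includes the underscore, so "foo_bar" stays one token; whether that is what you want depends on your URLs.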
-QM
--
Quantum Mechanics: The dreams stuff is made of