in reply to Re^4: Multi-thread combining the results together
in thread Multi-thread combining the results together
As a note, the tokens in @tokens are all unique. For each token, I want either the full token copied or nothing (a yes/no decision for each of the 80K tokens). A typical regex has 10-14 terms and produces a result set of about 6 matches from the 80K possibilities.
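For anyone following along, here is a minimal sketch of the shape of the per-token test. The token list and pattern below are placeholders, not the real data or regex:

```perl
use strict;
use warnings;

# Stand-in data: the real @tokens has ~80K unique entries, and the
# real pattern has 10-14 terms; these are placeholders.
my @tokens = qw(alpha beta gamma delta);
my $regex  = qr/alpha|gamma|omega/;

# Yes/no per token: a matching token is copied whole, otherwise skipped.
my @matches = grep { /$regex/ } @tokens;

print "kept: @matches\n";    # typically only a handful match
```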
If I can get maybe 3x from algorithm improvements and another 3x from parallelization, I would be in the <10 minute max run time range, which is "good enough". As it turns out in practice, not every possibility needs to be run, and when a token needs to be investigated further for "close matches", I cache the result. More than a decade ago, run time was 20 minutes max on a Win 95 machine. One of the "problems" with software that "works" is that it often winds up being applied to larger and larger data sets: the 80K terms are extracted from 3 million input lines, whereas 12 years ago this was only 200K input lines and a much smaller @tokens array!
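The caching itself is nothing fancy; a rough sketch of the idea, with a made-up routine name (investigate_close_matches is just a placeholder for the real work):

```perl
use strict;
use warnings;

# Sketch of the caching: %cache memoizes the expensive close-match
# investigation per token, so it is only done once per token.
my %cache;

sub close_match_cached {
    my ($token) = @_;
    $cache{$token} = investigate_close_matches($token)
        unless exists $cache{$token};
    return $cache{$token};
}

sub investigate_close_matches {
    my ($token) = @_;
    # ... the expensive comparison work would go here ...
    return 0;
}
```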
I appreciate all of the ideas in this thread! I have a lot of experimentation ahead of me.
Ultimately, I would like to develop an algorithm that builds some kind of tree structure that can be traversed much faster than any regex approach. I figure that will be non-trivial to accomplish.
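For illustration, a character-keyed hash-of-hashes trie is one possible shape for such a tree. The sketch below only answers exact membership questions, so it is not a drop-in replacement for the regexes, just the kind of structure I mean:

```perl
use strict;
use warnings;

# Illustration only: a hash-of-hashes trie keyed by character.
# As written it answers exact lookups; matching the actual
# multi-term patterns against it would take more work.
my %trie;

sub trie_add {
    my ($word) = @_;
    my $node = \%trie;
    $node = $node->{$_} //= {} for split //, $word;
    $node->{-end} = 1;                      # marks a complete token
}

sub trie_has {
    my ($word) = @_;
    my $node = \%trie;
    for my $ch ( split //, $word ) {
        $node = $node->{$ch} or return 0;   # dead end: no such token
    }
    return $node->{-end} ? 1 : 0;
}

trie_add($_) for qw(alpha alps beta);
print trie_has('alps') ? "found\n" : "missing\n";    # found
```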
Update: I tried the idea of running a multi-line, global match against a single string of \n-separated tokens instead of running the regex on each token individually. It produces the same result set, but it is significantly slower than the current code, so that idea is out. Next up: I will try the \b idea.
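For reference, the experiment looked roughly like this (placeholder tokens and pattern again): join the tokens with \n once, then run a single /m//g match over the whole string.

```perl
use strict;
use warnings;

# Roughly the experiment described above: one big \n-separated string
# and a single multi-line, global match instead of a loop over @tokens.
my @tokens = qw(alpha beta gamma delta);
my $blob   = join "\n", @tokens;

# Under /m, ^ and $ anchor at the \n boundaries, so each token is
# tested as a whole line; /g collects every hit in list context.
my @matches = $blob =~ /^(alpha|gamma|omega)$/mg;

print "kept: @matches\n";    # same result set, just slower in practice
```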
Re^6: Multi-thread combining the results together
by vr (Curate) on Jul 27, 2019 at 17:30 UTC