Re^2: Count number of occurrences of a list of words in a file

Hi,

Thanks for your reply!

I've edited the faulty variables out; I went too fast when making them more readable. It should be working code now.

While the other answers were good, this one is the best for my particular need: I needed to define quite precisely what should and should not match, so excluding everything else was not a good option, even if it's quicker.

Replies are listed 'Best First'.

Re^3: Count number of occurrences of a list of words in a file
by AnomalousMonk (Archbishop) on May 11, 2018 at 19:44 UTC

Thank you for the compliment, and I'm glad that my suggestion was helpful to you.

I'm curious about your ultimate solution. As pointed out by Veltro here, the methods used by Athanasius, Tux and myself for enumeration of potential words are essentially identical. The differences in approach are between the split/exclusion and regex/extraction (as I would characterize them) methods used for finding candidate "words." Were you able to define a $rx_word regex object that had relatively few false positives (and, of course, absolutely no false negatives)? If so, it would be interesting to know what this definition is if it isn't so specific to your application as to be meaningless to others, or too proprietary.
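To make the distinction concrete, here is a minimal sketch of the two ways of finding candidate words. The sample text, the sub names and, in particular, this `$rx_word` definition are assumptions made up for illustration; they are not anyone's actual code from this thread.

```perl
use strict;
use warnings;

# Hypothetical definition of a "word": runs of letters, optionally joined
# by single internal apostrophes or hyphens (an assumption for this sketch).
my $rx_word = qr/[A-Za-z]+(?:['-][A-Za-z]+)*/;

# split/exclusion: throw away everything that cannot be part of a word
# and keep whatever is left over.
sub words_by_split {
    my ($text) = @_;
    return grep { length $_ } split /[^A-Za-z'-]+/, $text;
}

# regex/extraction: pull out only what positively matches $rx_word.
sub words_by_regex {
    my ($text) = @_;
    return $text =~ /$rx_word/g;
}

my $sample = "Don't stop -- 'well-known' facts.";
print join('|', words_by_split($sample)), "\n";  # Don't|stop|--|'well-known'|facts
print join('|', words_by_regex($sample)), "\n";  # Don't|stop|well-known|facts
```

The split version lets `--` and the quoted `'well-known'` through as candidates (false positives), while the extraction version only returns what the pattern explicitly allows.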

It would be of even greater interest to me if you were able to get the Building Regex Alternations Dynamically approach working and if it is advantageous in terms of speed. As I mentioned in my reply (now with more updates!), my expectation was that a 60K word list was too big to be encompassed by a regex alternation; I no longer believe this. If you were able to use this technique and it proved beneficial, I'd like to hear about it!
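For anyone following along, a minimal sketch of what I mean by building the alternation dynamically is below. The word list, the sample text and the sub names `build_word_regex` and `count_words` are hypothetical; a real list would be read from a file, and the `\b` anchors and `/i` flag are assumptions that may not suit every definition of a word.

```perl
use strict;
use warnings;

# Build one compiled regex from a list of literal words.
# quotemeta makes any metacharacters in a word match literally; sorting
# longest first is a precaution in case the \b anchors are later changed.
sub build_word_regex {
    my @words = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b }
        @words;
    return qr/\b(?:$alternation)\b/i;
}

# Count case-insensitive occurrences of the listed words in a string.
sub count_words {
    my ($rx, $text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /($rx)/g;
    return \%count;
}

my $rx    = build_word_regex(qw(cat cats dog catalog));
my $count = count_words($rx, 'The cat chased cats; the dog read a catalog. Cat!');
printf "%-8s %d\n", $_, $count->{$_} for sort keys %$count;
```

As far as I know, Perl 5.10 and later can compile an alternation of plain literals into a trie, which is why a large list may be less of a problem than I first expected. Whether it is actually faster for a given list and input is something only a benchmark can answer.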

Give a man a fish: <%-{-{-{-<
