I have this regular expression:
"(WORD1.*WORD2.*WORD3)|(WORD1.*WORD3.*WORD2)|(WORD2.*WORD1.*WORD3)|(WORD2.*WORD3.*WORD1)|(WORD3.*WORD1. *WORD2)|(WORD3.*WORD2.*WORD1)"
It matches these words:
WORD1WORD2WORD3 WORD1AWORD2BWORD3C WORD3WORD1WORD2 WORD1WORD2WORD3WORD1
But not these words:
WORD1WORD1WORD2 WORD1AWORD1BWORD2C
This regex matches when it finds a string containing 3 words in any order (WORD1
, WORD2
, WORD3
) .
I want to do the same thing with more words, but the problem is that the size of the regex grows exponentially with the number of words. Is it possible to simplify the way this regex is constructed to solve this problem (without growing exponentially in size)?
Simply iterate over all strings and filter out all strings that do not contain all keywords:
(A more concise version can be found in the code snippet below)
try it:
You can use positive lookahead for each word.
A more performant version below specifies the starting anchor and only matches a single character after validating the lookahead. As requested by the OP, this technique only works with
matching
, notextraction
.Forward lookahead is like a gate, it will only continue if the match specified within the brackets exists, but it will not consume or capture what it matches - it is always zero length. If you "look ahead" to see if there is
.*
before each word, the order of the words doesn't matter. If each word is true, proceed without using anything for matching. p>If you only care about whether the content matches, the only substantial difference between the two expressions is the time they take. Let’s say you only have 2 of the 3 required words in your content. Unless the software interpreting the expression can recognize that the attempt is futile, it might look for the three words "failed" in the first position, then try "failed" in the second position, and so on until it reaches the last position. give up. By specifying
^
, only the first position will be checked, saving time on other unnecessary checks. Removing the*
from the end can prevent some unnecessary catches when you are just looking for the true/false answer of whether all words are present in the content.