Generating a list of values a regex COULD match in Python

Question:

I’m trying to use a regex as an input, and from there generate all the possible values that the regex would match.

So, for example, if the regex is “three-letter words starting with a, and ending in c,” then the code would generate a list with the values [aac, abc, acc, adc, a1c….].

Is there an easy way to do this? I’m using python.

Asked By: mlissner

||

Answers:

You don’t want to do this. Most of the result sets will be huge, and some will be infinite. Instead use a sequence of test vectors and apply the regex against each in turn:

vectors = (
  'foo',
  'bar',
  ...
)

for result in (re.match(someregex, entry) for entry in vectors):
  ...

Some regular expressions match a finite number of input strings, but many (most?) match an infinite number of input strings. It’s kind of like asking ‘given the python language grammar, generate all possible python programs’. You probably could write a program to list them all in sequence if you tried (though it would take infinite time to run), but are you sure you want to? Why would you want to?

I’m pretty sure the regular expression engine in the standard library does not expose a way to generate the output you want. You’d have to get lower level access to the internal data structures, or implement some DFA engine thing-a-ma-bob yourself.

Answered By: Matt Anderson

Here’s a brute force solution that should work. It has a running time of O(L^max_length) (where L is the size of the alphabet), so use it at your own risk.

def all_matching_strings(alphabet, max_length, regex):
"""Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""

if max_length == 0: return 

L = len(alphabet)
for N in range(1, max_length+1):
    indices = [0]*N
    for z in xrange(L**N):
        r = ''.join(alphabet[i] for i in indices)
        if regex.match(r):                
           yield(r)

        i = 0
        indices[i] += 1
        while (i<N) and (indices[i]==L):
            indices[i] = 0
            i += 1
            if i<N: indices[i] += 1

return

example usage:

alphabet = 'abcdef1234567890'
import re
regex = re.compile('f*[1-3]+$')
for r in all_matching_strings(alphabet, 5, regex): 
    print r

which would output all strings up to length 5, starting with a sequence of f’s, and then a non empty sequence of 1-3, then ending:

1
2
3
f1
11
21
31
f2
12
22
32
f3
13
23
33
ff1
[more output omitted...]
Answered By: Ofri Raviv

The set of matching strings is infinite if and only if there is a quantifier (+ or *) in your regexp. Your question doesn’t seem to aim at those patterns. I rather believe that the product function from itertools might help here.

You might for instance introduce a special character indicating an arbitrary letter (e.g. an underscore), then build a pattern like this

patt = 'a_c'

and define your alphabet

youralphabet = 'abcde...'

and define a function generating all possible instances like this

def genInstances(patt):
    elems = [c if c != '_' else youralphabet for c in patt]
    return itertools.product(*elems)

You may then extend this approach to match real regexp by parsing your pattern for d or [a-zA-Z] or whatever.

Answered By: Johannes Charra

The xeger-function in the rstr-library does this: https://pypi.org/project/rstr/

Generating strings that match a given regular expression is possible for simple regular expressions (that is most regular expressions). This can be done much more efficiently than with the exponential-time brute-force approah of the accepted answer if the implementation makes use of the State-machine that the regular expression compiles to.

Efficiency will be a quality-of-implementation-issue. Many languages have a library for this feature so maybe check out other implementations like Javas Xeger or Javascripts randexp.js if speed is critical.

Answered By: julaine
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.