Enumerate all matches of a regex

Question 1

For this challenge, we'll be using a simplified dialect of regular expressions, where:

A lowercase letter from a to z matches itself.
(expr1|expr2) matches either of expr1 and expr2. You will not have to handle more than two operands, and you can assume there will always be parentheses around this. Instead of (a|b|c) you'll recieve ((a|b)|c).
(expr)* matches zero or more copies of expr. You can assume there will always be parentheses around the expression, i.e. you'll recieve (a)* and not a*.
(expr)? matches either nothing or expr. You can assume there will always be parentheses around the expression, i.e. you'll recieve (a)? and not a?.
expr1expr2 matches expr1 followed by expr2. If taking an AST (see below) you may assume concatenation will always have exactly two arguments i.e. ["concat", ["concat",["a","b"]],"c"] to represent abc rather than ["concat",["a","b","c"]].

You may instead take the expression as some sort of AST. For example, this would be an acceptable format for the regex (be)*((a)?|((c|d))*), but you can use any reasonable format as long as it doesn't contain any information not present in the original regex. You may assume the expression and all subexpressions are nonempty - ()*, (a|), ()? and similar will never occur in the input. You can also assume there will be no redundant parentheses - ((a)), (a)b and similar will never occur.

Your challenge is to, given a simplified regex in this format, output all finite strings that the regex matches. However, there will be (countably) infinitely many strings that are matched, and every single one of them must appear at some finite index in the output. You can assume the regex will have infinitely many potential matches, and your output may contain duplicate strings so long as every string eventually appears.

For example, given (a)*(b)*, your output should contain all strings consisting of zero or more as followed by zero or more bs. Given this regex, outputting , a, aa, aaa, aaaa, aaaaa... (note the empty string) would not be allowed as this output would never include any string containing one or more bs. Outputting , a, b, aa, ab, bb, aaa, aab, abb, bbb, aaaa, aaab, aabb, abbb, bbbb..., however, using Cantor diagonalization to enumerate all possibilities, is fine as it will eventually output every string.

Standard sequence output rules apply - you may output all matches infinitely, or take an 0/1-indexed n and output the nth / first n terms. If you output multiple strings, they should be output either as an array or separated by some consistent non-alphabetic delimiter. This is code-golf, shortest wins!

Testcases

These are potential sequences of the first 20 terms for the given regex. It's fine if your solution outputs a different sequence as long as it eventually outputs everything.

a(b)* -> a, ab, abb, abbb, abbbb, abbbbb, abbbbbb, abbbbbbb, abbbbbbbb, abbbbbbbbb, abbbbbbbbbb, abbbbbbbbbbb, abbbbbbbbbbbb, abbbbbbbbbbbbb, abbbbbbbbbbbbbb, abbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbbbb
(a)*(b)* -> , b, a, bb, ab, aa, bbb, abb, aab, aaa, bbbb, abbb, aabb, aaab, aaaa, bbbbb, abbbb, aabbb, aaabb, aaaab
(a)*(c)?(b)*d -> d, cd, bd, cbd, ad, acd, bbd, cbbd, abd, acbd, aad, aacd, bbbd, bbbcd, abbd, acbbd, aabd, aacbd, aaad, aaacd
((a|b))* -> , a , b , aa , ab , ba , bb , aaa , aab , aba , abb , baa , bab , bba , bbb , aaaa , aaab , aaba , aabb , abaa
(a(a)*b)* -> , ab, aab, aaab, abab, aaaab, aabab, abaab, aaaaab, aaabab, aabaab, abaaab, ababab, aaaaaab, aaaabab, aaabaab, aabaaab, abaaaab, aababab, abaabab
(a)*(a)* -> , a, aa, aaa, aaaa, aaaaa, aaaaaa, aaaaaaa, aaaaaaaa, aaaaaaaaa, aaaaaaaaaa, aaaaaaaaaaa, aaaaaaaaaaaa, aaaaaaaaaaaaa, aaaaaaaaaaaaaa, aaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaa
((a)*)* -> , a, aa, aaa, aaaa, aaaaa, aaaaaa, aaaaaaa, aaaaaaaa, aaaaaaaaa, aaaaaaaaaa, aaaaaaaaaaa, aaaaaaaaaaaa, aaaaaaaaaaaaa, aaaaaaaaaaaaaa, aaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaa
(((a)?|b))* -> , a , b , aa , ab , ba , bb , aaa , aab , aba , abb , baa , bab , bba , bbb , aaaa , aaab , aaba , aabb , abaa
(a(b)?c)* -> , ac , abc , acac , acabc , abcac , abcabc , acacac , acacabc , acabcac , acabcabc , abcacac , abcacabc , abcabcac , abcabcabc , acacacac , acacacabc , acacabcac , acacabcabc , acabcacac
((ab|cd))* -> , ab , cd , abab , abcd , cdab , cdcd , ababab , ababcd , abcdab , abcdcd , cdabab , cdabcd , cdcdab , cdcdcd , abababab , abababcd , ababcdab , ababcdcd , abcdabab

Question 2

@DLosc Sure, I'll add that to the spec.

Question 3

Is it okay if solutions that output all matches use inconsistent separators between matches--for example, sometimes space and sometimes newline?

Question 4

@DLosc I'll say no to that one, clarified the text a bit

Question 5

Python, 214 bytes

lambda r,i:g(r,i)[0]
def g(r,i,s=''):
 try:
 c,k=r;j=i//2
 if c>2:R=g(k[i%2],j)
 elif c>1:
 for r in k:S,i=g(r,i);s+=S
 R=s,i
 else:
 while i%2:S,i=g(k,j);s+=S;j=i//2;i*=c
 R=s,j
 except:R=r,i
 return R

Attempt This Online!

I guess we need at least one answer that solves the problem from scratch, so here is one. The regex is input as an AST in the following form:

(A|B) -> (ALT, [A, B])
ABC.. -> (SEQ, [A, B, C, ..])
(A)* -> (STAR, A)
(A)? -> (OPT, A)
a -> 'a'

where ALT, SEQ, STAR, and OPT are integer constants.

The lambda at the top takes a regex and a 0-based index, and outputs a generated string.

The index i is treated as a stream of bits (which ends with infinitely many zeros), and all decisions are made in binary fashion:

Given an atom, simply return that.
Given an alteration (A|B), consume one bit and choose to generate from either A or B.
Given a sequence ABC.., consume some bits to generate from A, next bits for B, and so on.
Given a star (A)*, consume one bit. If it is zero, stop. Otherwise, consume some bits to generate from A and return to the beginning. This always terminates because there are only finitely many one bits in i.
Given (A)?, proceed similarly to (A)* but stop after the first iteration.

This solves the challenge because every finite matching string has a path that generates it, which can be encoded in a finite number of bits. It is explicitly allowed to output the same string multiple times, and this solution actually outputs the same string infinitely many times (since the bits other than the part used to encode the output can be anything).

The way of passing around the remaining bits of i is similar to passing remaining buffer in recursive descent parsing.

Question 6

205 bytes by encoding atoms as (ATOM, 'a') with ATOM=4.

Question 7

202 from DLosc's by golfing the starting lambda

Question 8

(actually 197 with SEQ=-1 and an oversight in DLosc's golfing)

Question 9

195

Question 10

JavaScript (Node.js), 72 bytes

s=>{for(i=9n,t='';;t=i.toString(36))t.replace(s,i)==i++&&console.log(t)}

Try it online!

Question 11

@Arnauld Should be fixed.

Question 12

Python + Regenerate, (削除) 49 (削除ここまで) 46 bytes

-3 bytes thanks to xnor

from regenerate import*
main(input(),[],1e999)

Takes the regex from stdin and outputs all matches to stdout.

Try it online!

Regenerate is a language that does exactly this (and more). The Python code takes a line of input and calls Regenerate's main function with that regex, an empty list of program inputs, and a repetition limit of infinity.

Question 13

It looks like INFINITY is float('inf') so it suffices to write 1e999 which equals that.

Question 14

Python, (削除) 214 (削除ここまで) (削除) 211 (削除ここまで) (削除) 198 (削除ここまで) 195 bytes

-3 bytes thanks to emanresu A

def f(r,n=0):
 def m(r,i=-2):t,*R=r;p=lambda x,y:[a+b for a in x for b in y];return(t>3)*R or t>1and[p,list.__add__][t%2](*map(m,R))or(i<n*t)*R and[""]+p(m(R),m(r,i+1))
 *map(print,m(r)),f(r,n+1)

Attempt This Online!

Call f with a regex AST argument, get infinite matches printed to stdout. The AST is structured as follows (adapted from Bubbler's answer):

a -> [4, 'a']
(A|B) -> [3, A, B]
AB -> [2, A, B]
ABC -> [2, A, [2, B, C]]
(A)* -> [1] + A
(A)? -> [0] + A

where [1] + A means "concatenate a 1 to the front of list A."

I've been golfing this for... a while... and it's way too late. So the detailed explanation will have to wait. The basic idea, though, is that we let n be the max repetitions for the * operator and recurse the function infinitely with increasing values of n. There's a lot of redundant output, but the correct outputs are all there.

Question 15

mildly cursed 212 with _recursion_TM

Question 16

Python + numpy, 94 bytes

def f(s,i=0):re.sub(s,"",t:=numpy.base_repr(i,36).lower())or print(t);f(s,i+1)
import re,numpy

Attempt This Online!

Uses the iterate-all-strings technique from @l4m2's answer

Question 17

Perl 5 + `-M5.10.0 -a`, 30 bytes

}{/^@F$/&&say;$_++;y/1/a/;redo

Try it online!

Question 18

Java 10, 120 bytes

r->{var i=java.math.BigInteger.ONE;for(var t="";;i=i.add(i.ONE),t=i.toString(36))if(t.matches(r))System.out.println(t);}

Port of @l4m2's JavaScript answer, so make sure to upvote that answer as well.

Try it online.

Explanation:

r->{ // Method with String parameter and no return-type
 var i=java.math.BigInteger.ONE; // Loop-BigInteger `i`, starting at 1
 for(var t=""; // Temp-String `t`, starting empty
 // Loop indefinitely:
 ; // After every iteration:
 i=i.add(i.ONE), // Increase `i` by 1
 t=i.toString(36)) // Change `t` to `i` as base-36 string
 if(t.matches(r)) // If `t` fully† matches the input-regex:
 System.out.println(t);} // Print `t` with trailing newline

† Java's String#matches implicitly adds a leading ^ and trailing $ to match the regex on the entire String.

Question 19

The language should be "Java 10", because Java 8 has no var.

Question 20

@corvus_192 You're completely right! I've fixed the header and JDK link.

Question 21

Haskell + hgl, (削除) 43 (削除ここまで), (削除) 40 (削除ここまで) 36 bytes

-3 bytes thanks to Wheat Wizard
-6 bytes

Requires -XExtendedDefaultRules

import P
s=wn(eF 1>~rM β)<ø<<gcx

Attempt This Online!

Ungolfed:

β = ['a'..'z']
ø = null
allStrings = [1..] >>= (`replicateM` β)
s = filterNot allStrings . ø << allCompleteRegexMatches

Question 22

(nø<)<gcx can be nø<<gcx. And if you replace wh with wn you can replace nø with ø to save a byte. Having to use eF(1::Int) is really annoying. I'll think of some way to not have to do that.

Question 23

Thank you, edited

Question 24

Ok, I realized a way to get around the eF(1::Int) issue. Replace eF(1::Int)>~rM β with sQ~<ix(β:)i to save 5 bytes.

Question 25

Also worth pointing out that I personally would score this as being 11 bytes shorter than the score given, I generally don't score the import P since it's required to enable the hgl part of Haskell + hgl, and s= is not generally scored for point free functions submissions. You're free to score this whichever way you want as far as I'm concerned, but I wanted to mention it.

Question 26

@WheatWizard I found out you can use -XExtendedDefaultRules to get type defaulting like in GHCi. I don't even know whether it defaults to Int or Integer, but either way works in this case.

Question 27

05AB1E, 31 (or 30?) bytes

áÞæJÙʒ’St•–.•«?("ÿ",~r/^ÿ$/)’.E

Outputs an infinite list.

Try it online.

Could be -1 byte if we're allowed to output duplicated results, by removing the Ù:

Try it online.

Explanation:

05AB1E lacks regexes, but we can use .E to execute a string as Elixir code. So this uses Elixir's String.match.

á # Only leave the letters of the (implicit) input-regex
 Þ # Convert it to an infinite cycled list
 æ # Pop and get the powerset of this
 J # Join each inner list of characters to a string
 Ù # (optionally) Uniquify these stringsʒ’St•–.•«?("ÿ",~r/^ÿ$/)’.E
 ʒ # Filter this infinite list of strings by:
 ’St•–.•«?("ÿ",~r/^ÿ$/)’
 # Push dictionary string 'String.match("ÿ",~r/^ÿ$/)', where the first `ÿ`
 # is replaced with the current string and the second with the (implicit)
 # input-regex
 .E # Pop and evaluate it as Elixir code, resulting in true/false
 # (after which the filtered infinite list is output implicitly as result)

See this 05AB1E tip of mine (section How to use the dictionary?) to understand why ’St•–.•«?("ÿ",~r/^ÿ$/)’ is String.match("ÿ",~r/^ÿ$/).

Question 28

Indeed, duplicated results are allowed: "your output may contain duplicate strings so long as every string eventually appears."

Question 29

Charcoal, 60 bytes

≔↨N2θ⊞υAW∧υ⊟υ≡§ι0ωF⮌Φιλ⊞υκ?F∧θ⊟θ⊞υ§ι1*W∧θ⊟θ⊞υ§ι1|⊞υ§ι∨∧θ⊟θ2ι

Try it online! Link is to verbose version of code. Takes as input the 0-indexed match and the regex as an AST (see below). Explanation: Port of @Bubbler's Python answer.

≔↨N2θ

Input n and convert it to binary.

⊞υA

Start by processing the whole regex.

W∧υ⊟υ

Repeat until there are no more terms to process.

≡§ι0

Switch over the first element of the term.

ωF⮌Φιλ⊞υκ

If this is a concatenation then push all of the child terms (in reverse order, since they'll be processed by popping).

?F∧θ⊟θ⊞υ§ι1

If this is an option then push the child term only if the next bit is a 1.

*W∧θ⊟θ⊞υ§ι1

If this is an option then repeatedly push the child term while the next bit is a 1.

|⊞υ§ι∨∧θ⊟θ2

If this is an alternation then push one of the two child terms depending on the value of the next bit.

ι

Otherwise just print the term.

Terms have the following form:

('', terms...) Concatenation
('?', term) Option
('*', term) Repetition
('|', term, term) Alternation

Note that as an optimisation if all of the terms are letters then you can use the string formed by concatenating them together e.g. '?a' is equivalent to ('?', 'a') while ('', 'a', 'b') is equivalent to ab.

Anything that does not match the above is treated as a literal.

63 bytes to output the first n matches (plus a leading blank line):

FEN↨ι2«⸿⊞υηW∧υ⊟υ≡§κ0ωF⮌Φκμ⊞υλ?F∧ι⊟ι⊞υ§κ1*W∧ι⊟ι⊞υ§κ1|⊞υ§κ∨∧ι⊟ι2κ

Try it online! Link is to verbose version of code. Takes as input the number of matches and the regex as an AST as above.

Bubbler 79.3k5 gold badges162 silver badges484 bronze badges · Accepted Answer · 2024-04-26 06:13:59Z

Python, 214 bytes

lambda r,i:g(r,i)[0]
def g(r,i,s=''):
 try:
 c,k=r;j=i//2
 if c>2:R=g(k[i%2],j)
 elif c>1:
 for r in k:S,i=g(r,i);s+=S
 R=s,i
 else:
 while i%2:S,i=g(k,j);s+=S;j=i//2;i*=c
 R=s,j
 except:R=r,i
 return R

Attempt This Online!

I guess we need at least one answer that solves the problem from scratch, so here is one. The regex is input as an AST in the following form:

(A|B) -> (ALT, [A, B])
ABC.. -> (SEQ, [A, B, C, ..])
(A)* -> (STAR, A)
(A)? -> (OPT, A)
a -> 'a'

where ALT, SEQ, STAR, and OPT are integer constants.

The lambda at the top takes a regex and a 0-based index, and outputs a generated string.

The index i is treated as a stream of bits (which ends with infinitely many zeros), and all decisions are made in binary fashion:

Given an atom, simply return that.
Given an alteration (A|B), consume one bit and choose to generate from either A or B.
Given a sequence ABC.., consume some bits to generate from A, next bits for B, and so on.
Given a star (A)*, consume one bit. If it is zero, stop. Otherwise, consume some bits to generate from A and return to the beginning. This always terminates because there are only finitely many one bits in i.
Given (A)?, proceed similarly to (A)* but stop after the first iteration.

This solves the challenge because every finite matching string has a path that generates it, which can be encoded in a finite number of bits. It is explicitly allowed to output the same string multiple times, and this solution actually outputs the same string infinitely many times (since the bits other than the part used to encode the output can be anything).

The way of passing around the remaining bits of i is similar to passing remaining buffer in recursive descent parsing.

(actually 197 with SEQ=-1 and an oversight in DLosc's golfing)

Stack Exchange Network

Enumerate all matches of a regex

Testcases

10 Answers 10

Python, 214 bytes

JavaScript (Node.js), 72 bytes

Python + Regenerate, (削除) 49 (削除ここまで) 46 bytes

Python, (削除) 214 (削除ここまで) (削除) 211 (削除ここまで) (削除) 198 (削除ここまで) 195 bytes

Python + numpy, 94 bytes

Perl 5 + `-M5.10.0 -a`, 30 bytes

Java 10, 120 bytes

Haskell + hgl, (削除) 43 (削除ここまで), (削除) 40 (削除ここまで) 36 bytes

05AB1E, 31 (or 30?) bytes

Charcoal, 60 bytes

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Enumerate all matches of a regex

Testcases

10 Answers 10

Python, 214 bytes

JavaScript (Node.js), 72 bytes

Python + Regenerate, (削除) 49 (削除ここまで) 46 bytes

Python, (削除) 214 (削除ここまで) (削除) 211 (削除ここまで) (削除) 198 (削除ここまで) 195 bytes

Python + numpy, 94 bytes

Perl 5 + -M5.10.0 -a, 30 bytes

Java 10, 120 bytes

Haskell + hgl, (削除) 43 (削除ここまで), (削除) 40 (削除ここまで) 36 bytes

05AB1E, 31 (or 30?) bytes

Charcoal, 60 bytes

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Perl 5 + `-M5.10.0 -a`, 30 bytes