For this challenge, we'll be using a simplified dialect of regular expressions, where:
- A lowercase letter from
atozmatches itself. (expr1|expr2)matches either ofexpr1andexpr2. You will not have to handle more than two operands, and you can assume there will always be parentheses around this. Instead of(a|b|c)you'll recieve((a|b)|c).(expr)*matches zero or more copies ofexpr. You can assume there will always be parentheses around the expression, i.e. you'll recieve(a)*and nota*.(expr)?matches either nothing orexpr. You can assume there will always be parentheses around the expression, i.e. you'll recieve(a)?and nota?.expr1expr2matchesexpr1followed byexpr2. If taking an AST (see below) you may assume concatenation will always have exactly two arguments i.e.["concat", ["concat",["a","b"]],"c"]to representabcrather than["concat",["a","b","c"]].
You may instead take the expression as some sort of AST. For example, this would be an acceptable format for the regex (be)*((a)?|((c|d))*), but you can use any reasonable format as long as it doesn't contain any information not present in the original regex. You may assume the expression and all subexpressions are nonempty - ()*, (a|), ()? and similar will never occur in the input. You can also assume there will be no redundant parentheses - ((a)), (a)b and similar will never occur.
Your challenge is to, given a simplified regex in this format, output all finite strings that the regex matches. However, there will be (countably) infinitely many strings that are matched, and every single one of them must appear at some finite index in the output. You can assume the regex will have infinitely many potential matches, and your output may contain duplicate strings so long as every string eventually appears.
For example, given (a)*(b)*, your output should contain all strings consisting of zero or more as followed by zero or more bs. Given this regex, outputting , a, aa, aaa, aaaa, aaaaa... (note the empty string) would not be allowed as this output would never include any string containing one or more bs. Outputting , a, b, aa, ab, bb, aaa, aab, abb, bbb, aaaa, aaab, aabb, abbb, bbbb..., however, using Cantor diagonalization to enumerate all possibilities, is fine as it will eventually output every string.
Standard sequence output rules apply - you may output all matches infinitely, or take an 0/1-indexed n and output the nth / first n terms. If you output multiple strings, they should be output either as an array or separated by some consistent non-alphabetic delimiter. This is code-golf, shortest wins!
Testcases
These are potential sequences of the first 20 terms for the given regex. It's fine if your solution outputs a different sequence as long as it eventually outputs everything.
a(b)* -> a, ab, abb, abbb, abbbb, abbbbb, abbbbbb, abbbbbbb, abbbbbbbb, abbbbbbbbb, abbbbbbbbbb, abbbbbbbbbbb, abbbbbbbbbbbb, abbbbbbbbbbbbb, abbbbbbbbbbbbbb, abbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbbb, abbbbbbbbbbbbbbbbbbb
(a)*(b)* -> , b, a, bb, ab, aa, bbb, abb, aab, aaa, bbbb, abbb, aabb, aaab, aaaa, bbbbb, abbbb, aabbb, aaabb, aaaab
(a)*(c)?(b)*d -> d, cd, bd, cbd, ad, acd, bbd, cbbd, abd, acbd, aad, aacd, bbbd, bbbcd, abbd, acbbd, aabd, aacbd, aaad, aaacd
((a|b))* -> , a , b , aa , ab , ba , bb , aaa , aab , aba , abb , baa , bab , bba , bbb , aaaa , aaab , aaba , aabb , abaa
(a(a)*b)* -> , ab, aab, aaab, abab, aaaab, aabab, abaab, aaaaab, aaabab, aabaab, abaaab, ababab, aaaaaab, aaaabab, aaabaab, aabaaab, abaaaab, aababab, abaabab
(a)*(a)* -> , a, aa, aaa, aaaa, aaaaa, aaaaaa, aaaaaaa, aaaaaaaa, aaaaaaaaa, aaaaaaaaaa, aaaaaaaaaaa, aaaaaaaaaaaa, aaaaaaaaaaaaa, aaaaaaaaaaaaaa, aaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaa
((a)*)* -> , a, aa, aaa, aaaa, aaaaa, aaaaaa, aaaaaaa, aaaaaaaa, aaaaaaaaa, aaaaaaaaaa, aaaaaaaaaaa, aaaaaaaaaaaa, aaaaaaaaaaaaa, aaaaaaaaaaaaaa, aaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaa
(((a)?|b))* -> , a , b , aa , ab , ba , bb , aaa , aab , aba , abb , baa , bab , bba , bbb , aaaa , aaab , aaba , aabb , abaa
(a(b)?c)* -> , ac , abc , acac , acabc , abcac , abcabc , acacac , acacabc , acabcac , acabcabc , abcacac , abcacabc , abcabcac , abcabcabc , acacacac , acacacabc , acacabcac , acacabcabc , acabcacac
((ab|cd))* -> , ab , cd , abab , abcd , cdab , cdcd , ababab , ababcd , abcdab , abcdcd , cdabab , cdabcd , cdcdab , cdcdcd , abababab , abababcd , ababcdab , ababcdcd , abcdabab
-
\$\begingroup\$ @DLosc Sure, I'll add that to the spec. \$\endgroup\$emanresu A– emanresu A2024年04月26日 07:21:20 +00:00Commented Apr 26, 2024 at 7:21
-
\$\begingroup\$ Is it okay if solutions that output all matches use inconsistent separators between matches--for example, sometimes space and sometimes newline? \$\endgroup\$DLosc– DLosc2024年04月26日 08:34:23 +00:00Commented Apr 26, 2024 at 8:34
-
\$\begingroup\$ @DLosc I'll say no to that one, clarified the text a bit \$\endgroup\$emanresu A– emanresu A2024年04月26日 08:36:16 +00:00Commented Apr 26, 2024 at 8:36
10 Answers 10
Python, 214 bytes
lambda r,i:g(r,i)[0]
def g(r,i,s=''):
try:
c,k=r;j=i//2
if c>2:R=g(k[i%2],j)
elif c>1:
for r in k:S,i=g(r,i);s+=S
R=s,i
else:
while i%2:S,i=g(k,j);s+=S;j=i//2;i*=c
R=s,j
except:R=r,i
return R
I guess we need at least one answer that solves the problem from scratch, so here is one. The regex is input as an AST in the following form:
(A|B) -> (ALT, [A, B])
ABC.. -> (SEQ, [A, B, C, ..])
(A)* -> (STAR, A)
(A)? -> (OPT, A)
a -> 'a'
where ALT, SEQ, STAR, and OPT are integer constants.
The lambda at the top takes a regex and a 0-based index, and outputs a generated string.
The index i is treated as a stream of bits (which ends with infinitely many zeros), and all decisions are made in binary fashion:
- Given an atom, simply return that.
- Given an alteration
(A|B), consume one bit and choose to generate from eitherAorB. - Given a sequence
ABC.., consume some bits to generate fromA, next bits forB, and so on. - Given a star
(A)*, consume one bit. If it is zero, stop. Otherwise, consume some bits to generate fromAand return to the beginning. This always terminates because there are only finitely many one bits ini. - Given
(A)?, proceed similarly to(A)*but stop after the first iteration.
This solves the challenge because every finite matching string has a path that generates it, which can be encoded in a finite number of bits. It is explicitly allowed to output the same string multiple times, and this solution actually outputs the same string infinitely many times (since the bits other than the part used to encode the output can be anything).
The way of passing around the remaining bits of i is similar to passing remaining buffer in recursive descent parsing.
-
2
-
1\$\begingroup\$ 202 from DLosc's by golfing the starting lambda \$\endgroup\$emanresu A– emanresu A2024年04月26日 20:56:02 +00:00Commented Apr 26, 2024 at 20:56
-
\$\begingroup\$ (actually 197 with SEQ=-1 and an oversight in DLosc's golfing) \$\endgroup\$emanresu A– emanresu A2024年04月27日 01:22:41 +00:00Commented Apr 27, 2024 at 1:22
-
\$\begingroup\$ 195 \$\endgroup\$emanresu A– emanresu A2024年04月27日 01:32:24 +00:00Commented Apr 27, 2024 at 1:32
JavaScript (Node.js), 72 bytes
s=>{for(i=9n,t='';;t=i.toString(36))t.replace(s,i)==i++&&console.log(t)}
-
1\$\begingroup\$ @Arnauld Should be fixed. \$\endgroup\$l4m2– l4m22024年04月26日 06:57:08 +00:00Commented Apr 26, 2024 at 6:57
Python + Regenerate, (削除) 49 (削除ここまで) 46 bytes
-3 bytes thanks to xnor
from regenerate import*
main(input(),[],1e999)
Takes the regex from stdin and outputs all matches to stdout.
Regenerate is a language that does exactly this (and more). The Python code takes a line of input and calls Regenerate's main function with that regex, an empty list of program inputs, and a repetition limit of infinity.
-
4\$\begingroup\$ It looks like
INFINITYisfloat('inf')so it suffices to write1e999which equals that. \$\endgroup\$xnor– xnor2024年04月26日 08:25:07 +00:00Commented Apr 26, 2024 at 8:25
Python, (削除) 214 (削除ここまで) (削除) 211 (削除ここまで) (削除) 198 (削除ここまで) 195 bytes
-3 bytes thanks to emanresu A
def f(r,n=0):
def m(r,i=-2):t,*R=r;p=lambda x,y:[a+b for a in x for b in y];return(t>3)*R or t>1and[p,list.__add__][t%2](*map(m,R))or(i<n*t)*R and[""]+p(m(R),m(r,i+1))
*map(print,m(r)),f(r,n+1)
Call f with a regex AST argument, get infinite matches printed to stdout. The AST is structured as follows (adapted from Bubbler's answer):
a -> [4, 'a']
(A|B) -> [3, A, B]
AB -> [2, A, B]
ABC -> [2, A, [2, B, C]]
(A)* -> [1] + A
(A)? -> [0] + A
where [1] + A means "concatenate a 1 to the front of list A."
I've been golfing this for... a while... and it's way too late. So the detailed explanation will have to wait. The basic idea, though, is that we let n be the max repetitions for the * operator and recurse the function infinitely with increasing values of n. There's a lot of redundant output, but the correct outputs are all there.
-
2\$\begingroup\$ mildly cursed 212 with _recursion_TM \$\endgroup\$emanresu A– emanresu A2024年04月26日 09:59:54 +00:00Commented Apr 26, 2024 at 9:59
Python + numpy, 94 bytes
def f(s,i=0):re.sub(s,"",t:=numpy.base_repr(i,36).lower())or print(t);f(s,i+1)
import re,numpy
Uses the iterate-all-strings technique from @l4m2's answer
Java 10, 120 bytes
r->{var i=java.math.BigInteger.ONE;for(var t="";;i=i.add(i.ONE),t=i.toString(36))if(t.matches(r))System.out.println(t);}
Port of @l4m2's JavaScript answer, so make sure to upvote that answer as well.
Explanation:
r->{ // Method with String parameter and no return-type
var i=java.math.BigInteger.ONE; // Loop-BigInteger `i`, starting at 1
for(var t=""; // Temp-String `t`, starting empty
// Loop indefinitely:
; // After every iteration:
i=i.add(i.ONE), // Increase `i` by 1
t=i.toString(36)) // Change `t` to `i` as base-36 string
if(t.matches(r)) // If `t` fully† matches the input-regex:
System.out.println(t);} // Print `t` with trailing newline
† Java's String#matches implicitly adds a leading ^ and trailing $ to match the regex on the entire String.
-
\$\begingroup\$ The language should be "Java 10", because Java 8 has no
var. \$\endgroup\$corvus_192– corvus_1922024年04月26日 11:59:25 +00:00Commented Apr 26, 2024 at 11:59 -
\$\begingroup\$ @corvus_192 You're completely right! I've fixed the header and JDK link. \$\endgroup\$Kevin Cruijssen– Kevin Cruijssen2024年04月26日 12:54:25 +00:00Commented Apr 26, 2024 at 12:54
Haskell + hgl, (削除) 43 (削除ここまで), (削除) 40 (削除ここまで) 36 bytes
-3 bytes thanks to Wheat Wizard
-6 bytes
Requires -XExtendedDefaultRules
import P
s=wn(eF 1>~rM β)<ø<<gcx
Ungolfed:
β = ['a'..'z']
ø = null
allStrings = [1..] >>= (`replicateM` β)
s = filterNot allStrings . ø << allCompleteRegexMatches
-
\$\begingroup\$
(nø<)<gcxcan benø<<gcx. And if you replacewhwithwnyou can replacenøwithøto save a byte. Having to useeF(1::Int)is really annoying. I'll think of some way to not have to do that. \$\endgroup\$2024年06月20日 22:20:25 +00:00Commented Jun 20, 2024 at 22:20 -
\$\begingroup\$ Thank you, edited \$\endgroup\$corvus_192– corvus_1922024年06月22日 16:13:40 +00:00Commented Jun 22, 2024 at 16:13
-
\$\begingroup\$ Ok, I realized a way to get around the
eF(1::Int)issue. ReplaceeF(1::Int)>~rM βwithsQ~<ix(β:)ito save 5 bytes. \$\endgroup\$2024年06月22日 20:35:52 +00:00Commented Jun 22, 2024 at 20:35 -
\$\begingroup\$ Also worth pointing out that I personally would score this as being 11 bytes shorter than the score given, I generally don't score the
import Psince it's required to enable the hgl part of Haskell + hgl, ands=is not generally scored for point free functions submissions. You're free to score this whichever way you want as far as I'm concerned, but I wanted to mention it. \$\endgroup\$2024年06月22日 20:35:57 +00:00Commented Jun 22, 2024 at 20:35 -
1\$\begingroup\$ @WheatWizard I found out you can use
-XExtendedDefaultRulesto get type defaulting like in GHCi. I don't even know whether it defaults to Int or Integer, but either way works in this case. \$\endgroup\$corvus_192– corvus_1922024年07月16日 20:54:04 +00:00Commented Jul 16, 2024 at 20:54
05AB1E, 31 (or 30?) bytes
áÞæJÙʒ’St•–.•«?("ÿ",~r/^ÿ$/)’.E
Outputs an infinite list.
Could be -1 byte if we're allowed to output duplicated results, by removing the Ù:
Explanation:
05AB1E lacks regexes, but we can use .E to execute a string as Elixir code. So this uses Elixir's String.match.
á # Only leave the letters of the (implicit) input-regex
Þ # Convert it to an infinite cycled list
æ # Pop and get the powerset of this
J # Join each inner list of characters to a string
Ù # (optionally) Uniquify these stringsʒ’St•–.•«?("ÿ",~r/^ÿ$/)’.E
ʒ # Filter this infinite list of strings by:
’St•–.•«?("ÿ",~r/^ÿ$/)’
# Push dictionary string 'String.match("ÿ",~r/^ÿ$/)', where the first `ÿ`
# is replaced with the current string and the second with the (implicit)
# input-regex
.E # Pop and evaluate it as Elixir code, resulting in true/false
# (after which the filtered infinite list is output implicitly as result)
See this 05AB1E tip of mine (section How to use the dictionary?) to understand why ’St•–.•«?("ÿ",~r/^ÿ$/)’ is String.match("ÿ",~r/^ÿ$/).
-
\$\begingroup\$ Indeed, duplicated results are allowed: "your output may contain duplicate strings so long as every string eventually appears." \$\endgroup\$DLosc– DLosc2024年04月26日 19:08:24 +00:00Commented Apr 26, 2024 at 19:08
Charcoal, 60 bytes
≔↨N2θ⊞υAW∧υ⊟υ≡§ι0ωF⮌Φιλ⊞υκ?F∧θ⊟θ⊞υ§ι1*W∧θ⊟θ⊞υ§ι1|⊞υ§ι∨∧θ⊟θ2ι
Try it online! Link is to verbose version of code. Takes as input the 0-indexed match and the regex as an AST (see below). Explanation: Port of @Bubbler's Python answer.
≔↨N2θ
Input n and convert it to binary.
⊞υA
Start by processing the whole regex.
W∧υ⊟υ
Repeat until there are no more terms to process.
≡§ι0
Switch over the first element of the term.
ωF⮌Φιλ⊞υκ
If this is a concatenation then push all of the child terms (in reverse order, since they'll be processed by popping).
?F∧θ⊟θ⊞υ§ι1
If this is an option then push the child term only if the next bit is a 1.
*W∧θ⊟θ⊞υ§ι1
If this is an option then repeatedly push the child term while the next bit is a 1.
|⊞υ§ι∨∧θ⊟θ2
If this is an alternation then push one of the two child terms depending on the value of the next bit.
ι
Otherwise just print the term.
Terms have the following form:
('', terms...) Concatenation
('?', term) Option
('*', term) Repetition
('|', term, term) Alternation
Note that as an optimisation if all of the terms are letters then you can use the string formed by concatenating them together e.g. '?a' is equivalent to ('?', 'a') while ('', 'a', 'b') is equivalent to ab.
Anything that does not match the above is treated as a literal.
63 bytes to output the first n matches (plus a leading blank line):
FEN↨ι2«⸿⊞υηW∧υ⊟υ≡§κ0ωF⮌Φκμ⊞υλ?F∧ι⊟ι⊞υ§κ1*W∧ι⊟ι⊞υ§κ1|⊞υ§κ∨∧ι⊟ι2κ
Try it online! Link is to verbose version of code. Takes as input the number of matches and the regex as an AST as above.
Explore related questions
See similar questions with these tags.