The purpose of this task is to write a program or function to find the most common substring within a given string, of the specified length.
Inputs
- A string,
s
, of any length and characters that are valid as inputs for your language. - A number,
n
, indicating the length of substring to find. Guaranteed to be equal to or less than the length of the string.
Output
The most common (case-sensitive) substring of s
of length n
. In the case of a tie, any of the options may be output.
Win Criteria
This is code-golf, so lowest bytes wins; usual exceptions etc. apply.
Examples
Hello, World!
, 1
> l
(appears 3 times)
Hello! Cheerio!
, 2
> o!
(appears twice)
The Cat sat on the mat
, 2
> at
(appears three times)
The Cat sat on the mat
, 3
> at
or he
(both with trailing spaces. Both appear twice - note case-sensitivity so The
and the
are different substrings)
Mississippi
, 4
> issi
(appears twice, note that the two occurrences overlap each other)
Some more examples, all using the same input string*
Bb:maj/2
F:maj/5
Bb:maj/2
G:9/3
Bb:maj7
F:maj/3
G:min7
C:sus4(b7,9)
C:sus4(b7,9)
C:sus4(b7,9)
F:maj
F:maj
(note the trailing newline)
1 > \r\n
(appears 12 times, counting \r\n
as a single character - my choice, your code may differ on this - otherwise either \r
or \n
appear the same number of times). :
also appears 12 times
2 > :m
(appears 8 times)
3 > :ma
or maj
(appears 7 times)
4 > :maj
(appears 7 times)
5 > F:maj
or \r\nF:ma
or :maj/
(appears 4 times)
14 > \r\nC:sus4(b7,9)\r\n
(appears 3 times)
(* input is taken from How predictable is popular music?. The result of this task might be used, for example, to find the most efficient way to apply a substitution in order to compress the string)
21 Answers 21
Python 3.8, (削除) 75 (削除ここまで) (削除) 68 (削除ここまで) (削除) 97 (削除ここまで) 63 bytes
lambda s,n:max(l:=[s[j:j+n]for j in range(len(s))],key=l.count)
You can try it online! Increased byte count because, as @xnor kindly pointed out, Python's str.count
doesn't count overlapping substrings... But then @xnor's reformulation of what I was doing allowed to slice a third of the bytes!
How:
l:=[s[j:j+n]for j in range(len(s))]
This creates a list of all the substrings in s
of size n
, plus the suffixes of size n-1
, n-2
, ..., 1
. Then we find the max
on that list, with the numerical value being used to sort given by
l.count
That means that if a
and b
are from l
, a > b
if a
shows up more times in l
than b
. This means the shorter suffixes are never the result of the max
because they only show up once, and even if the correct-sized substrings only show up once, they are first in l
so they come out instead of the short suffixes. This allows me to save some bytes in the range
used, given that I don't have to prevent i+n
from being larger than len(s)
.
(削除) # Python, 83 (削除ここまで) 64 bytes
lambda s,n:max((s[i:i+n]for i in range(len(s)-n+1)),key=s.count)
Thanks to @mypetlion I saved a LOT of bytes :D
You can try it online with my very own incredible test case!
-
\$\begingroup\$ Replace
sorted(...)[-1]
withmax(...)
to save 7 bytes. \$\endgroup\$mypetlion– mypetlion2020年02月18日 18:17:14 +00:00Commented Feb 18, 2020 at 18:17 -
\$\begingroup\$ Replace
lambda t:s.count(t)
withs.count
to save 12 bytes. \$\endgroup\$mypetlion– mypetlion2020年02月18日 18:17:49 +00:00Commented Feb 18, 2020 at 18:17 -
1\$\begingroup\$ fixes typos \$\endgroup\$S.S. Anne– S.S. Anne2020年02月18日 23:25:17 +00:00Commented Feb 18, 2020 at 23:25
-
1\$\begingroup\$ @S.S.Anne gives credit, too \$\endgroup\$RGS– RGS2020年02月18日 23:31:37 +00:00Commented Feb 18, 2020 at 23:31
-
1\$\begingroup\$ It looks like this unfortunately doesn't get the
"Mississippi", 4
test case right, due to how Python counts substrings. \$\endgroup\$xnor– xnor2020年02月19日 08:04:58 +00:00Commented Feb 19, 2020 at 8:04
05AB1E, 5 bytes
ŒIù.M
Try it online or verify the smaller test cases or verify the larger test case (which is a bit slow).
Explanation:
Œ # Push all substrings of the (implicit) input-string
Iù # Only keep substrings of a length equal to the second input-integer
.M # Only keep the most frequent item of the remaining substrings
# (after which it is output implicitly as result)
Brachylog, 8 bytes
s)fọtoth
Explanation
f Find all...
s ...substrings of <1st element of input>...
) ...of length <2nd element of input>
ọ Get the list of each substring with its number of occurrence
to Order by the number of occurrence
t Take the last one
h Output the substring itself
JavaScript (ES6), (削除) 85 83 (削除ここまで) 80 bytes
Takes input as (string)(length)
.
s=>n=>[...s].map(o=m=(x,i)=>(o[k=s.substr(i,n)]=-~o[k])<m|!k[n-1]?0:m=o[O=k])&&O
Java 10, (削除) 212 (削除ここまで) (削除) 211 (削除ここまで) (削除) 196 (削除ここまで) (削除) 195 (削除ここまで) (削除) 146 (削除ここまで) 143 bytes
s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;}
-15 bytes by outputting the substrings with their count, which is allowed.
-1 byte thanks to @ceilingcat.
-52 bytes thanks to @OlivierGrégoire.
Explanation:
s->n->{ // Method with String and Integer parameters and String return-type
String r="", // Result-String, starting empty
t; // Temp-String
for(int m=0, // Largest count, starting at 0
c, // Temp count integer
j, // Temp index integer
i // Index-integer `i`
=s.length();i-->n;)
// Loop `i` in the range (input-length, n]:
for(c=0, // Reset the temp-count to 0
j=-1; // And the temp-index integer to -1
(j=s.indexOf(t=s.substring(i-n,i)
// Set String `t` to a substring of the input-String in the range [i-n,i)
j+1))// Set the temp-index `j` to the index of String `t` in the input-String,
// only looking at the trailing portion after index `j+1`
>=0;) // And continue looping as long as long as that substring is found
if(++c>m){// If the temp-count + 1 is larger than the current largest count:
m=c; // Set this current largest count to this temp-count + 1
r=t;} // And the result-String to this temp-String
return r;} // After the nested loops, return the result-String
-
1\$\begingroup\$ 146 bytes:
S->n->{String R="",x;for(int M=0,I=0,m,i;I<S.length()-n;I++)for(m=0,i=-1;(i=S.indexOf(x=S.substring(I,I+n),i+1))>=0;)if(++m>M){M=m;R=x;}return R;}
(no imports, and Java 8, not 10) \$\endgroup\$Olivier Grégoire– Olivier Grégoire2020年02月20日 09:04:03 +00:00Commented Feb 20, 2020 at 9:04 -
\$\begingroup\$ @OlivierGrégoire Thanks! That's a much better approach. I had the feeling it was possible without the Map and/or stream, but couldn't figure it out. \$\endgroup\$Kevin Cruijssen– Kevin Cruijssen2020年02月20日 09:53:23 +00:00Commented Feb 20, 2020 at 9:53
-
1\$\begingroup\$ I also tried with
.split("\\Q"+substring+"\\E")
, without success because of the overlapping test case inMississippi
\$\endgroup\$Olivier Grégoire– Olivier Grégoire2020年02月20日 09:55:17 +00:00Commented Feb 20, 2020 at 9:55 -
\$\begingroup\$ 143 bytes:
s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;}
(just changing the iteration manner) \$\endgroup\$Olivier Grégoire– Olivier Grégoire2020年02月20日 10:04:28 +00:00Commented Feb 20, 2020 at 10:04
Perl 6, (削除) 49 (削除ここまで) 46 bytes
-3 bytes thanks to SirBogman
{$^n;bag($^a~~m:ov/.**{$n}/X~$).max(*{*}).key}
Explanation
{ } # Anonymous block
$^n; # Declare argument
$^a~~m / / # Regex match input string
:ov # Also return overlapping matches
.**{$n} # Match n-character strings
X~$ # Stringify matches
bag( ) # Convert to Bag (multiset)
.max(*{*}) # Find Pair string=>count with max value
.key # Get key (string)
-
-
\$\begingroup\$ @SirBogman Cool, this sure would have helped me a few times.
/$n/
is supposed to interpolate literal text (like\Q...\E
in Perl 5), so I can see why.**$n
doesn't work. \$\endgroup\$nwellnhof– nwellnhof2020年02月20日 19:25:29 +00:00Commented Feb 20, 2020 at 19:25 -
\$\begingroup\$ I don't fully understand what
.max(*{*})
is doing. It seems like both of those whatever parameters are Pairs? \$\endgroup\$SirBogman– SirBogman2020年02月20日 19:49:48 +00:00Commented Feb 20, 2020 at 19:49 -
1\$\begingroup\$ @SirBogman Only the first
*
is a whatever parameter. The second*
is simply to get a full slice of the Pair, returning a list containing its only value. So*{*}
is effectively the same as*.values
. \$\endgroup\$nwellnhof– nwellnhof2020年02月20日 22:20:29 +00:00Commented Feb 20, 2020 at 22:20 -
\$\begingroup\$ Cool. Those whatevers are very flexible. \$\endgroup\$SirBogman– SirBogman2020年02月21日 00:05:18 +00:00Commented Feb 21, 2020 at 0:05
PHP, (削除) 133 (削除ここまで) 126 bytes
for(;$k<=strlen($i=$argv[1])-$j=$argv[2];)$s[]=substr($i,$k++,$j);$c=array_count_values($s);asort($c);echo array_key_last($c);
Annoyingly asort()
(削除) and can't be chained into one expression.end()
(削除ここまで)
-7 bytes thanks to @manatwork
-
1\$\begingroup\$ PHP 7.3 has
array_key_last()
to reduce it to 126 characters. \$\endgroup\$manatwork– manatwork2020年02月18日 11:32:00 +00:00Commented Feb 18, 2020 at 11:32 -
\$\begingroup\$ I really should learn all my
array
functions. \$\endgroup\$Guillermo Phillips– Guillermo Phillips2020年02月18日 11:43:11 +00:00Commented Feb 18, 2020 at 11:43
JavaScript (Node.js), 92 bytes
s=>n=>s.map(F=t=>([F[w=(w+t).slice(-n)]=-~F[w],w]),w="").sort(([a],[b])=>a-b)[s.length-1][1]
Takes s
as a list of characters in (s)(n)
syntax.
-
\$\begingroup\$
=>a-b)[s.length-1]
to=>b-a)[0]
to sort in reverse order and save some bytes. \$\endgroup\$Kevin Cruijssen– Kevin Cruijssen2020年02月18日 13:11:52 +00:00Commented Feb 18, 2020 at 13:11 -
1\$\begingroup\$ @KevinCruijssen That could fail if there are no repeating substrings. Try
f([..."123456"])(2)
. \$\endgroup\$Shieru Asakoto– Shieru Asakoto2020年02月18日 13:36:38 +00:00Commented Feb 18, 2020 at 13:36 -
\$\begingroup\$ Ah, you're indeed right. My bad. \$\endgroup\$Kevin Cruijssen– Kevin Cruijssen2020年02月18日 13:42:51 +00:00Commented Feb 18, 2020 at 13:42
C# (Visual C# Interactive Compiler), (削除) 97 (削除ここまで), (削除) 96 (削除ここまで), 92 bytes
s=>n=>s.Skip(n-1).Select((_,i)=>s.Substring(i,n)).GroupBy(x=>x).OrderBy(g=>g.Count()).Last()
Explanation:
s.Skip(n-1).Select((_,i)=>s.Substring(i,n)) //select every substring of size n
.GroupBy(x=>x) //group by value
.OrderBy(g=>g.Count()) //order by increasing recurrence
.Last() //return the last one
-
1\$\begingroup\$ -1 byte by using a currying lambda
(s,n)=>
tos=>n=>
: Try it online. \$\endgroup\$Kevin Cruijssen– Kevin Cruijssen2020年02月20日 10:08:55 +00:00Commented Feb 20, 2020 at 10:08 -
\$\begingroup\$ I'm pretty sure returning a Grouping is not acceptable. \$\endgroup\$the default.– the default.2020年02月21日 01:10:02 +00:00Commented Feb 21, 2020 at 1:10
-
\$\begingroup\$ @mypronounismonicareinstate check the comments section \$\endgroup\$Innat3– Innat32020年02月21日 10:07:05 +00:00Commented Feb 21, 2020 at 10:07
Charcoal, 28 bytes
NθF⊕−Lηθ⊞υ✂ηι+θιUMυ⟦Noυιι⟧⊟⌈υ
Try it online! Link is to verbose version of code. Outputs the lexicographically largest substring with the highest count. Explanation:
Nθ
Input the desired substring length.
F⊕−Lηθ⊞υ✂ηι+θι
Extract all substrings of that length and push them to a list.
UMυ⟦Noυιι⟧
Replace each substring with a tuple of its count and the substring..
⊟⌈υ
Print the substring with the highest count.
-
\$\begingroup\$ Is there any specific reason you use your own interpreter instead of the ethproductions one or the TIO one? \$\endgroup\$S.S. Anne– S.S. Anne2020年02月18日 23:23:01 +00:00Commented Feb 18, 2020 at 23:23
-
1\$\begingroup\$ @S.S.Anne I don't know about Shaggy, but I use it because it has a lot of features that make golfing in Japt a lot easier, like the autogolf button \$\endgroup\$Gymhgy– Gymhgy2020年02月19日 01:04:30 +00:00Commented Feb 19, 2020 at 1:04
Wolfram Language (Mathematica), 30 bytes
#&@@Commonest[##~Partition~1]&
Takes an array of characters, returns an array of characters.
Wolfram Language (Mathematica), 36 bytes
#&@@Commonest[##~StringPartition~1]&
Takes a string, returns a string.
Zsh, 89 bytes
local -A m
repeat $#1 ((++m[${1:$[i++]:2ドル}]))
for k v (${(kv)m})((v>M))&&r=$k&&M=$v
<<<$r
local -A m # associative array, "local" is shorter than "typeset"
repeat $#1 { # looping for the length of the string will cause the
# program to use shorter, invalid strings as
# the bash-style slice gets cut short
((++m[${1:$[i++]:2ドル}])) # increment the map at the key with substring
} # starting at $[i++] with length 2ドル
for k v (${(kv)m}) { # loop over the associative array's keys/values
((v>M)) && r=$k && M=$v # find the max value M, save the key as r.
} # looping forward ensures we avoid our invalid subtrings
<<<$r # print $r
C++ (clang), (削除) 201 (削除ここまで) \$\cdots\$ (削除) 151 (削除ここまで) 150 bytes
#import<map>
using S=std::string;S f(S s,int n){std::map<S,int>m;S t,r;for(int i=0,x=0;i<=s.size()-n;r=x<++m[t=s.substr(i++,n)]?x=m[t],t:r);return r;}
Ungolfed
#include<map>
#include<string>
std::string f(const std::string& s,int n) {
std::map<std::string,int> m;
for(int i=0;i<=s.size()-n;++i) {
m[s.substr(i,n)]++;
}
int x=0;
std::string r;
for(auto a:m) {
if(x<a.second) {
x=a.second;
r=a.first;
}
}
return r;
}
Excel (Ver. 1911), 60 Bytes
B2 'Input: String
B3 'Input: Sub-string Length
C2 =-COUNTIF(D2#,D2#)
D2 =MID(B2,SEQUENCE(LEN(B2)),B3)
E2 =SORT(C2#:D2)
F2 'Output
Test Sample
Note: Newlines do exist in cells, but default formatting doesn't show them. Also only 10 of the 105 rows are shown to keep the image a decent size.
PowerShell, 72 bytes
param($s,$n)(0..($s.Length-$n)|%{$s|% s*g $_ $n}|group|sort C*|% N*)[-1]
Unrolled:
param($s,$n)
(
0..($s.Length-$n)|%{
$s|% substring $_ $n
}|group|sort Count|% Name
)[-1]
Go, 131 bytes
func f(s string,l int)(p string){x,i:=0,0
m:=map[string]int{}
for;i<=len(s)-l;i++{t:=s[i:i+l]
m[t]++
if m[t]>x{x=m[t]
p=t}}
return}
Takes as input a string
and an int
representing the string and the length of the substring respectively.
Some newlines can be replaced by semicolons if people prefer one-liners.
n=5
spot. regarding your second point, each line in the test case ends with a newline - there are 12 lines in the test case, so 11 newlines. Or am I missing something? Or maybe the formatting is messing things up? \$\endgroup\$:
does appear 12 times (not sure how you got 8?) - so that's actually the largest. I've added a trailing newline to the test case, so now we have\r\n
and:
tying. \$\endgroup\$:
occurs one more than\n
, but with a trailing newline they indeed both occur an equal amount of times. \$\endgroup\$