The most common substring

Question 1

The purpose of this task is to write a program or function to find the most common substring within a given string, of the specified length.

Inputs

A string, s, of any length and characters that are valid as inputs for your language.
A number, n, indicating the length of substring to find. Guaranteed to be equal to or less than the length of the string.

Output

The most common (case-sensitive) substring of s of length n. In the case of a tie, any of the options may be output.

Win Criteria

This is code-golf, so lowest bytes wins; usual exceptions etc. apply.

Examples

Hello, World!, 1 > l (appears 3 times)
Hello! Cheerio!, 2 > o! (appears twice)
The Cat sat on the mat, 2 > at (appears three times)
The Cat sat on the mat, 3 > at or he (both with trailing spaces. Both appear twice - note case-sensitivity so The and the are different substrings)
Mississippi, 4 > issi (appears twice, note that the two occurrences overlap each other)

Some more examples, all using the same input string^*

Bb:maj/2
F:maj/5
Bb:maj/2
G:9/3
Bb:maj7
F:maj/3
G:min7
C:sus4(b7,9)
C:sus4(b7,9)
C:sus4(b7,9)
F:maj
F:maj

(note the trailing newline)

1 > \r\n(appears 12 times, counting \r\n as a single character - my choice, your code may differ on this - otherwise either \r or \n appear the same number of times). : also appears 12 times
2 > :m (appears 8 times)
3 > :ma or maj (appears 7 times)
4 > :maj (appears 7 times)
5 > F:maj or \r\nF:ma or :maj/ (appears 4 times)
14 > \r\nC:sus4(b7,9)\r\n (appears 3 times)

_{(* input is taken from How predictable is popular music?. The result of this task might be used, for example, to find the most efficient way to apply a substitution in order to compress the string)}

Question 2

@KevinCruijssen thanks for the n=5 spot. regarding your second point, each line in the test case ends with a newline - there are 12 lines in the test case, so 11 newlines. Or am I missing something? Or maybe the formatting is messing things up?

Question 3

...but, : does appear 12 times (not sure how you got 8?) - so that's actually the largest. I've added a trailing newline to the test case, so now we have \r\n and : tying.

Question 4

Ah oops, not sure how I got 8 instead of 12 indeed.. Mainly wanted to say : occurs one more than \n, but with a trailing newline they indeed both occur an equal amount of times.

Question 5

Is it acceptable to return the highest occurring substring with its count? (C.f. this answer)

Question 6

@RGS sure, why not?

Question 7

Python 3.8, (削除) 75 (削除ここまで) (削除) 68 (削除ここまで) (削除) 97 (削除ここまで) 63 bytes

lambda s,n:max(l:=[s[j:j+n]for j in range(len(s))],key=l.count)

You can try it online! Increased byte count because, as @xnor kindly pointed out, Python's str.count doesn't count overlapping substrings... But then @xnor's reformulation of what I was doing allowed to slice a third of the bytes!

How:

l:=[s[j:j+n]for j in range(len(s))]

This creates a list of all the substrings in s of size n, plus the suffixes of size n-1, n-2, ..., 1. Then we find the max on that list, with the numerical value being used to sort given by

l.count

That means that if a and b are from l, a > b if a shows up more times in l than b. This means the shorter suffixes are never the result of the max because they only show up once, and even if the correct-sized substrings only show up once, they are first in l so they come out instead of the short suffixes. This allows me to save some bytes in the range used, given that I don't have to prevent i+n from being larger than len(s).

~~(削除) # Python, ~~83 (削除ここまで)~~ 64 bytes~~

lambda s,n:max((s[i:i+n]for i in range(len(s)-n+1)),key=s.count)

Thanks to @mypetlion I saved a LOT of bytes :D

You can try it online with my very own incredible test case!

Question 8

Replace sorted(...)[-1] with max(...) to save 7 bytes.

Question 9

Replace lambda t:s.count(t) with s.count to save 12 bytes.

Question 10

fixes typos

Question 11

@S.S.Anne gives credit, too

Question 12

It looks like this unfortunately doesn't get the "Mississippi", 4 test case right, due to how Python counts substrings.

Question 13

05AB1E, 5 bytes

ŒIù.M

Try it online or verify the smaller test cases or verify the larger test case (which is a bit slow).

Explanation:

Œ # Push all substrings of the (implicit) input-string
 Iù # Only keep substrings of a length equal to the second input-integer
 .M # Only keep the most frequent item of the remaining substrings
 # (after which it is output implicitly as result)

Question 14

Brachylog, 8 bytes

s)fọtoth

Try it online!

Explanation

 f Find all...
s ...substrings of <1st element of input>...
 ) ...of length <2nd element of input>
 ọ Get the list of each substring with its number of occurrence
 to Order by the number of occurrence
 t Take the last one
 h Output the substring itself

Question 15

JavaScript (ES6), (削除) 85 83 (削除ここまで) 80 bytes

Takes input as (string)(length).

s=>n=>[...s].map(o=m=(x,i)=>(o[k=s.substr(i,n)]=-~o[k])<m|!k[n-1]?0:m=o[O=k])&&O

Try it online!

Question 16

Java 10, (削除) 212 (削除ここまで) (削除) 211 (削除ここまで) (削除) 196 (削除ここまで) (削除) 195 (削除ここまで) (削除) 146 (削除ここまで) 143 bytes

s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;}

-15 bytes by outputting the substrings with their count, which is allowed.
-1 byte thanks to @ceilingcat.
-52 bytes thanks to @OlivierGrégoire.

Try it online.

Explanation:

s->n->{ // Method with String and Integer parameters and String return-type
 String r="", // Result-String, starting empty
 t; // Temp-String
 for(int m=0, // Largest count, starting at 0
 c, // Temp count integer
 j, // Temp index integer
 i // Index-integer `i`
 =s.length();i-->n;)
 // Loop `i` in the range (input-length, n]:
 for(c=0, // Reset the temp-count to 0
 j=-1; // And the temp-index integer to -1
 (j=s.indexOf(t=s.substring(i-n,i)
 // Set String `t` to a substring of the input-String in the range [i-n,i)
 j+1))// Set the temp-index `j` to the index of String `t` in the input-String,
 // only looking at the trailing portion after index `j+1`
 >=0;) // And continue looping as long as long as that substring is found
 if(++c>m){// If the temp-count + 1 is larger than the current largest count:
 m=c; // Set this current largest count to this temp-count + 1
 r=t;} // And the result-String to this temp-String
 return r;} // After the nested loops, return the result-String

Question 17

146 bytes:

S->n->{String R="",x;for(int M=0,I=0,m,i;I<S.length()-n;I++)for(m=0,i=-1;(i=S.indexOf(x=S.substring(I,I+n),i+1))>=0;)if(++m>M){M=m;R=x;}return R;}

(no imports, and Java 8, not 10)

Question 18

@OlivierGrégoire Thanks! That's a much better approach. I had the feeling it was possible without the Map and/or stream, but couldn't figure it out.

Question 19

I also tried with .split("\\Q"+substring+"\\E"), without success because of the overlapping test case in Mississippi

Question 20

143 bytes:

s->n->{String r="",t;for(int m=0,c,j,i=s.length();i-->n;)for(c=0,j=-1;(j=s.indexOf(t=s.substring(i-n,i),j+1))>=0;)if(++c>m){m=c;r=t;}return r;}

(just changing the iteration manner)

Question 21

Perl 6, (削除) 49 (削除ここまで) 46 bytes

-3 bytes thanks to SirBogman

{$^n;bag($^a~~m:ov/.**{$n}/X~$).max(*{*}).key}

Try it online!

Explanation

{ } # Anonymous block
 $^n; # Declare argument
 $^a~~m / / # Regex match input string
 :ov # Also return overlapping matches
 .**{$n} # Match n-character strings
 X~$ # Stringify matches
 bag( ) # Convert to Bag (multiset)
 .max(*{*}) # Find Pair string=>count with max value
 .key # Get key (string)

Question 22

46 bytes /.**{$n}/ works. Not sure why /.**$n/ doesn't.

Question 23

@SirBogman Cool, this sure would have helped me a few times. /$n/ is supposed to interpolate literal text (like \Q...\E in Perl 5), so I can see why .**$n doesn't work.

Question 24

I don't fully understand what .max(*{*}) is doing. It seems like both of those whatever parameters are Pairs?

Question 25

@SirBogman Only the first * is a whatever parameter. The second * is simply to get a full slice of the Pair, returning a list containing its only value. So *{*} is effectively the same as *.values.

Question 26

Cool. Those whatevers are very flexible.

Question 27

PHP, (削除) 133 (削除ここまで) 126 bytes

for(;$k<=strlen($i=$argv[1])-$j=$argv[2];)$s[]=substr($i,$k++,$j);$c=array_count_values($s);asort($c);echo array_key_last($c);

Try it online!

Annoyingly asort() ~~(削除) and end() (削除ここまで)~~ can't be chained into one expression.

-7 bytes thanks to @manatwork

Question 28

PHP 7.3 has array_key_last() to reduce it to 126 characters.

Question 29

I really should learn all my array functions.

Question 30

JavaScript (Node.js), 92 bytes

s=>n=>s.map(F=t=>([F[w=(w+t).slice(-n)]=-~F[w],w]),w="").sort(([a],[b])=>a-b)[s.length-1][1]

Try it online!

Takes s as a list of characters in (s)(n) syntax.

Question 31

=>a-b)[s.length-1] to =>b-a)[0] to sort in reverse order and save some bytes.

Question 32

@KevinCruijssen That could fail if there are no repeating substrings. Try f([..."123456"])(2).

Question 33

Ah, you're indeed right. My bad.

Question 34

C# (Visual C# Interactive Compiler), (削除) 97 (削除ここまで), (削除) 96 (削除ここまで), 92 bytes

s=>n=>s.Skip(n-1).Select((_,i)=>s.Substring(i,n)).GroupBy(x=>x).OrderBy(g=>g.Count()).Last()

Try it online!

Explanation:

s.Skip(n-1).Select((_,i)=>s.Substring(i,n)) //select every substring of size n
 .GroupBy(x=>x) //group by value
 .OrderBy(g=>g.Count()) //order by increasing recurrence
 .Last() //return the last one

Question 35

-1 byte by using a currying lambda (s,n)=> to s=>n=>: Try it online.

Question 36

I'm pretty sure returning a Grouping is not acceptable.

Question 37

@mypronounismonicareinstate check the comments section

Question 38

Charcoal, 28 bytes

NθF⊕−Lηθ⊞υ✂ηι+θιUMυ⟦Noυιι⟧⊟⌈υ

Try it online! Link is to verbose version of code. Outputs the lexicographically largest substring with the highest count. Explanation:

Nθ

Input the desired substring length.

F⊕−Lηθ⊞υ✂ηι+θι

Extract all substrings of that length and push them to a list.

UMυ⟦Noυιι⟧

Replace each substring with a tuple of its count and the substring..

⊟⌈υ

Print the substring with the highest count.

Question 39

Japt `-g`, 8 bytes

ãV ü ñÊÌ

Try it

Question 40

Is there any specific reason you use your own interpreter instead of the ethproductions one or the TIO one?

Question 41

@S.S.Anne I don't know about Shaggy, but I use it because it has a lot of features that make golfing in Japt a lot easier, like the autogolf button

Question 42

Wolfram Language (Mathematica), 30 bytes

#&@@Commonest[##~Partition~1]&

Takes an array of characters, returns an array of characters.

Try it online!

Wolfram Language (Mathematica), 36 bytes

#&@@Commonest[##~StringPartition~1]&

Takes a string, returns a string.

Try it online!

Question 43

Zsh, 89 bytes

local -A m
repeat $#1 ((++m[${1:$[i++]:2ドル}]))
for k v (${(kv)m})((v>M))&&r=$k&&M=$v
<<<$r

Try it online!

local -A m # associative array, "local" is shorter than "typeset"
repeat $#1 { # looping for the length of the string will cause the
 # program to use shorter, invalid strings as
 # the bash-style slice gets cut short
 ((++m[${1:$[i++]:2ドル}])) # increment the map at the key with substring
} # starting at $[i++] with length 2ドル
for k v (${(kv)m}) { # loop over the associative array's keys/values
 ((v>M)) && r=$k && M=$v # find the max value M, save the key as r.
} # looping forward ensures we avoid our invalid subtrings
<<<$r # print $r

Question 44

C++ (clang), ^{(削除) 201 (削除ここまで) \$\cdots\$ (削除) 151 (削除ここまで)} 150 bytes

#import<map>
using S=std::string;S f(S s,int n){std::map<S,int>m;S t,r;for(int i=0,x=0;i<=s.size()-n;r=x<++m[t=s.substr(i++,n)]?x=m[t],t:r);return r;}

Try it online!

Ungolfed

#include<map>
#include<string>
std::string f(const std::string& s,int n) {
 std::map<std::string,int> m;
 for(int i=0;i<=s.size()-n;++i) {
 m[s.substr(i,n)]++;
 }
 int x=0;
 std::string r;
 for(auto a:m) {
 if(x<a.second) {
 x=a.second;
 r=a.first;
 }
 }
 return r;
}

Question 45

Excel (Ver. 1911), 60 Bytes

B2 'Input: String
B3 'Input: Sub-string Length
C2 =-COUNTIF(D2#,D2#)
D2 =MID(B2,SEQUENCE(LEN(B2)),B3)
E2 =SORT(C2#:D2)
F2 'Output

Test Sample

Note: Newlines do exist in cells, but default formatting doesn't show them. Also only 10 of the 105 rows are shown to keep the image a decent size.

Test sample

Question 46

Jelly, 6 bytes

I have a feeling there may be an elusive 5-byter out there.

ṡċ@Þ`Ṫ

Try it online!

Question 47

PowerShell, 72 bytes

param($s,$n)(0..($s.Length-$n)|%{$s|% s*g $_ $n}|group|sort C*|% N*)[-1]

Try it online!

Unrolled:

param($s,$n)
(
 0..($s.Length-$n)|%{
 $s|% substring $_ $n
 }|group|sort Count|% Name
)[-1]

Question 48

Ruby, 58 bytes

->s,n{a=(0..s.size).map{|i|s[i,n]};a.max_by{|e|a.count e}}

Try it online!

Question 49

Perl 5 `-lp`, 47 bytes

++$k{$_}>$k{$\}&&($\=$_)for<>=~/(?=(.{$_}))/g}{

Try it online!

Question 50

Japt `-h`, 8 bytes

ãV
ñ@è¶X

Try it

Question 51

Go, 131 bytes

func f(s string,l int)(p string){x,i:=0,0
m:=map[string]int{}
for;i<=len(s)-l;i++{t:=s[i:i+l]
m[t]++
if m[t]>x{x=m[t]
p=t}}
return}

Takes as input a string and an int representing the string and the length of the substring respectively.

Some newlines can be replaced by semicolons if people prefer one-liners.

Try it online!

RGS RGS 14.2k1 gold badge61 silver badges117 bronze badges · Accepted Answer · 2020-02-18 14:32:01Z

Python 3.8, (削除) 75 (削除ここまで) (削除) 68 (削除ここまで) (削除) 97 (削除ここまで) 63 bytes

lambda s,n:max(l:=[s[j:j+n]for j in range(len(s))],key=l.count)

You can try it online! Increased byte count because, as @xnor kindly pointed out, Python's str.count doesn't count overlapping substrings... But then @xnor's reformulation of what I was doing allowed to slice a third of the bytes!

How:

l:=[s[j:j+n]for j in range(len(s))]

This creates a list of all the substrings in s of size n, plus the suffixes of size n-1, n-2, ..., 1. Then we find the max on that list, with the numerical value being used to sort given by

l.count

That means that if a and b are from l, a > b if a shows up more times in l than b. This means the shorter suffixes are never the result of the max because they only show up once, and even if the correct-sized substrings only show up once, they are first in l so they come out instead of the short suffixes. This allows me to save some bytes in the range used, given that I don't have to prevent i+n from being larger than len(s).

~~(削除) # Python, ~~83 (削除ここまで)~~ 64 bytes~~

lambda s,n:max((s[i:i+n]for i in range(len(s)-n+1)),key=s.count)

Thanks to @mypetlion I saved a LOT of bytes :D

You can try it online with my very own incredible test case!

Replace lambda t:s.count(t) with s.count to save 12 bytes.
It looks like this unfortunately doesn't get the "Mississippi", 4 test case right, due to how Python counts substrings.

The most common substring

Inputs

Output

Win Criteria

Examples

Some more examples, all using the same input string*

21 Answers 21

Python 3.8, (削除) 75 (削除ここまで) (削除) 68 (削除ここまで) (削除) 97 (削除ここまで) 63 bytes

How:

05AB1E, 5 bytes

Brachylog, 8 bytes

Explanation

JavaScript (ES6), (削除) 85 83 (削除ここまで) 80 bytes

Java 10, (削除) 212 (削除ここまで) (削除) 211 (削除ここまで) (削除) 196 (削除ここまで) (削除) 195 (削除ここまで) (削除) 146 (削除ここまで) 143 bytes

Perl 6, (削除) 49 (削除ここまで) 46 bytes

Explanation

PHP, (削除) 133 (削除ここまで) 126 bytes

JavaScript (Node.js), 92 bytes

C# (Visual C# Interactive Compiler), (削除) 97 (削除ここまで), (削除) 96 (削除ここまで), 92 bytes

Charcoal, 28 bytes

Japt -g, 8 bytes

Wolfram Language (Mathematica), 30 bytes

Wolfram Language (Mathematica), 36 bytes

Zsh, 89 bytes

C++ (clang), (削除) 201 (削除ここまで) \$\cdots\$ (削除) 151 (削除ここまで) 150 bytes

Ungolfed

Excel (Ver. 1911), 60 Bytes

Test Sample

Jelly, 6 bytes

PowerShell, 72 bytes

Ruby, 58 bytes

Perl 5 -lp, 47 bytes

Japt -h, 8 bytes

Go, 131 bytes

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Some more examples, all using the same input string^*

Japt `-g`, 8 bytes

C++ (clang), ^{(削除) 201 (削除ここまで) \$\cdots\$ (削除) 151 (削除ここまで)} 150 bytes

Perl 5 `-lp`, 47 bytes

Japt `-h`, 8 bytes