Introduction
Some time ago, I needed to write a function that splits a string in a specific way. At first glance the task looks trivial, but it was not so easy, especially when I wanted to reduce the code size.
Challenge
As an example, we have the following input string (it can be arbitrary):
Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia ...
We need to split it into elements (groups of adjacent words) using the following rules (A-E):
"Lorem ipsum dolor", // A: take three words if each has <6 letters
"sit amet", // B: take two words if both have <6 letters and the third word has >=6 letters
"consectetur", // C: take one word with >=6 letters if the next word has >=6 letters
"adipiscing elit", // D: take two words when the first has >=6 and the second <6 letters
"sed doeiusmod", // E: take two words when the first has <6 and the second >=6 letters
"tempor", // rule C
"incididunt ut", // rule D
"Duis aute irure", // rule A
"dolor in", // rule B
"reprehenderit in", // rule D
"esse cillum", // rule E
"dolor eu fugia" // rule A
...
As you can see, the input string (alphanumeric characters only) is divided into elements (substrings); each element holds at least one and at most three words. There are five rules (A-E) for dividing the string. If you take words one by one from the beginning, exactly one of these rules applies at each step. Once you have found the rule, you know how many words to move from input to output (1, 2 or 3 words); after that, start again and find the next rule for the remaining input words.
Boundary conditions: if the last word or words do not match any rule, just add them as the last element (but two long words can never be in one element).
In the output we should get the divided string, with elements separated by a newline or |
(you don't need to use double quotes to wrap each element)
Example Input and Output
Here is an example input (only alphanumeric ASCII characters):
Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia
and its output:
Lorem ipsum dolor|sit amet|consectetur|adipiscing elit|sed doeiusmod|tempor|incididunt ut|Duis aute irure|dolor in|reprehenderit in|esse cillum|dolor eu fugia
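For readers who, like some commenters, find the rules hard to follow, here is a minimal ungolfed sketch in Python 3 of how rules A-E and the boundary condition can be read. The function name `split_by_rules` and the `limit` parameter are made up for this illustration; this is one interpretation of the challenge text, not the challenge author's reference code.

```python
def split_by_rules(text, limit=6):
    """Group words following rules A-E; a word is 'short' if it has < limit letters."""
    words = text.split()
    groups = []
    i = 0
    while i < len(words):
        # Look at up to three upcoming words and flag the short ones.
        short = [len(w) < limit for w in words[i:i + 3]]
        if len(short) == 3 and all(short):
            take = 3                    # rule A: three short words
        elif len(short) >= 2 and not short[0] and not short[1]:
            take = 1                    # rule C: long word followed by a long word
        else:
            take = min(2, len(short))   # rules B, D, E, or leftover words at the end
        groups.append(" ".join(words[i:i + take]))
        i += take
    return groups


example = ("Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod "
           "tempor incididunt ut Duis aute irure dolor in reprehenderit in esse "
           "cillum dolor eu fugia")
print("|".join(split_by_rules(example)))
```

With the example input this prints the same |-separated line as shown above.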
- I'm not understanding what's going on with the example. Could you explain how the splitting rules work in general? – xnor, Sep 26, 2019 at 7:08
- @xnor while (word length < 6) {join next word (max 3 words per groups)} else if (word length >= 6) { if (next one length is < 6) {return the pair} else { return group as is } – jonatjano, Sep 26, 2019 at 7:11
- @KamilKiełczewski "you have 6 rules (A-B)". Don't you mean 5 rules (A-E)?.. – Kevin Cruijssen, Sep 26, 2019 at 7:11
- Can we take the input as a list of words? – Kevin Cruijssen, Sep 26, 2019 at 7:12
- I'm still pretty confused, but maybe it's just me given the answers and reopen votes. – xnor, Sep 26, 2019 at 17:41
10 Answers
Python 2, 85 bytes (previously 92, 90, 88, 87, 86)
r=''
n=0
for w in input().split():L=w[:5]<w;x=n+L<3;r+='| '[x]+w;n=n*x-~L
print r[1:]
-1 byte, thanks to Kevin Cruijssen
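The golfed loop is terse, so here is an ungolfed Python 3 reading of it, offered as a sketch rather than the answer author's own explanation. As far as I can tell, `L=w[:5]<w` is true exactly when the word has more than 5 letters, and `n` acts as a running "weight" of the current group where a short word costs 1 and a long word costs 2; a word joins the current group only while the total stays at most 3. The names `split_weighted`, `cost` and `weight` are invented for this illustration.

```python
def split_weighted(text):
    """Greedy grouping: short words weigh 1, long words weigh 2, budget is 3."""
    groups = []
    weight = 0  # weight of the group currently being built (n in the golfed code)
    for word in text.split():
        cost = 2 if len(word) > 5 else 1   # L+1 in the golfed code
        if groups and weight + cost <= 3:
            groups[-1].append(word)        # same group: the ' ' separator case
            weight += cost
        else:
            groups.append([word])          # new group: the '|' separator case
            weight = cost
    return [" ".join(g) for g in groups]


print("|".join(split_weighted("sit amet consectetur adipiscing elit")))
```

The weights explain all five rules at once: three short words cost 3, a short and a long word cost 3, and two long words would cost 4, so they can never share a group.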
Jelly, 23 bytes
ḲμẈṁ3<6Ḅ+8:5⁸sḢKṄȧƲẎμ1¿
A full-program printing the result.
How?
Given the lengths of three words (a, b, and c) we can write the following mapping for how many words we should take:
a<6?  b<6?  c<6?  words
1     1     1     3
1     1     0     2
1     0     1     2
1     0     0     2
0     1     1     2
0     1     0     2
0     0     1     1
0     0     0     1
Treating the comparisons as a single number in binary this is:
bin([a<6,b<6,c<6]): 7 6 5 4 3 2 1 0
words: 3 2 2 2 2 2 1 1
So we can map like so:
bin([a<6,b<6,c<6]): 7 6 5 4 3 2 1 0
add eight: 15 14 13 12 11 10 9 8
divide by five: 3 2 2 2 2 2 1 1
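The arithmetic trick can be checked mechanically. The following Python 3 sketch compares the "add eight, integer-divide by five" formula against a direct reading of rules A-E for all eight flag combinations; the function names are hypothetical and only serve this check.

```python
from itertools import product


def words_by_rules(a, b, c):
    """a, b, c are 1 when the corresponding word is short (< 6 letters), else 0."""
    if a and b and c:
        return 3          # rule A
    if not a and not b:
        return 1          # rule C
    return 2              # rules B, D, E


def words_by_formula(a, b, c):
    n = 4 * a + 2 * b + c  # the three flags read as a binary number
    return (n + 8) // 5


for a, b, c in product([1, 0], repeat=3):
    assert words_by_rules(a, b, c) == words_by_formula(a, b, c)
    print(a, b, c, words_by_formula(a, b, c))
```

The loop prints the same eight rows as the mapping table above, from 1 1 1 down to 0 0 0.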
Note that when fewer than three words remain we want to take all of them, unless there are two left and they are both of length six or more, when case C says to take one word. To make this the case we repeat what we have up to length three (with ṁ3 instead of ḣ3) and use that.
a<6?  b<6?          moulded  bin  + 8  div 5 (= words)
1                   111      7    15   3 (i.e. all 1)
0                   000      0    8    1 (i.e. all 1)
1     1             111      7    15   3 (i.e. all 2)
1     0             101      5    13   2 (i.e. all 2)
0     1             010      2    10   2 (i.e. all 2)
0     0 (i.e. C)    000      0    8    1 (i.e. just 1)
The code then works as follows.
ḲμẈṁ3<6Ḅ+8:5⁸sḢKṄȧƲẎμ1¿ - Main Link: list of characters
Ḳ                       - split at spaces
                      ¿ - while...
                     1  - ...condition: identity (i.e. while there are still words)
 μ                  μ   - ...do: the monadic chain:
  Ẉ                     -   length of each
    3                   -   literal three
   ṁ                    -   mould like ([1,2,3])
      6                 -   literal six
     <                  -   less than? (vectorises)
       Ḅ                -   from binary to integer
         8              -   literal eight
        +               -   add
           5            -   literal five
          :             -   integer divide
            ⁸           -   chain's left argument
             s          -   split into chunks (of that length)
                  Ʋ     -   last four links as a monad (f(x)):
              Ḣ         -     head (alters x too)
               K        -     join with spaces
                Ṅ       -     print & yield
                 ȧ      -     logical AND (with altered x)
                   Ẏ    -   tighten (back to a list of words)
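For those who do not read Jelly, the loop can be approximated in Python 3 as below. This is only a sketch of the idea: `mould3` is a made-up helper that imitates what ṁ3 does to a flat list (cycle or truncate it to length three), and the slicing replaces the head/tighten steps.

```python
from itertools import cycle, islice


def mould3(flags):
    """Repeat or truncate a non-empty list to exactly three items, like Jelly's ṁ3."""
    return list(islice(cycle(flags), 3))


def split_moulded(text):
    words = text.split()
    groups = []
    while words:                                   # while there are still words
        flags = [int(len(w) < 6) for w in words]   # 1 for short words, 0 for long
        a, b, c = mould3(flags)
        take = (4 * a + 2 * b + c + 8) // 5        # binary value, add 8, divide by 5
        groups.append(" ".join(words[:take]))      # slicing caps take at what is left
        words = words[take:]
    return groups


print(split_moulded("consectetur adipiscing"))
```

Moulding is what handles the tail of the input: two long words become 000 and yield one word, while any other leftover of one or two words yields a count that covers everything that remains.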
- wow - currently you also put on the leader's yellow jersey 🔥🔥🔥 - btw: nice explanation :) – Kamil Kiełczewski, Oct 1, 2019 at 15:02
- Theoretically I am the only one wearing it (since I pipped Nick to the post of reaching 23) but I'll very happily share it! – Jonathan Allan, Oct 1, 2019 at 15:37
- I didn't know that - however you both have different answers of the same size - so we have a draw here – Kamil Kiełczewski, Oct 1, 2019 at 15:39
- Yeah, same byte count, different programs, but a very similar method (now). The timing information is in the revision histories if you're interested, but do note that when looking at revisions of posts, updates within five minutes of another are bundled into one revision stamped at the original time (this feature can cause one to not really be able to tell). For what it's worth I imagine we came up with the similar methods independently too. – Jonathan Allan, Oct 1, 2019 at 15:45
Perl 5, 47 bytes
It seems the regex can be shortened to this equivalent one:
s/(\w{1,5} ){3}|\w{6,} (?=\w{6})|\w+ \w+ /$&
/g
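To see the three alternatives at work outside Perl, here is a rough Python 3 sketch using the same pattern with `re.sub`. Two things here are my own assumptions and not part of the answer: a single space is appended to the input so that the final group can match (each alternative ends in a literal space), and the trailing spaces are stripped afterwards. The name `split_regex` is hypothetical.

```python
import re

# Same alternatives as the Perl substitution:
#   three short words | a long word followed by a long word | any two words
PATTERN = re.compile(r"(\w{1,5} ){3}|\w{6,} (?=\w{6})|\w+ \w+ ")


def split_regex(text):
    # Append a space (assumption) so the last words can satisfy the pattern,
    # then put a newline after every match, as the Perl code does with $&.
    marked = PATTERN.sub(r"\g<0>\n", text + " ")
    return [line.strip() for line in marked.split("\n") if line.strip()]


print(split_regex("sed doeiusmod tempor incididunt ut"))
```

The order of the alternatives matters: the three-short-words case is tried first, then the long-long case (where the lookahead keeps the second long word out of the match), and only then the generic two-word case.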
Previous regex
Perl 5, 86 bytes
s/(\w{1,5} ){3}|((\w{1,5} ){2}|\w{6,} )(?=\w{6})|\w{6,} \w{1,5} |\w{1,5} \w{6,} /$&
/g
Not valid:
Perl 5 (-M5.01 -lnF/(?:\S{1,5}\K\s+){3}|\S{6,}\K\s+(?=\S{6})|\S+\s+\S+\K\s+/), 9 bytes
say for@F
- Nice answer. I was about to try and make a Retina answer for this challenge. A trivial port would be 82 bytes. – Kevin Cruijssen, Sep 26, 2019 at 8:27
- however not perfect because the trailing space is not removed – Nahuel Fouilleul, Sep 26, 2019 at 8:29
- removing the trailing space 51 bytes – Nahuel Fouilleul, Sep 26, 2019 at 8:46
- ... which is a made-up language designed specifically for the challenge, meaning that version would be invalid. – The Fifth Marshal, Sep 26, 2019 at 11:47
- @pppery Good point, didn't think about that. I have used large flags before, like this .NET C# answer with two flags, but those are used as static imports. In this case it would indeed be a 'made-up' language (or should I say flag) only designed for this challenge. – Kevin Cruijssen, Sep 26, 2019 at 15:18
AWK, 65 bytes (thanks manatwork!)
BEGIN{RS=FS}{printf(x=n+(L=length($1)>5)<3)?FS$1:"|"$1;n=n*x+L+1}
I didn't even know AWK had a ternary operator
AWK, 72 bytes (previously 97, 79) (thanks manatwork!)
BEGIN{a[0]="|";a[1]=RS=FS}{printf a[x=n+(L=length($1)>5)<3]$1;n=n*x+L+1}
I shamelessly stole the algorithm from @TFeld's Python2 solution.
- Compacted it a bit, but only tested with that single input. See if you can get something useful from it: Try it online!. – manatwork, Oct 1, 2019 at 15:02
- Oh, 2 more things: FS's default value is a single space, so you can use it to initialize RS; the challenge says the input will be "only alphanumeric characters", so you can use printf instead of print, making the assignment to ORS unnecessary. – manatwork, Oct 1, 2019 at 15:30
- Ok, finally spent some time to give a closer look at that array a. A ternary operator would be shorter: Try it online!. – manatwork, Oct 1, 2019 at 15:44
- @manatwork I almost feel like you should submit that...I didn't even know AWK had a ternary operator :) – Daniel LaVine, Oct 1, 2019 at 15:51
05AB1E, 28 bytes
0U#v„
yg6@DX+3‹DŠX*+>Uèy}J¦
Port of @TFeld's Python answer, so make sure to upvote him!
Explanation:
0U # Set variable `X` to 0 (it's 1 by default)
# # Split the (implicit) input-string on spaces
v # Loop over each word `y`:
yg # Get the length of the word
6@ # And check that it's >= 6
D # Duplicate this
X+ # Add variable `X` to it
3‹ # And check that it's smaller than 3
DŠ # Duplicate this as well, and triple-swap (a,b,c to c,a,b)
X* # Multiply the <3 check with variable `X`
+ # Add it to the length >=6 check
>U # Increase it by 1, and set it as the new variable `X`
„\n è # Index the <3 check into the string "\n "
y # And push the current word
}J # After the loop: join the entire stack together
¦ # And remove the leading space
# (after which the top of the stack is output implicitly as result)
- 🚀 your answer is in the top – Kamil Kiełczewski, Oct 2, 2019 at 6:08
Clean, 206 bytes
import StdEnv,Text
$s=join"|"(map(join" "o map fst)(?[(w,size w<6)\\w<-split" "s]))
?l=case l of[a,b,c:t]|all(snd)[a,b,c]=[[a,b,c]: ?t];[a:t=:[b:_]]|not(snd a||snd b)=[[a]: ?t];[]=[];l=[take 2l: ?(drop 2l)]
- can you explain your solution? – Kamil Kiełczewski, Sep 27, 2019 at 21:42
- @KamilKiełczewski Eh, I'm a bit tired right now. Might explain tomorrow. – Erik the Outgolfer, Sep 27, 2019 at 21:43
- 🚀 your answer is in the top – Kamil Kiełczewski, Oct 2, 2019 at 6:09
Jelly, 23 bytes (previously 27)
ḲμẈ<6ḣ3Ḅ+8:5Ṭk⁸KṄṛɗ/μÐL
A full program that takes the input string as its argument and prints newline-separated groups of words. It takes advantage of the fact that if the lengths of the current first three words are checked for <6 and the result is treated as a binary number, the number of words needed will be 1,1,2,2,2,2,2,3 for numbers from 0 to 7 respectively.
Explanation
Ḳ                       | Split at spaces
 μ                  μÐL | Repeat the following until no new results:
  Ẉ                     | - Lengths of lists (i.e. words)
   <6                   | - Less than 6
     ḣ3                 | - First three
       Ḅ                | - Convert from binary to integer
        +8              | - Add 8
          :5            | - Integer divide by 5
            Ṭ           | - Convert from index to boolean list
             k⁸         | - Split input to this loop iteration at that point
                  ɗ/    | - Reduce using following as a dyad:
               K        |   - Join with spaces
                Ṅ       |   - Output with trailing newline
                 ṛ      |   - Right argument (i.e. rest of list)
- did you take the idea from Jonathan Allan's answer in this version of your answer? – Kamil Kiełczewski, Oct 1, 2019 at 16:00
- @KamilKiełczewski only really the idea of outputting the interim results during the loop rather than returning them from the link as a list. – Nick Kennedy, Oct 1, 2019 at 16:11
- 🚀 your answer is in the top – Kamil Kiełczewski, Oct 2, 2019 at 6:08
Charcoal, 49 bytes
≔⮌⪪S θWθ«≔⌊⟦Lθ⊕ΣE2›6L§θ−κ2⟧ι⪫E−ι∧=ι3‹5L§θ±3⊟θ ¿θ|
Try it online! Link is to verbose version of code. Explanation:
≔⮌⪪S θ
Split the input into words and reverse it so that we can use Pop to remove words in order.
Wθ«
Repeat while there are still words left.
≔⌊⟦Lθ⊕ΣE2›6L§θ−κ2⟧ι
Estimate the number of words needed as equal to one more than the number of the first two words that are less than 6 letters, but no more than the number of words left.
⪫E−ι∧=ι3‹5L§θ±3⊟θ
Adjust the number of words if there are three and the third word is not less than 6 letters, then remove that many words and print them separated with spaces.
¿θ|
Output a separator if there are still more words left.
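The steps above translate fairly directly into Python 3. The sketch below follows the same estimate-then-adjust idea with a reversed list and `pop()`; the name `split_pop` and the `limit` parameter are invented for the illustration, and it returns a list instead of printing separators.

```python
def split_pop(text, limit=6):
    stack = text.split()[::-1]     # reversed, so pop() yields words in input order
    groups = []
    while stack:                   # repeat while there are still words left
        next_two = stack[-2:]      # the next one or two words
        # One more than the number of short words among the next two,
        # but never more than the number of words left.
        count = min(len(stack), 1 + sum(len(w) < limit for w in next_two))
        # Adjust: three words were estimated but the third one is long.
        if count == 3 and len(stack[-3]) >= limit:
            count -= 1
        groups.append(" ".join(stack.pop() for _ in range(count)))
    return groups


print("|".join(split_pop("dolor in reprehenderit in esse cillum")))
```

The estimate already separates rules C (count 1), D and E (count 2) from the two-short-words case (count 3); only that last case needs the third word to decide between rules A and B.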
- I accept this answer - however, may I know why the output is a "table" (literally with borders)? – Kamil Kiełczewski, Sep 27, 2019 at 5:06
- In J strings of different lengths must be "boxed" if you want to make a list out of them. So I take boxed strings as the input, and since the output requires grouping them, we have a list of boxes each of which contains a list of boxes. The borders you see are just the default way J displays boxed data (it's configurable). The other alternative would be to box the input within my function and return a list of boxed strings, where each string could have multiple words. This felt less consistent to me, though. – Jonah, Sep 27, 2019 at 5:16