Split string using rules

Question 1

Introduction

Some time ago, I needed to write a function that splits a string in a specific way. At first glance the task looks trivial, but it was not so easy, especially when I wanted to reduce the code size.

Challenge

As an example, we have the following input string (it can be arbitrary):

Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia ...

We need to split it into elements (groups of adjacent words) using the following rules (A-E):

 "Lorem ipsum dolor", // A: take three words if each has <6 letters
 "sit amet", // B: take two words if they have <6 letters and the third word has >=6 letters
 "consectetur", // C: take one word of >=6 letters if the next word has >=6 letters
 "adipiscing elit", // D: take two words when the first has >=6 and the second <6 letters
 "sed doeiusmod", // E: take two words when the first has <6 and the second >=6 letters
 "tempor", // rule C
 "incididunt ut", // rule D
 "Duis aute irure", // rule A
 "dolor in", // rule B
 "reprehenderit in", // rule D
 "esse cillum", // rule E
 "dolor eu fugia" // rule A
 ...

As you can see, the input string (only alphanumeric characters) is divided into elements (substrings); each element contains at least one and at most three words. There are 5 rules (A-E) for dividing the string: if you take words one by one from the beginning, exactly one of these rules applies. Once you find the rule, you know how many words to move from the input to the output (1, 2 or 3 words). After that, start again: find the next rule for the remaining input words.

Boundary conditions: if the last word or words do not match any rule, just add them as the last element (but two long words can never be in one element).

The output should be the divided string, with the elements separated by a new line or | (you don't need to wrap each element in double quotes).
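For readers who want an ungolfed reference, the rules and boundary conditions above can be sketched in Python as follows. This is only one reading of the challenge text: the function name `split_by_rules` and the `limit` parameter are hypothetical, and the leftover words at the end are handled by simply taking what remains.

```python
def split_by_rules(text, limit=6):
    """Group words following rules A-E; 'short' means fewer than `limit` letters."""
    words = text.split()
    groups = []
    i = 0
    while i < len(words):
        # Flags for up to three upcoming words: True means the word is short.
        short = [len(w) < limit for w in words[i:i + 3]]
        if len(short) == 3 and all(short):
            take = 3                      # rule A: three short words
        elif len(short) >= 2 and not short[0] and not short[1]:
            take = 1                      # rule C: two long words never share a group
        else:
            take = min(2, len(short))     # rules B, D, E, or the leftover words
        groups.append(" ".join(words[i:i + take]))
        i += take
    return "|".join(groups)
```

Looking ahead at most three words is enough because no rule depends on anything further away.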

Example Input and Output

Here is an example input (only alphanumeric ASCII characters):

Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia

and its output:

Lorem ipsum dolor|sit amet|consectetur|adipiscing elit|sed doeiusmod|tempor|incididunt ut|Duis aute irure|dolor in|reprehenderit in|esse cillum|dolor eu fugia

Question 2

I'm not understanding what's going on with the example. Could you explain how the splitting rules work in general?

Question 3

@xnor while (word length < 6) {join next word (max 3 words per groups)} else if (word length >= 6) { if (next one length is < 6) {return the pair} else { return group as is }

Question 4

@KamilKiełczewski "you have 6 rules (A-B)". Don't you mean 5 rules (A-E)?..

Question 5

Can we take the input as a list of words?

Question 6

I'm still pretty confused, but maybe it's just me given the answers and reopen votes.

Question 7

Python 2, 85 bytes (previously 92, 90, 88, 87, 86)

r=''
n=0
for w in input().split():L=w[:5]<w;x=n+L<3;r+='| '[x]+w;n=n*x-~L
print r[1:]

Try it online!

-1 byte, thanks to Kevin Cruijssen
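The golfed loop above keeps a running "weight" `n` for the current group: a short word adds 1, a long word adds 2, and a word joins the group only while `n` plus the long flag stays below 3. A possible ungolfed Python 3 reading, with hypothetical names (`split_state_machine`, `weight`), might look like this:

```python
def split_state_machine(text):
    """Ungolfed sketch of the weight-based loop (short word = 1, long word = 2)."""
    groups = []
    weight = 0                               # plays the role of n in the golfed code
    for word in text.split():
        is_long = len(word) > 5              # golfed as w[:5] < w
        if groups and weight + is_long < 3:  # golfed as x = n + L < 3
            groups[-1].append(word)
            weight += is_long + 1            # n = n*x - ~L with x == 1
        else:
            groups.append([word])
            weight = is_long + 1             # n = n*x - ~L with x == 0
    return "|".join(" ".join(group) for group in groups)
```

A group is closed as soon as its weight reaches 3, which covers three short words as well as one long plus one short word in either order.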

Question 8

Jelly, 23 bytes

ḲμẈṁ3<6Ḅ+8:5⁸sḢKṄȧƲẎμ1¿

A full-program printing the result.

Try it online!

How?

Given the lengths of three words (a, b, and c) we can write the following mapping for how many words we should take:

a<6? b<6? c<6? words
 1 1 1 3
 1 1 0 2
 1 0 1 2
 1 0 0 2
 0 1 1 2
 0 1 0 2
 0 0 1 1
 0 0 0 1

Treating the comparisons as a single number in binary this is:

bin([a<6,b<6,c<6]): 7 6 5 4 3 2 1 0
 words: 3 2 2 2 2 2 1 1

So we can map like so:

bin([a<6,b<6,c<6]): 7 6 5 4 3 2 1 0
 add eight: 15 14 13 12 11 10 9 8
 divide by five: 3 2 2 2 2 2 1 1
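The arithmetic in the two tables above can be checked with a few lines of Python. The helper name `words_to_take` is hypothetical; it only restates "read the three comparisons as binary, add eight, integer-divide by five".

```python
from itertools import product


def words_to_take(a_short, b_short, c_short):
    """Map the three '<6' flags (1 or 0) to the number of words to take."""
    code = a_short * 4 + b_short * 2 + c_short   # flags read as a binary number
    return (code + 8) // 5


# Enumerate the flag combinations from 7 (1,1,1) down to 0 (0,0,0).
counts = [words_to_take(*flags) for flags in product((1, 0), repeat=3)]
```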

Note that when fewer than three words remain we want to take all of them, unless there are two left and both are of length six or more, in which case rule C says to take one word. To achieve this we repeat what we have up to length three (with ṁ3 instead of ḣ3) and use that.

a<6?  b<6?  moulded  bin  + 8  div 5 (= words)
 1           111      7    15   3 (i.e. all 1)
 0           000      0    8    1 (i.e. all 1)
 1     1     111      7    15   3 (i.e. all 2)
 1     0     101      5    13   2 (i.e. all 2)
 0     1     010      2    10   2 (i.e. all 2)
 0     0     000      0    8    1 (i.e. just 1, rule C)
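The moulding trick can be imitated in Python by repeating the available flags cyclically up to length three. This is a sketch of the idea rather than a translation of the Jelly program; `mould3` and `split_with_mould` are hypothetical names.

```python
def mould3(items):
    """Repeat a non-empty list cyclically up to length three (like Jelly's ṁ3)."""
    return [items[i % len(items)] for i in range(3)]


def split_with_mould(text):
    words = text.split()
    groups = []
    while words:
        flags = mould3([len(w) < 6 for w in words])
        code = int("".join(str(int(f)) for f in flags), 2)  # flags as binary
        take = (code + 8) // 5
        groups.append(" ".join(words[:take]))  # slicing past the end takes all
        words = words[take:]
    return "|".join(groups)
```

Because slicing never runs past the end of the list, a count of 3 with only one or two words left simply takes everything that remains.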

The code then works as follows.

ḲμẈṁ3<6Ḅ+8:5⁸sḢKṄȧƲẎμ1¿ - Main Link: list of characters
Ḳ - split at spaces
 ¿ - while...
 1 - ...condition: identity (i.e. while there are still words)
 μ μ - ...do: the monadic chain:
 Ẉ - length of each
 3 - literal three
 ṁ - mould like ([1,2,3])
 6 - literal six
 < - less than? (vectorises)
 Ḅ - from binary to integer
 8 - literal eight
 + - add
 5 - literal five
 : - integer divide
 ⁸ - chain's left argument
 s - split into chunks (of that length)
 Ʋ - last four links as a monad (f(x)):
 Ḣ - head (alters x too)
 K - join with spaces
 Ṅ - print & yield
 ȧ - logical AND (with altered x)
 Ẏ - tighten (back to a list of words)

Question 9

wow - currently you also put on the leader's yellow jersey 🔥🔥🔥 - btw: nice explanation :)

Question 10

Theoretically I am the only one wearing it (since I pipped Nick to the post of reaching 23) but I'll very happily share it!

Question 11

I didn't know that. However, you both have different answers of the same size, so we have a draw here.

Question 12

Yeah, same byte count, different programs, but a very similar method (now). The timing information is in the revision histories if you're interested, but note that, when looking at revisions of posts, updates made within five minutes of one another are bundled into one revision stamped at the original time (this feature can make it impossible to really tell). For what it's worth, I imagine we came up with the similar methods independently too.

Question 13

Perl 5, 47 bytes

It seems the regex can be shortened to this equivalent one:

s/(\w{1,5} ){3}|\w{6,} (?=\w{6})|\w+ \w+ /$&
/g
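The same three-alternative pattern can be tried with Python's `re` module. The sketch below is not the Perl program: it uses `|` as the separator, and because the pattern needs a space after every matched group it pads the input with one trailing space, which is an assumption of this sketch. The name `split_with_regex` is hypothetical.

```python
import re

# Three short words | one long word followed by another long word | any two words.
PATTERN = re.compile(r"(?:\w{1,5} ){3}|\w{6,} (?=\w{6})|\w+ \w+ ")


def split_with_regex(text):
    padded = text.strip() + " "            # assumption: pad so the last group can match
    # Swap the space that ends each matched group for the separator.
    marked = PATTERN.sub(lambda m: m.group(0)[:-1] + "|", padded)
    # A single leftover word stays unmatched; drop the padding or the final separator.
    return marked.strip().rstrip("|")
```

The order of the alternatives matters: the catch-all `\w+ \w+ ` must come last so that it only applies when neither the three-short-words case nor the long-before-long case matches.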

Try it online!

Previous regex

Perl 5, 86 bytes

s/(\w{1,5} ){3}|((\w{1,5} ){2}|\w{6,} )(?=\w{6})|\w{6,} \w{1,5} |\w{1,5} \w{6,} /$&
/g

Try it online!

Not valid:

Perl 5 (`-M5.01` `-lnF/(?:\S{1,5}\K\s+){3}|\S{6,}\K\s+(?=\S{6})|\S+\s+\S+\K\s+/`), 9 bytes

say for@F

Try it online!

Question 14

Nice answer. I was about to try and make a Retina answer for this challenge. A trivial port would be 82 bytes.

Question 15

however not perfect because the trailing space is not removed

Question 16

removing the trailing space 51 bytes

Question 17

... which is a made-up language designed specifically for the challenge, meaning that version would be invalid.

Question 18

@pppery Good point, I didn't think about that. I have used large flags before, like this .NET C# answer with two flags, but those are used as static imports. In this case it would indeed be a 'made-up' language (or should I say flag) only designed for this challenge.

Question 19

AWK, 65 bytes (thanks manatwork!)

BEGIN{RS=FS}{printf(x=n+(L=length($1)>5)<3)?FS$1:"|"$1;n=n*x+L+1}

Try it online!

I didn't even know AWK had a ternary operator

AWK, 72 bytes (previously 97, 79) (thanks manatwork!)

BEGIN{a[0]="|";a[1]=RS=FS}{printf a[x=n+(L=length($1)>5)<3]$1;n=n*x+L+1}

Try it online!

I shamelessly stole the algorithm from @TFeld's Python2 solution.

Question 20

Compacted it a bit, but only tested with that single input. See if you can get something useful from it: Try it online!.

Question 21

Oh, 2 more things: FS's default value is a single space, so you can use it to initialize RS; the challenge says the input will be "only alphanumeric characters", so you can use printf instead of print, making the assignment to ORS unnecessary.

Question 22

Ok, finally spent some time to give a closer look at that array a. A ternary operator would be shorter: Try it online!.

Question 23

@manatwork I almost feel like you should submit that...I didn't even know AWK had a ternary operator :)

Question 24

05AB1E, 28 bytes

0U#v„
 yg6@DX+3‹DŠX*+>Uèy}J¦

Port of @TFeld's Python answer, so make sure to upvote him!

Try it online.

Explanation:

0U # Set variable `X` to 0 (it's 1 by default)
# # Split the (implicit) input-string on spaces
 v # Loop over each word `y`:
 yg # Get the length of the word
 6@ # And check that it's >= 6
 D # Duplicate this
 X+ # Add variable `X` to it
 3‹ # And check that it's smaller than 3
 DŠ # Duplicate this as well, and triple-swap (a,b,c to c,a,b)
 X* # Multiply the <3 check with variable `X`
 + # Add it to the length >=6 check
 >U # Increase it by 1, and set it as the new variable `X`
 „\n è # Index the <3 check into the string "\n "
 y # And push the current word
 }J # After the loop: join the entire stack together
 ¦ # And remove the leading space
 # (after which the top of the stack is output implicitly as result)

Question 25

🚀 your answer is in the top

Question 26

Clean, 206 bytes

import StdEnv,Text
$s=join"|"(map(join" "o map fst)(?[(w,size w<6)\\w<-split" "s]))
?l=case l of[a,b,c:t]|all(snd)[a,b,c]=[[a,b,c]: ?t];[a:t=:[b:_]]|not(snd a||snd b)=[[a]: ?t];[]=[];l=[take 2l: ?(drop 2l)]

Try it online!

Question 27

Jelly, 24 bytes

Ḳμḣ3Ẉ5<+2/Ṁ3_8sḢKṄṛƲẎμ1¿

Try it online!

-1 thanks to Jonathan Allan.

Question 28

can you explain your solution?

Question 29

@KamilKiełczewski Eh, I'm a bit tired right now. Might explain tomorrow.

Question 30

🚀 your answer is in the top

Question 31

Jelly, 23 bytes (previously 27)

ḲμẈ<6ḣ3Ḅ+8:5Ṭk⁸KṄṛɗ/μÐL

Try it online!

A full program that takes as its argument the input string and prints newline-separated groups of words. Takes advantage of the fact that if the lengths of the current first three words are each checked for being <6 and the result is then treated as a binary number, the number of words needed will be 1, 1, 2, 2, 2, 2, 2, 3 for numbers from 0 to 7 respectively.

Explanation

Ḳ | Split at spaces
 μ μÐL | Repeat the following until no new results:
 Ẉ | - Lengths of lists (i.e. words)
 <6 | - Less than 6
 ḣ3 | - First three
 Ḅ | - Convert from binary to integer
 +8 | - Add 8
 :5 | - Integer divide by 5
 Ṭ | - Convert from index to boolean list
 k⁸ | - Split input to this loop iteration at that point
 ɗ/ | - Reduce using following as a dyad:
 K | - Join with spaces
 Ṅ | - Output with trailing newline
 ṛ | - Right argument (i.e. rest of list)

Question 32

Did you take the idea from Jonathan Allan's answer in this version of your answer?

Question 33

@KamilKiełczewski only really the idea of outputting the interim results during the loop rather than returning them from the link as a list.

Question 34

🚀 your answer is in the top

Question 35

Charcoal, 49 bytes

≔⮌⪪S θWθ«≔⌊⟦Lθ⊕ΣE2›6L§θ−κ2⟧ι⪫E−ι∧=ι3‹5L§θ±3⊟θ ¿θ|

Try it online! Link is to verbose version of code. Explanation:

≔⮌⪪S θ

Split the input into words and reverse it so that we can use Pop to remove words in order.

Wθ«

Repeat while there are still words left.

≔⌊⟦Lθ⊕ΣE2›6L§θ−κ2⟧ι

Estimate the number of words needed as equal to one more than the number of the first two words that are less than 6 letters, but no more than the number of words left.

⪫E−ι∧=ι3‹5L§θ±3⊟θ

Adjust the number of words if there are three and the third word is not less than 6 letters, then remove that many words and print them separated with spaces.

¿θ|

Output a separator if there are still more words left.
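The estimate-then-adjust idea from the explanation above can be sketched in Python as follows. This is an illustration of the described steps, not a translation of the Charcoal code (which works on a reversed list and pops from it); `take_count` and `split_estimate_adjust` are hypothetical names.

```python
def take_count(words):
    """Number of words to take from the front of a non-empty word list."""
    # Estimate: one more than the number of short words among the first two.
    estimate = 1 + sum(len(w) < 6 for w in words[:2])
    count = min(len(words), estimate)
    # Adjust: three words are only allowed if the third one is short as well.
    if count == 3 and len(words[2]) >= 6:
        count -= 1
    return count


def split_estimate_adjust(text):
    words = text.split()
    groups = []
    while words:
        n = take_count(words)
        groups.append(" ".join(words[:n]))
        del words[:n]
    return "|".join(groups)
```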

Question 36

J, 53 bytes (previously 55)

</.~[:+/\@}:[:(}:@],[((3<]),,{~4>])_1{+)/0|.@,1+5<#&>

Try it online!

Question 37

I accept this answer. However, may I ask why the output is a "table" (literally with borders)?

Question 38

In J strings of different lengths must be "boxed" if you want to make a list out of them. So I take boxed strings as the input, and since the output requires grouping them, we have a list of boxes each of which contains a list of boxes. The borders you see are just the default way J displays boxed data (it's configurable). The other alternative would be to box the input within my function and return a list of boxed strings, where each string could have multiple words. This felt less consistent to me, though.

Accepted answer: Python 2, 85 bytes, by TFeld (2019-09-26 07:23:35Z)

