I wrote this function in scala that uses tail recursion to remove nested blocks from a text.
Usage examples:
removeBlocks("123{456}789", "{", "}") yields "123789"
removeBlocks("123{a{b{c}b}a}789", "{", "}") yields "123789"
removeBlocks("123<div>456</div>789", "<div>", "</div>") yields "123789"
removeBlocks("123<<<div><<<</div>>>789", "<div>", "</div>") yields "123<<>>789"
This is one of my first attempts in a functional language, and I'm deeply dissatistied with my code. It's too long, too nested, and could probably be made more readable. I'm coming from C# and hope to be forgiven for this bad functional code. I'll be happy to hear how it can be improved.
def removeBlocks(text: String, startMarker: String, endMarker: String) = {
val startMarkerSize = startMarker.size
val endMarkerSize = endMarker.size
val startMarkerHead = startMarker.head
val endMarkerHead = endMarker.head
def removeBlocksAcc(s: String, acc: String, nestingLevel: Int): String =
if (s.isEmpty) acc
else {
val (startMarkerCandidate, tail1) = s.splitAt(startMarkerSize)
if (startMarkerCandidate == startMarker)
removeBlocksAcc(tail1, acc, nestingLevel + 1)
else {
val (endMarkerCandidate, tail2) = s.splitAt(endMarkerSize)
if (endMarkerCandidate == endMarker)
removeBlocksAcc(tail2, acc, math.max(nestingLevel - 1, 0))
else {
val (safePart, candidate) = s.tail.span(c => c != startMarkerHead && c != endMarkerHead)
if (nestingLevel == 0)
removeBlocksAcc(candidate, acc + s.head + safePart, nestingLevel)
else
removeBlocksAcc(candidate, acc, nestingLevel)
}
}
}
removeBlocksAcc(text, "", 0)
}
3 Answers 3
I think you are trying to do some pattern matching on strings.
So I did this:
def escape(s: String): String = Seq("{", "}", "whatever").foldLeft(s)((x, y) => x.replaceAllLiterally(y, "\\" + y))
def removeBlocks(text: String, open: String, close: String): String = {
val startReg = (s"${escape(open)}(.*)$$").r
val endReg = (s"${escape(close)}(.*)$$").r
val otherText = "(.)(.*)$".r
@tailrec
def removeBlocksAux(text: String, acc: String = "", lvl: Int = 0): String = {
assume((text.length() > 0 || lvl == 0) && lvl >= 0)
(text, lvl) match {
case (startReg(text), lvl) => removeBlocksAux(text, acc, lvl + 1)
case (endReg(text), lvl) => removeBlocksAux(text, acc, lvl - 1)
case (otherText(c, text), 0) => removeBlocksAux(text, acc + c)
case (otherText(c, text), lvl) => removeBlocksAux(text, acc, lvl)
case ("", _) => acc
}
}
removeBlocksAux(text)
}
I tried to find a way to define the regular expressions inside the match
group, but I failed at that, it may not be a good idea after all, but I would like to have a more compact and efficient representation for it. The escape function should escape all special cases of operators in Scala regex, that sounds boring, so there may be already a method for that.
I'm also not a big fan of having a function inside a function, but I kept that design decision. Finally, I'd simply use parser combinators for this, but I guess that's not the point.
-
\$\begingroup\$ Thank you, but regular expressions are just not up to the task. Try this:
removeBlocks("123{a{b{c}b}a}456{a}789", "{", "}")
. This should give123456789
, but if we replace with(s"${escape(open)}(.*)$$").r
, it will leave just123789
. Doing non-greedy matching won't help either. \$\endgroup\$escargot agile– escargot agile2014年12月04日 07:16:48 +00:00Commented Dec 4, 2014 at 7:16 -
\$\begingroup\$ If you are going to say that my code doesn't work you could at least try to run it before... \$\endgroup\$Trylks– Trylks2014年12月04日 11:13:19 +00:00Commented Dec 4, 2014 at 11:13
-
\$\begingroup\$ You're right, I'm so sorry! It does work. I was sure there's no solution with regular expressions, but I sloppily missed the fact that your regular expressions don't try to capture the entire block, but only its beginning or end. This is indeed an elegant solution, and it's especially good because it challenged my assumptions. Thank you! \$\endgroup\$escargot agile– escargot agile2014年12月08日 06:10:57 +00:00Commented Dec 8, 2014 at 6:10
As Illya has mentioned Regex alone cannot solve this problem. Instead, we must employ some type of stack-like data structure to give our program memory of past characters. The first three lines of the main function are just a cheap way to replace blockers with a single character. A better solution would be to tokenize txt
so that this isn't necessary.
def removeBlocks(txt: String, startBlocker: String, endBlocker: String): String = {
val left = "["
val right = "]"
val filteredTxt = txt.replace(startBlocker, left).replace(endBlocker, right)
def func(xs: List[String], stk: List[String]): List[String] = xs match {
case Nil => Nil
case `left` :: tail => func(tail, left :: stk)
case `right` :: tail => func(tail, stk.tail)
case head :: tail =>
if (stk.nonEmpty && stk.head == left)
func(tail, stk)
else
head :: func(tail, stk)
}
func((filteredTxt map(_.toString)).toList, Nil).mkString
}
-
\$\begingroup\$ Thank you, replacing the blockers with constant one-character strings is a great idea. Now I just need to find two characters that can't appear in the text, or alternatively replace all the occurrences of "[" and "]" with some improbable string, and then replace back in the end. \$\endgroup\$escargot agile– escargot agile2014年12月08日 06:14:33 +00:00Commented Dec 8, 2014 at 6:14
The simplest thing that comes to mind, provided you're willing to escape special regex chars (like the braces in your examples) yourself (or use Apache Commons, see below) would be this:
def removeBlocks(text: String, startMarker: String, endMarker: String) = {
val pattern = s"${startMarker}(.*)${endMarker}"
text.replaceAll(pattern, "")
}
scala> removeBlocks("123<div>456</div>789", "<div>", "</div>")
res20: String = 123789
scala> removeBlocks("123{456}789", "\\{", "\\}")
res19: String = 123789
If you were willing to use Apache Commons, commons lang has StringEscapeUtils which you could use to remove that issue.
-
1\$\begingroup\$ Regex is not strong enough for this task, because regex is a regular language. Try this:
removeBlocks("123{a{b{c}b}a}456{a}789", "{", "}")
. This should give123456789
, but no regex will do this job well. \$\endgroup\$escargot agile– escargot agile2014年12月03日 20:08:47 +00:00Commented Dec 3, 2014 at 20:08 -
\$\begingroup\$ Ack, you're right. It looks like other people are jumping in w/ answers so I'll leave it to them \$\endgroup\$geoffjentry– geoffjentry2014年12月06日 16:05:10 +00:00Commented Dec 6, 2014 at 16:05
Explore related questions
See similar questions with these tags.
removeBlocks
example. Wouldn't you be left with 123<<<789 \$\endgroup\$