Showing posts with label regular expression. Show all posts
Wednesday, February 3, 2010
Regex ReplaceAllIn
Note: Updated on Feb 13th for the newer API on Scala 2.8 trunk. (This is life on the bleeding edge, thanks Daniel).
A couple of new methods have just been added to the Scala 2.8 Regex class. You will need to download a version of Scala 2.8 more recent than Scala2.8-Beta1.
The methods replace text using a regular expression, and to say they are useful is an understatement. Let's take a look:
- scala> val quote = """I don't like to commit myself about heaven and hell - you see, I have friends in both places.
- | Mark Twain"""
- quote: java.lang.String =
- I don't like to commit myself about heaven and hell - you see, I have friends in both places.
- Mark Twain
- scala> val expr = "e".r
- expr: scala.util.matching.Regex = e
- /*
- This first method is not new, nor is it interesting. But the new methods are both related to it,
- so let's start with the basic form of replaceAllIn
- */
- scala> expr.replaceAllIn(quote, "**")
- res1: String =
- I don't lik** to commit mys**lf about h**av**n and h**ll - you s****, I hav** fri**nds in both plac**s.
- Mark Twain
- // this does the same thing
- scala> quote.replaceAll("e","**")
- res2: java.lang.String =
- I don't lik** to commit mys**lf about h**av**n and h**ll - you s****, I hav** fri**nds in both plac**s.
- Mark Twain
- /*
- Now things get interesting. Using this form of replaceAllIn we can determine the replacement on a case by case basis.
- It provides the Match object as the parameter so you have complete access to all
- the matched groups, the location of the match etc...
- The method takes a Match => String function. Very, very powerful.
- */
- scala> expr.replaceAllIn(quote, s => if(util.Random.nextBoolean) "?" else "*")
- res5: String =
- I don't lik? to commit mys?lf about h?av?n and h?ll - you s*?, I hav? fri*nds in both plac*s.
- Mark Twain
- /*
- Another example using some of the matcher functionality
- */
- scala> expr.replaceAllIn(quote, m => m.start.toString)
- res6: String =
- I don't lik11 to commit mys26lf about h37av40n and h48ll - you s5960, I hav68 fri73nds in both plac90s.
- Mark Twain
- /*
- Another crazy useful method is replaceSomeIn. It is similar to the replaceAllIn that takes a function, except that the function passed to replaceSomeIn returns an Option. If it returns None there is no replacement; otherwise the match is replaced with the contained value. Very nice when dealing with complex regular expressions.
- In this example we replace every 'e' that starts at or before index 50 in the string with -
- */
- scala> expr.replaceSomeIn(quote, m => if(m.start > 50) None else Some("-"))
- res3: String =
- I don't lik- to commit mys-lf about h-av-n and h-ll - you see, I have friends in both places.
- Mark Twain
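As a rough sketch of how the function-based forms can make use of the Match object, the following capitalizes every word with replaceAllIn and masks only the longer words with replaceSomeIn. The names ReplaceSketch, capitalize and maskLongWords are made up for illustration, and the code assumes a current Scala release rather than the 2.8 trunk used above.

```scala
object ReplaceSketch {
  // group 1 is the first letter of a word, group 2 is the rest of it
  private val Word = """(\w)(\w*)""".r

  // replaceAllIn with a Match => String function: rebuild each word from its groups
  def capitalize(s: String): String =
    Word.replaceAllIn(s, m => m.group(1).toUpperCase + m.group(2))

  // replaceSomeIn: None leaves the match untouched, Some(x) replaces it with x
  def maskLongWords(s: String): String =
    Word.replaceSomeIn(s, m =>
      if (m.matched.length > 4) Some("*" * m.matched.length) else None)
}
```

Calling ReplaceSketch.capitalize("mark twain") should give "Mark Twain", while maskLongWords only touches words longer than four characters.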
Labels:
intermediate,
regex,
regular expression,
Scala
Monday, January 11, 2010
Regular Expression 3: Regex matching
This post covers basically the same things as Matching Regular Expressions but goes into a bit more detail. I recommend reading both posts since there is unique information in each.
The primary new item shown here is that more advanced matching techniques can be used and, more importantly, that every group is matched, including groups nested within another group.
Note: The examples use Scala 2.8. Most examples will work with 2.7 but I believe the last example is Scala 2.8 only.
- scala> val date = "11/01/2010"
- date: java.lang.String = 11/01/2010
- scala> val Date = """(\d\d)/(\d\d)/(\d\d\d\d)""".r
- Date: scala.util.matching.Regex = (\d\d)/(\d\d)/(\d\d\d\d)
- /*
- When a Regex object is used in matching each group is assigned to a variable
- */
- scala> val Date(day, month, year) = date
- day: String = 11
- month: String = 01
- year: String = 2010
- scala> val Date = """(\d\d)/((\d\d)/(\d\d\d\d))""".r
- Date: scala.util.matching.Regex = (\d\d)/((\d\d)/(\d\d\d\d))
- /*
- This example demonstrates that all groups must be assigned; if not, a MatchError is thrown
- */
- scala> val Date(day, monthYear, month, year) = date
- day: String = 11
- monthYear: String = 01/2010
- month: String = 01
- year: String = 2010
- scala> val Date(day, month, year) = date
- scala.MatchError: 11/01/2010
- at .<init>(<console>:5)
- at .<clinit>(<console>)
- // but placeholders work in Regex matching as well:
- scala> val Date(day, _, month, year) = date
- day: String = 11
- month: String = 01
- year: String = 2010
- scala> val Names = """(\S+) (\S*)""".r
- Names: scala.util.matching.Regex = (\S+) (\S*)
- scala> val Names(first, second) = "Jesse Eichar"
- first: String = Jesse
- second: String = Eichar
- /*
- If you want to use a Regex in an assignment you must be sure the match will succeed. Otherwise you should do real matching
- */
- scala> val Names(first, second) = "Jesse"
- scala.MatchError: Jesse
- at .<init>(<console>:5)
- at .<clinit>(<console>)
- scala> val M = """\d{3}""".r
- M: scala.util.matching.Regex = \d{3}
- /*
- The pattern must bind exactly as many variables as the Regex has groups. M has no groups, so M(m) always fails (and "Jan" does not match \d{3} anyway)
- */
- scala> val M(m) = "Jan"
- scala.MatchError: Jan
- at .<init>(<console>:5)
- at .<clinit>(<console>)
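When the input is not guaranteed to match, one way to do the "real matching" mentioned above is to wrap the match in a method that returns an Option instead of throwing a MatchError. This is only a sketch; SafeDate and parse are hypothetical names, not part of the Regex API.

```scala
object SafeDate {
  private val Date = """(\d\d)/(\d\d)/(\d\d\d\d)""".r

  // Some((day, month, year)) when the whole string matches, None otherwise
  def parse(s: String): Option[(String, String, String)] = s match {
    case Date(day, month, year) => Some((day, month, year))
    case _                      => None
  }
}
```

The caller then decides what to do with None, for example SafeDate.parse(input).getOrElse(("x", "x", "x")).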
The following are a few more complex examples
- scala> val Date = """((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))""".r
- Date: scala.util.matching.Regex = ((\d\d)/(\d\d)/(\d{4}))|((\w{3}) (\d\d),\s?(\d{4}))
- /*
- The Regex contains an alternation (|), so only half of the groups will be non-null.
- If the first group is a String it is non-null, and the next three groups
- hold day/month/year
- Otherwise, if the 5th group is a String, the pattern was month day, year
- Lastly a catch-all
- */
- scala> def printDate(date:String) = date match {
- | case Date(_:String,day,month,year,_,_,_,_) => (day,month,year)
- | case Date(_,_,_,_,_:String,month,day,year) => (day,month,year) // process month
- | case _ => ("x","x","x")
- | }
- printDate: (date: String)(String, String, String)
- scala> printDate("Jan 01,2010")
- res0: (String, String, String) = (01,Jan,2010)
- scala> printDate("01/01/2010")
- res1: (String, String, String) = (01,01,2010)
- /*
- A silly example which drops the first element of the date string.
- Not useful, but it demonstrates that we are matching against a sequence, so
- _* can be used to match the rest of the groups
- */
- scala> def split(date:String) = date match {
- | case d @ Date(_:String ,_*) => d drop 3
- | case d @ Date(_,_,_,_,_:String,_*) => d drop 4
- | case _ => "boom"
- | }
- split: (date: String)String
- scala> split ("Jan 31,2004")
- res5: String = 31,2004
- scala> split ("11/12/2004")
- res6: String = 12/2004
- /*
- This is just a reminder that findAllIn returns an iterator, which (since it is probably a short iterator) can be converted to a sequence and processed with matching
- */
- scala> val Seq(one,two,_*) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
- one: String = 11/
- two: String = 01/
- scala> val Seq(one,two) = ("""\d\d/""".r findAllIn "11/01/2010" ).toSeq
- one: String = 11/
- two: String = 01/
- // drop the two first matches and assign the rest to d
- scala> val Seq(_,_,d @ _*) = ("""\d\d/""".r findAllIn "11/01/20/10/" ).toSeq
- d: Seq[String] = ArrayBuffer(20/, 10/)
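The alternation example above has to check which half of the groups is non-null. A possible alternative, sketched here with made-up names (DateFormats, Slashed, Worded, parts), is to keep one Regex per format and let the case clauses pick the one that matches, so no null checks are needed.

```scala
object DateFormats {
  // one Regex per supported format instead of a single alternation
  private val Slashed = """(\d\d)/(\d\d)/(\d{4})""".r
  private val Worded  = """(\w{3}) (\d\d),\s?(\d{4})""".r

  // returns (day, month, year) for either format, None for anything else
  def parts(date: String): Option[(String, String, String)] = date match {
    case Slashed(day, month, year) => Some((day, month, year))
    case Worded(month, day, year)  => Some((day, month, year))
    case _                         => None
  }
}
```

The trade-off is two patterns to maintain instead of one, but every bound variable is guaranteed to be a real String.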
Labels:
match,
MatchError,
matching,
regex,
regular expression,
Scala
Friday, January 8, 2010
Regular Expression 2 : The rest of the Regex class
This is the second installment of Regular expressions in Scala. In the first installment the basics were shown and a few of the methods in the Regex class were inspected. This topic will look at the rest of the methods in the Regex class.
Regex.findPrefixMatchOf
- /*
- returns the match if the regex is the prefix of the string
- */
- scala> "(h)(e)|(l)".r findPrefixMatchOf "hello xyz"
- res2: Option[scala.util.matching.Regex.Match] = Some(he)
- scala> "lo".r findPrefixMatchOf "hello xyz"
- res3: Option[scala.util.matching.Regex.Match] = None
- /*
- The method is essentially the same as anchoring the regex with ^ and using findFirstMatchIn
- */
- scala> "^ab".r findFirstMatchIn "ababab"
- res8: Option[scala.util.matching.Regex.Match] = Some(ab)
- scala> "^ab".r findFirstMatchIn "hababab"
- res9: Option[scala.util.matching.Regex.Match] = None
- /*
- findPrefixOf is the same but returns the matched string instead
- */
- scala> "ab".r findPrefixOf "haababab"
- res11: Option[String] = None
- scala> "ab".r findPrefixOf "ababab"
- res12: Option[String] = Some(ab)
Regex.replaceAllIn -- Essentially the same as using String.replaceAll
Regex.replaceFirstIn -- Essentially the same as using String.replaceFirst
- scala> "(h)(e)|(l)".r replaceAllIn ("hello xyz","__")
- res13: String = ______o xyz
- scala> "hello xyz" replaceAll ("(h)(e)|(l)","__")
- res14: java.lang.String = ______o xyz
- scala> "hello xyz" replaceFirst ("(h)(e)|(l)","__")
- res16: java.lang.String = __llo xyz
- scala> "(h)(e)|(l)".r replaceFirstIn ("hello xyz","__")
- res17: String = __llo xyz
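One thing the replacement string can do, in both the String and the Regex versions, is refer back to captured groups with $1, $2 and so on, as in java.util.regex. A small sketch follows; Reorder, toIso and firstOnly are names invented for this example, and the day/month order is assumed to be the one used in the other posts.

```scala
object Reorder {
  private val Date = """(\d\d)/(\d\d)/(\d{4})""".r

  // $1 = day, $2 = month, $3 = year; rewrite every date as year-month-day
  def toIso(s: String): String = Date.replaceAllIn(s, "$3-$2-$1")

  // replaceFirstIn only touches the first match
  def firstOnly(s: String): String = Date.replaceFirstIn(s, "<date>")
}
```

Because $ has this special meaning, a literal dollar sign in a replacement string needs to be escaped.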
This next section is not Scala-specific, but because Regex does not provide a way to set flags such as CASE_INSENSITIVE or DOTALL, it is useful to show how to set them as part of the standard regex syntax.
- // examples based on the Java blog post at: http://www.javaranch.com/journal/2003/04/RegexTutorial.htm#flags
- scala> val input = """Hey, diddle, diddle,
- | |The cat and the fiddle,
- | |The cow jumped over the moon.
- | |The little dog laughed
- | |To see such sport,
- | |And the dish ran away with the spoon.""".stripMargin
- input: String =
- Hey, diddle, diddle,
- The cat and the fiddle,
- The cow jumped over the moon.
- The little dog laughed
- To see such sport,
- And the dish ran away with the spoon.
- // by default regex is case sensitive
- scala> """the \w+?(?=\W)""".r findAllIn input foreach (println _)
- the fiddle
- the moon
- the dish
- the spoon
- /* The (?i) makes the match case insensitive. The complete set of options is:
- (?idmsux)
- i - case insensitive
- d - only unix lines are recognized as end of line
- m - enable multiline mode
- s - . matches any characters including line end
- u - Enables Unicode-aware case folding
- x - Permits whitespace and comments in pattern
- */
- scala> """(?i)the \w+?(?=\W)""".r findAllIn input foreach (println _)
- The cat
- the fiddle
- The cow
- the moon
- The little
- the dish
- the spoon
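To make the effect of the embedded flags a little more concrete, here is a sketch that compares patterns with and without (?i) and (?s) on a shortened version of the rhyme. The object name Flags and its members are assumptions made for the example, and findAllMatchIn is used so the captured word can be pulled out of each Match.

```scala
object Flags {
  val input = "The cat and the fiddle,\nThe cow jumped over the moon."

  // words following a lower-case "the" only
  def afterThe: List[String] =
    """the (\w+)""".r.findAllMatchIn(input).map(_.group(1)).toList

  // (?i) also accepts "The"
  def afterTheIgnoreCase: List[String] =
    """(?i)the (\w+)""".r.findAllMatchIn(input).map(_.group(1)).toList

  // without (?s) the dot stops at the line end, so "cat" and "cow" are never joined
  def spansLines: Boolean = """cat.*cow""".r.findFirstIn(input).isDefined

  // with (?s) the dot matches the line end as well
  def spansLinesDotAll: Boolean = """(?s)cat.*cow""".r.findFirstIn(input).isDefined
}
```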
Labels:
regex,
regular expression,
Scala
Wednesday, January 6, 2010
Regular Expression 1 : Basics and findAllIn
This is the first of a series on regular expression use in Scala. A related earlier post, Matching Regular Expressions, is also worth looking at, though the same tips will be revisited in this series.
Perhaps the most important thing for regular expressions in Scala is to be aware of the raw string syntax:
- /*
- normal strings treat the \ character as the escape character
- so this fails
- */
- scala> val normalString = "\.+((xyz)|(abc))"
- <console>:1: error: invalid escape character
- val normalString = "\.+((xyz)|(abc))"
- ^
- /*
- raw strings are great for regular expressions because you don't have to
- escape \ characters
- */
- scala> val rawString = """\.+((xyz)|(abc))"""
- rawString: java.lang.String = \.+((xyz)|(abc))
The next thing is to realize that one can easily create a Regex object from any string with the r method:
- scala> val regex = """\.+((xyz)|(abc))""".r
- regex: scala.util.matching.Regex = \.+((xyz)|(abc))
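As a quick check that a raw string really is just a more convenient spelling, the sketch below builds the same pattern both ways and compares them. RawStrings and matchesFully are hypothetical names; the full-match test goes through the underlying java.util.regex.Pattern exposed by Regex.pattern.

```scala
object RawStrings {
  // the normal string needs the backslash doubled, the raw string does not
  val escaped: String = "\\.+((xyz)|(abc))"
  val raw: String     = """\.+((xyz)|(abc))"""

  val regex = raw.r

  // true only when the whole input matches the pattern
  def matchesFully(s: String): Boolean = regex.pattern.matcher(s).matches()
}
```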
A regex has the standard matching methods one might expect, but let's look at findAllIn and the associated MatchData for now. findAllIn returns a MatchIterator, which is an Iterator[String] with MatchData. Iterating over the MatchIterator returns the full matched string; if you need the subgroups, convert the MatchIterator to an Iterator[Match] with matchData.
- // findAllIn returns an iterator over the matches.
- scala> "l|he".r findAllIn "hello xyz" foreach {println _}
- he
- l
- l
- // Each match can have multiple groups
- // Note: each element of a MatchIterator is a String (not a Match object)
- // m.matched is the full matched String, so mkString puts a comma between its characters
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched mkString ",")}
- h,e
- l
- l
- // to access subgroups use the matchData method
- // Note: there are 3 subgroups in the regex
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println( m.subgroups mkString ",")}
- h,e,null
- null,null,l
- null,null,l
- /*
- if matched is called the full match is returned (as if you did not convert the iterator to an Iterator[Match])
- */
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.matched)}
- he
- l
- l
- /*
- The following demonstrates more of the methods on Match
- Essentially the elements are:
- (start index of match, end index of match, string before match, string after match, string the match was performed on)
- */
- scala> ("(h)(e)|(l)".r findAllIn "hello xyz").matchData foreach { m => println(m.start, m.end, m.before, m.after, m.source)}
- (0,2,,llo xyz,hello xyz)
- (2,3,he,lo xyz,hello xyz)
- (3,4,hel,o xyz,hello xyz)
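In current Scala releases the same information can, I believe, be collected without going through matchData by calling findAllMatchIn, which yields Match objects directly. The following is a sketch under that assumption; Spans and spans are made-up names.

```scala
object Spans {
  private val HeOrL = "(h)(e)|(l)".r

  // (start index, end index, matched text) for every match in s
  def spans(s: String): List[(Int, Int, String)] =
    HeOrL.findAllMatchIn(s).map(m => (m.start, m.end, m.matched)).toList
}
```

Collecting the values into a List of tuples also makes them easier to compare or test than printing them.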
The last methods to look at for this topic are findFirstIn, findFirstMatchIn and the Regex constructor:
- /*
- Group names can be assigned if the Regex constructor is used
- */
- scala> val withNames = new util.matching.Regex("(h)(e)|(l)", "h", "e", "l")
- withNames: scala.util.matching.Regex = (h)(e)|(l)
- scala> withNames findFirstIn "hello xyz"
- res28: Option[String] = Some(he)
- /*
- I know a match will be found so I am extracting the value from the Option by assigning it to Some(he)
- */
- scala> val Some(he) = withNames findFirstMatchIn "hello xyz"
- he: scala.util.matching.Regex.Match = he
- scala> he.groupNames
- res29: Seq[String] = Array(h, e, l)
- scala> he.group("h")
- res30: String = h
- scala> he.group("e")
- res31: String = e
- scala> he.group(1)
- res32: String = h
- scala> he.group(2)
- res33: String = e
- // Uh oh. NullPointer warning!
- scala> he.group(3)
- res34: String = null
- scala> he.groupCount
- res35: Int = 3
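Since a group that did not take part in the match comes back as null, one defensive option is to wrap each group in an Option before using it. This is only a sketch; Groups and firstGroups are names invented here.

```scala
object Groups {
  private val HeOrL = "(h)(e)|(l)".r

  // every group of the first match, with null turned into None
  def firstGroups(s: String): List[Option[String]] =
    HeOrL.findFirstMatchIn(s) match {
      case Some(m) => (1 to m.groupCount).map(i => Option(m.group(i))).toList
      case None    => Nil
    }
}
```

Option(null) evaluates to None, so the third group of the "he" match shows up as None rather than as a null waiting to cause a NullPointerException.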
Labels:
intermediate,
regex,
regular expression,
Scala
Matching in for-comprehensions
At a glance a for-comprehension appears to be equivalent to a Java for-loop, but it is much, much more than that. As shown in the post for-comprehensions, a for-comprehension can have guards that filter which elements are processed:
- scala> for ( x <- 1 to 10; if (x >4) ) println(x)
- 5
- 6
- 7
- 8
- 9
- 10
They can be used to construct new collections:
- scala> for( i <- List( "a", "b", "c") ) yield "Word: "+i
- res1: List[java.lang.String] = List(Word: a, Word: b, Word: c)
They can contain multiple generators:
- scala> for {x <- 1 to 10
- | if(x%2 == 0)
- | y <- 1 to 5} yield (x,y)
- res1: scala.collection.immutable.IndexedSeq[(Int, Int)] = IndexedSeq((2,1), (2,2), (2,3), (2,4), (2,5), (4,1), (4,2), (4,3), (4,4), (4,5), (6,1), (6,2), (6,3), (6,4), (6,5), (8,1), (8,2), (8,3), (8,4), (8,5), (10,1), (10,2), (10,3), (10,4), (10,5))
What has not been covered is that the assignments also do pattern matching:
- scala> for ( (x,y) <- (6 to 1 by -2).zipWithIndex) println (x,y)
- (6,0)
- (4,1)
- (2,2)
This is not surprising as this also occurs during normal assignment. But what is interesting is that the pattern matching can act as a guard as well. See Extractor examples and Assignment and Parameter Objects for more information on pattern matching and extractors.
- scala> val args = Array( "h=2", "b=3")
- args: Array[java.lang.String] = Array(h=2, b=3)
- scala> val Property = """(.+)=(.+)""".r
- Property: scala.util.matching.Regex = (.+)=(.+)
- scala> for {Property(key,value) <- args } yield (key,value)
- res0: Array[(String, String)] = Array((h,2), (b,3))
- scala> Map(res0:_*)
- res1: scala.collection.immutable.Map[String,String] = Map(h -> 2, b -> 3)
- scala> res1("h")
- res3: String = 2
Now just for fun here is a similar example but using symbols instead of strings for the key values:
- scala> val args = Array( "h=2", "b=3")
- args: Array[java.lang.String] = Array(h=2, b=3)
- scala> val Property = """(.+)=(.+)""".r
- Property: scala.util.matching.Regex = (.+)=(.+)
- scala> for {Property(key,value) <- args } yield (Symbol(key),value)
- res0: Array[(Symbol, String)] = Array(('h,2), ('b,3))
- scala> Map(res0:_*)
- res1: scala.collection.immutable.Map[Symbol,String] = Map('h -> 2, 'b -> 3)
- scala> res1('h)
- res2: String = 2
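The guard-like behaviour can also be spelled out with collect, which keeps only the elements the pattern matches and, as far as I know, behaves the same way across Scala versions. A minimal sketch, with Props and parse as hypothetical names:

```scala
object Props {
  private val Property = """(.+)=(.+)""".r

  // arguments that do not look like key=value are silently skipped
  def parse(args: Seq[String]): Map[String, String] =
    args.collect { case Property(key, value) => key -> value }.toMap
}
```

Here an argument such as "junk" simply disappears from the result instead of causing a MatchError.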
Labels:
for-comprehension,
intermediate,
matching,
regex,
regular expression,
Scala,
symbol
Wednesday, September 16, 2009
Matching Regular expressions
This topic is derived from the blog post: Using pattern matching with regular expressions in Scala
The Regex class in Scala provides a very handy feature that allows you to match against regular expressions. This makes dealing with certain types of regular expression very clean and easy to follow.
What needs to be done is to create a Regex instance and assign it to a val. It is recommended that the val name start with an uppercase letter; see the topic on matching for the assumptions matching makes based on the first letter of an identifier in a case clause.
There is nothing like examples to help explain an idea:
- // I normally use raw strings (""") for regular expressions so that I don't have to escape all \ characters
- // There are two ways to create Regex objects.
- // 1. Use the RichString's r method
- // 2. Create it using the normal Regex constructor
- scala> val Name = """(\w+)\s+(\w+)""".r
- Name: scala.util.matching.Regex = (\w+)\s+(\w+)
- scala> import scala.util.matching._
- import scala.util.matching._
- // Notice the val name starts with an upper case letter
- scala> val Name = new Regex("""(\w+)\s+(\w+)""")
- Name: scala.util.matching.Regex = (\w+)\s+(\w+)
- scala> "Jesse Eichar" match {
- | case Name(first,last) => println("found: ", first, last)
- | case _ => println("oh no!")
- | }
- (found: ,Jesse,Eichar)
- scala> val FullName = """(\w+)\s+(\w+)\s+(\w+)""".r
- FullName: scala.util.matching.Regex = (\w+)\s+(\w+)\s+(\w+)
- // If you KNOW that the match will work you can assign it to a variable
- // Only do this if you are sure the match will work otherwise you will get a MatchError
- scala> val FullName(first, middle, last) = "Jesse Dale Eichar"
- first: String = Jesse
- middle: String = Dale
- last: String = Eichar
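When more than one shape of input is possible, several Regex vals can be tried in order within a single match, which avoids the MatchError risk of a direct assignment. The sketch below is illustrative only; Names and initials are made-up names.

```scala
object Names {
  private val Name     = """(\w+)\s+(\w+)""".r
  private val FullName = """(\w+)\s+(\w+)\s+(\w+)""".r

  // the first case whose Regex matches the whole string wins
  def initials(s: String): String = s match {
    case FullName(first, middle, last) => s"${first.head}${middle.head}${last.head}"
    case Name(first, last)             => s"${first.head}${last.head}"
    case _                             => "?"
  }
}
```

Each group is matched by \w+, so it is never empty and calling head on it is safe.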
Labels:
intermediate,
match,
MatchError,
matching,
raw strings,
regex,
regular expression,
Scala