I've written a program that parses JSON downloaded from Facebook to calculate some stats from a conversation between two parties. It attemps to count these things:
- Messages grouped by day of week
- Amount of messages per person
- Messages grouped by year-month
- Messages grouped by month
- Messages grouped by distinct date
- Messages grouped by hour of day
- The day with most amount of messages
- Count frequency of words
- Average amount of messages per day
Example JSON
{
"messages": [
{
"sender_name": "Sender One",
"timestamp": 1526128030,
"content": "Hello, how are you?",
"type": "Generic"
},
{
"sender_name": "Sender Two",
"timestamp": 1526128030,
"content": "Hi! I am very well. Thank you for asking",
"type": "Generic"
}
]
}
Code
import io.circe.generic.extras.{Configuration, semiauto}
import io.circe.{Decoder, Encoder}
import java.time.Instant.ofEpochSecond
import java.time.LocalDateTime.ofInstant
import java.time._
case class Messages(messages: Seq[Message])
object Messages {
implicit val jsonConfiguration: Configuration = Configuration.default
implicit val decoder: Decoder[Messages] = semiauto.deriveDecoder
implicit val encoder: Encoder[Messages] = semiauto.deriveEncoder
}
case class Message(senderName: String, timestamp: Long, content: Option[String], `type`: String)
object Message {
implicit val jsonConfiguration: Configuration = Configuration.default.withSnakeCaseMemberNames
implicit val decoder: Decoder[Message] = semiauto.deriveDecoder
implicit val encoder: Encoder[Message] = semiauto.deriveEncoder
}
object MessageAnalyticService {
def senderMsgCount(a: Messages, senderName: String): Int = {
a.messages.count(msg => msg.senderName == senderName)
}
def dayMaxMessages(a: Messages): (LocalDate, Int) = {
a.messages
.groupBy(m => toLocalDate(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq.maxBy(_._2)
}
def byDay(a: Messages): Seq[(LocalDate, Int)] = {
a.messages
.groupBy(m => toLocalDate(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq
.sortWith(_._1 isAfter _._1)
}
def dayAverage(a: Messages): Int = {
val r = a.messages
.groupBy(m => toLocalDate(m.timestamp))
.map(t => t._2.length)
val average = r.sum / r.size
average
}
def byYearMonth(a: Messages): Seq[(YearMonth, Int)] = {
a.messages
.groupBy(m => toYearMonth(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq
.sortWith(_._1 isAfter _._1)
}
def byHourOfDay(a: Messages): Seq[(LocalTime, Int)] = {
a.messages
.groupBy(m => toHour(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq
.sortBy(_._1)
}
def byWeekDay(a: Messages): Seq[(DayOfWeek, Int)] = {
a.messages
.groupBy(m => toDayOfWeek(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq
.sortBy(_._1)
}
def byWord(a: Messages): Seq[(String, Int)] = {
a.messages
.flatMap(m => m.content)
.mkString(" ")
.toLowerCase
.replaceAll("(\\')", "")
.replaceAll("(\\.)|(\\?)|(\\!)|(\\\")|(\\;)", " ")
.split(" ")
.groupBy(w => w)
.map(e => e._1 -> e._2.length)
.toSeq
.sortWith(_._2 > _._2)
}
private def toDayOfWeek(epoch: Long): DayOfWeek = {
DayOfWeek.from(ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault()))
}
private def toHour(epoch: Long): LocalTime = {
LocalTime.from(ofInstant(ofEpochSecond(epoch - (epoch % 3600)), ZoneId.systemDefault()))
}
private def toYearMonth(epoch: Long): YearMonth = {
YearMonth.from(ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault()))
}
private def toLocalDate(epoch: Long): LocalDate = {
ofEpochSecond(epoch).atZone(ZoneId.systemDefault).toLocalDate
}
}
Example main
object ApplicationMain extends App {
val jsonString = Source.fromURL(getClass.getResource("/message_mihaooo.json")).mkString
parseString(jsonString) match {
case Right(r) =>
val words = byWord(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(words, "by_word.csv")
val s = byDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(s, "by_day.csv")
val bWD = byWeekDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(bWD, "by_week_day.csv")
val bYM = byYearMonth(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(bYM, "by_year_month.csv")
val bHOD = byHourOfDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(bHOD, "by_hour_of_day.csv")
CsvWriter.writeToFile(senderMsgCount(r, "PARTICIPANT 1").toString, "message_count_participant_one.csv")
CsvWriter.writeToFile(senderMsgCount(r, "PARTICIPANT 2").toString, "message_count_participant_two.csv")
CsvWriter.writeToFile(s"${dayMaxMessages(r)._1}, ${dayMaxMessages(r)._2}", "max.csv")
case Left(e) => println(e)
}
}
Github with project: https://github.com/truongio/messie
-
\$\begingroup\$ I'm not really a scala coder, but it all looks fine. Your functions are succinct, you use proper streaming manipulation functions, and you keep your code simple, even if there's maybe a tiny bit of redundancy. I don't see much that could be substantially improved, besides maybe moving a few constant strings into named constants (such as the resource path). \$\endgroup\$scnerd– scnerd2018年06月06日 18:52:45 +00:00Commented Jun 6, 2018 at 18:52
2 Answers 2
First of all, it should be underlined that the code in the question is quite neat, clean, readable and adequate to the task to solve.
However, I have several suggestions, but they are more a kind of polishing the existing code.
Messages
I think that case class Messages
does not make sense. Its only function is to wrap the collection of Message
items, which sometimes can make code more complex.
It neither makes sense from Circe's parsing point of view, because collections are supported out of the box and a direct usage of Seq[Message]
should not cause problems.
If there are too many occurrences of Seq[Message]
in the code, they can be aliased:
object MessageAnalyticService {
type Messages = Seq[Message]
def senderMsgCount(a: Messages, senderName: String): Int = {
a.count(msg => msg.senderName == senderName)
}
def dayMaxMessages(a: Messages): (LocalDate, Int) = {
a.groupBy(m => toLocalDate(m.timestamp))
.map(t => t._1 -> t._2.length)
.toSeq.maxBy(_._2)
}
...
}
And there is an immediate benefit: each function has one line less now, without any decrease in readability!
Temporal Values
One should be careful when using temporal values with time zones. In MessageAnalyticService
calls like
ZonedDateTime.ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault())
It would be OK if the system default zone id is UTC.
Supposing (and that should be the case) that the input JSON contains timestamps in UTC in "timestamp"
fields.
If the zone id of the host where the code is executed is different from UTC, the converted ZonedDateTime
will contain time zone offset and thus the values returned from these private temporal def
s will be relative to the zone of the host, but not to the original timestamp zone. This may bring surprising results, especially difficult to debug. So I'd suggest replacing ZoneId.systemDefault()
with ZoneId.of("UTC")
.
Regular Expressions
def byWord
contains weird regular expression args to .replaceAll
calls: no need to use groups, |
or too many escape chars:
.replaceAll("'", "")
.replaceAll("[.?!\\\\;]", " ")
(ugly escapes only for the ugly backslashes).
Strings Processing
The same def byWord
might produce a very huge string:
a.messages
.flatMap(m => m.content)
.mkString(" ") // <-- here!
.toLowerCase
This will be a single string representing a concatenation of all the content
strings in all the messages. And it will be split by space anyway further in the chaining. I think that this sequence will be more efficient if the replacements were made on each content string, then split into separate words. This could look like:
private def extractWords(content: Option[String]): Seq[String] =
content.map(c =>
c.replaceAll("'", "")
.replaceAll("[.?!\\\\;]", " ")
.toLowerCase
.split(" ").toSeq
).getOrElse(Nil)
def byWord(a: Messages): Seq[(String, Int)] = {
a.flatMap(m => extractWords(m.content))
.groupBy(w => w)
.map(e => e._1 -> e._2.length)
.toSeq
.sortWith(_._2 > _._2)
}
Repetitive Code
In ApplicationMain
, these calls are repetitive:
val words = byWord(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(words, "by_word.csv")
They can be generalized into a function that writes into CSV:
private def writeAsCsv[T](messages: Messages,
analyticsFunction: Messages => Seq[(T, Int)],
fileName: String): Unit = {
val contents = analyticsFunction(messages).map(p => s"${p._1}; ${p._2}").mkString("\n")
CsvWriter.writeToFile(contents, fileName)
}
And the matcher sequence will be transformed into:
case Right(r) =>
writeAsCsv(r, byWord, "by_word.csv")
writeAsCsv(r, byDay, "by_day.csv")
writeAsCsv(r, byWeekDay, "by_week_day.csv")
writeAsCsv(r, byYearMonth, "by_year_month.csv")
writeAsCsv(r, byHourOfDay, "by_hour_of_day.csv")
// etc
and can even be transformed further into a loop.
Either
vs Try
Finally, a really minor thing.
Although this is a valid usage of matching by Either
, I'd remind about a simpler Try
matcher:
def parseString2(json: String): Messages = ???
Try(parseString2(jsonString)) match {
case Success(r) =>
writeAsCsv(r, byWord, "by_word.csv")
// etc
case Failure(ex) => println(ex.getMessage)
}
It looks more suitable here: no specificity or benefits of Either/Left/Right
is seen in this code.
-
\$\begingroup\$ Hi, thank you very much for the valuable feedback! For the case class and Either vs Try, it's because how I parse the JSON with Circe. See github.com/truongio/messie/blob/…. Maybe there is an easier way. Isn't Either right biased in newer Scala, so Try is just an Either that has throwable on Left? I probably made a mistake on the time zone. It's hard to know. I got the messages from Facebook. My timezone is UTC+2. \$\endgroup\$uraza– uraza2018年06月10日 18:53:29 +00:00Commented Jun 10, 2018 at 18:53
-
\$\begingroup\$ Apparently the timestamps were in UTC+2, same as my computer. I guess that makes sense considering it's personal data from Facebook and they know where you are from. \$\endgroup\$uraza– uraza2018年06月11日 07:04:33 +00:00Commented Jun 11, 2018 at 7:04
For advice on Scala (of which I'm no expert) I think Antot's answer does a great job.
However I noticed that your byWord function removes all ' characters which might give results you did not intend, e.g. in the sentence
"The Home Owner's Best Friend is a magazine that gives practical advice for the ambitious home owner"
your function would count "owner" once although it is in the sentence twice. The other occurrence is counted as "owners". This also raises the question, if plurals are counted separately from their singular forms.
Language is complex even if you are only counting words. If this potential error matters you may want to look for a library to do the word counting ...
A quick fix would be to replace ' by space and not count single letters.