Program to calculate some stats from Facebook messages

Question 1

I've written a program that parses JSON downloaded from Facebook to calculate some stats from a conversation between two parties. It attemps to count these things:

Messages grouped by day of week
Amount of messages per person
Messages grouped by year-month
Messages grouped by month
Messages grouped by distinct date
Messages grouped by hour of day
The day with most amount of messages
Count frequency of words
Average amount of messages per day

Example JSON

{
 "messages": [
 {
 "sender_name": "Sender One",
 "timestamp": 1526128030,
 "content": "Hello, how are you?",
 "type": "Generic"
 },
 {
 "sender_name": "Sender Two",
 "timestamp": 1526128030,
 "content": "Hi! I am very well. Thank you for asking",
 "type": "Generic"
 }
 ]
}

Code

import io.circe.generic.extras.{Configuration, semiauto}
import io.circe.{Decoder, Encoder}
import java.time.Instant.ofEpochSecond
import java.time.LocalDateTime.ofInstant
import java.time._
case class Messages(messages: Seq[Message])
object Messages {
 implicit val jsonConfiguration: Configuration = Configuration.default
 implicit val decoder: Decoder[Messages] = semiauto.deriveDecoder
 implicit val encoder: Encoder[Messages] = semiauto.deriveEncoder
}
case class Message(senderName: String, timestamp: Long, content: Option[String], `type`: String)
object Message {
 implicit val jsonConfiguration: Configuration = Configuration.default.withSnakeCaseMemberNames
 implicit val decoder: Decoder[Message] = semiauto.deriveDecoder
 implicit val encoder: Encoder[Message] = semiauto.deriveEncoder
}
object MessageAnalyticService {
 def senderMsgCount(a: Messages, senderName: String): Int = {
 a.messages.count(msg => msg.senderName == senderName)
 }
 def dayMaxMessages(a: Messages): (LocalDate, Int) = {
 a.messages
 .groupBy(m => toLocalDate(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq.maxBy(_._2)
 }
 def byDay(a: Messages): Seq[(LocalDate, Int)] = {
 a.messages
 .groupBy(m => toLocalDate(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq
 .sortWith(_._1 isAfter _._1)
 }
 def dayAverage(a: Messages): Int = {
 val r = a.messages
 .groupBy(m => toLocalDate(m.timestamp))
 .map(t => t._2.length)
 val average = r.sum / r.size
 average
 }
 def byYearMonth(a: Messages): Seq[(YearMonth, Int)] = {
 a.messages
 .groupBy(m => toYearMonth(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq
 .sortWith(_._1 isAfter _._1)
 }
 def byHourOfDay(a: Messages): Seq[(LocalTime, Int)] = {
 a.messages
 .groupBy(m => toHour(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq
 .sortBy(_._1)
 }
 def byWeekDay(a: Messages): Seq[(DayOfWeek, Int)] = {
 a.messages
 .groupBy(m => toDayOfWeek(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq
 .sortBy(_._1)
 }
 def byWord(a: Messages): Seq[(String, Int)] = {
 a.messages
 .flatMap(m => m.content)
 .mkString(" ")
 .toLowerCase
 .replaceAll("(\\')", "")
 .replaceAll("(\\.)|(\\?)|(\\!)|(\\\")|(\\;)", " ")
 .split(" ")
 .groupBy(w => w)
 .map(e => e._1 -> e._2.length)
 .toSeq
 .sortWith(_._2 > _._2)
 }
 private def toDayOfWeek(epoch: Long): DayOfWeek = {
 DayOfWeek.from(ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault()))
 }
 private def toHour(epoch: Long): LocalTime = {
 LocalTime.from(ofInstant(ofEpochSecond(epoch - (epoch % 3600)), ZoneId.systemDefault()))
 }
 private def toYearMonth(epoch: Long): YearMonth = {
 YearMonth.from(ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault()))
 }
 private def toLocalDate(epoch: Long): LocalDate = {
 ofEpochSecond(epoch).atZone(ZoneId.systemDefault).toLocalDate
 }
}

Example main

object ApplicationMain extends App {
 val jsonString = Source.fromURL(getClass.getResource("/message_mihaooo.json")).mkString
 parseString(jsonString) match {
 case Right(r) =>
 val words = byWord(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(words, "by_word.csv")
 val s = byDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(s, "by_day.csv")
 val bWD = byWeekDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(bWD, "by_week_day.csv")
 val bYM = byYearMonth(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(bYM, "by_year_month.csv")
 val bHOD = byHourOfDay(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(bHOD, "by_hour_of_day.csv")
 CsvWriter.writeToFile(senderMsgCount(r, "PARTICIPANT 1").toString, "message_count_participant_one.csv")
 CsvWriter.writeToFile(senderMsgCount(r, "PARTICIPANT 2").toString, "message_count_participant_two.csv")
 CsvWriter.writeToFile(s"${dayMaxMessages(r)._1}, ${dayMaxMessages(r)._2}", "max.csv")
 case Left(e) => println(e)
 }
}

Github with project: https://github.com/truongio/messie

Question 2

I'm not really a scala coder, but it all looks fine. Your functions are succinct, you use proper streaming manipulation functions, and you keep your code simple, even if there's maybe a tiny bit of redundancy. I don't see much that could be substantially improved, besides maybe moving a few constant strings into named constants (such as the resource path).

Question 3

First of all, it should be underlined that the code in the question is quite neat, clean, readable and adequate to the task to solve.

However, I have several suggestions, but they are more a kind of polishing the existing code.

`Messages`

I think that case class Messages does not make sense. Its only function is to wrap the collection of Message items, which sometimes can make code more complex.

It neither makes sense from Circe's parsing point of view, because collections are supported out of the box and a direct usage of Seq[Message] should not cause problems.

If there are too many occurrences of Seq[Message] in the code, they can be aliased:

object MessageAnalyticService {
 type Messages = Seq[Message]
 def senderMsgCount(a: Messages, senderName: String): Int = {
 a.count(msg => msg.senderName == senderName)
 }
 def dayMaxMessages(a: Messages): (LocalDate, Int) = {
 a.groupBy(m => toLocalDate(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq.maxBy(_._2)
 }
 ...
}

And there is an immediate benefit: each function has one line less now, without any decrease in readability!

Temporal Values

One should be careful when using temporal values with time zones. In MessageAnalyticService calls like

ZonedDateTime.ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault())

It would be OK if the system default zone id is UTC.

Supposing (and that should be the case) that the input JSON contains timestamps in UTC in "timestamp" fields.

If the zone id of the host where the code is executed is different from UTC, the converted ZonedDateTime will contain time zone offset and thus the values returned from these private temporal defs will be relative to the zone of the host, but not to the original timestamp zone. This may bring surprising results, especially difficult to debug. So I'd suggest replacing ZoneId.systemDefault() with ZoneId.of("UTC").

Regular Expressions

def byWord contains weird regular expression args to .replaceAll calls: no need to use groups, | or too many escape chars:

.replaceAll("'", "")
.replaceAll("[.?!\\\\;]", " ")

(ugly escapes only for the ugly backslashes).

Strings Processing

The same def byWord might produce a very huge string:

a.messages
 .flatMap(m => m.content)
 .mkString(" ") // <-- here!
 .toLowerCase

This will be a single string representing a concatenation of all the content strings in all the messages. And it will be split by space anyway further in the chaining. I think that this sequence will be more efficient if the replacements were made on each content string, then split into separate words. This could look like:

private def extractWords(content: Option[String]): Seq[String] =
 content.map(c =>
 c.replaceAll("'", "")
 .replaceAll("[.?!\\\\;]", " ")
 .toLowerCase
 .split(" ").toSeq
 ).getOrElse(Nil)
def byWord(a: Messages): Seq[(String, Int)] = {
 a.flatMap(m => extractWords(m.content))
 .groupBy(w => w)
 .map(e => e._1 -> e._2.length)
 .toSeq
 .sortWith(_._2 > _._2)
}

Repetitive Code

In ApplicationMain, these calls are repetitive:

 val words = byWord(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(words, "by_word.csv")

They can be generalized into a function that writes into CSV:

private def writeAsCsv[T](messages: Messages,
 analyticsFunction: Messages => Seq[(T, Int)],
 fileName: String): Unit = {
 val contents = analyticsFunction(messages).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(contents, fileName)
}

And the matcher sequence will be transformed into:

case Right(r) =>
 writeAsCsv(r, byWord, "by_word.csv")
 writeAsCsv(r, byDay, "by_day.csv")
 writeAsCsv(r, byWeekDay, "by_week_day.csv")
 writeAsCsv(r, byYearMonth, "by_year_month.csv")
 writeAsCsv(r, byHourOfDay, "by_hour_of_day.csv")
 // etc

and can even be transformed further into a loop.

`Either` vs `Try`

Finally, a really minor thing.

Although this is a valid usage of matching by Either, I'd remind about a simpler Try matcher:

def parseString2(json: String): Messages = ???
Try(parseString2(jsonString)) match {
 case Success(r) =>
 writeAsCsv(r, byWord, "by_word.csv")
 // etc
 case Failure(ex) => println(ex.getMessage)
}

It looks more suitable here: no specificity or benefits of Either/Left/Right is seen in this code.

Question 4

Hi, thank you very much for the valuable feedback! For the case class and Either vs Try, it's because how I parse the JSON with Circe. See github.com/truongio/messie/blob/…. Maybe there is an easier way. Isn't Either right biased in newer Scala, so Try is just an Either that has throwable on Left? I probably made a mistake on the time zone. It's hard to know. I got the messages from Facebook. My timezone is UTC+2.

Question 5

Apparently the timestamps were in UTC+2, same as my computer. I guess that makes sense considering it's personal data from Facebook and they know where you are from.

Question 6

For advice on Scala (of which I'm no expert) I think Antot's answer does a great job.

However I noticed that your byWord function removes all ' characters which might give results you did not intend, e.g. in the sentence

"The Home Owner's Best Friend is a magazine that gives practical advice for the ambitious home owner"

your function would count "owner" once although it is in the sentence twice. The other occurrence is counted as "owners". This also raises the question, if plurals are counted separately from their singular forms.

Language is complex even if you are only counting words. If this potential error matters you may want to look for a library to do the word counting ...

A quick fix would be to replace ' by space and not count single letters.

Antot Antot 3,7321 gold badge14 silver badges17 bronze badges · Accepted Answer · 2018-06-09 18:00:07Z

First of all, it should be underlined that the code in the question is quite neat, clean, readable and adequate to the task to solve.

However, I have several suggestions, but they are more a kind of polishing the existing code.

`Messages`

I think that case class Messages does not make sense. Its only function is to wrap the collection of Message items, which sometimes can make code more complex.

It neither makes sense from Circe's parsing point of view, because collections are supported out of the box and a direct usage of Seq[Message] should not cause problems.

If there are too many occurrences of Seq[Message] in the code, they can be aliased:

object MessageAnalyticService {
 type Messages = Seq[Message]
 def senderMsgCount(a: Messages, senderName: String): Int = {
 a.count(msg => msg.senderName == senderName)
 }
 def dayMaxMessages(a: Messages): (LocalDate, Int) = {
 a.groupBy(m => toLocalDate(m.timestamp))
 .map(t => t._1 -> t._2.length)
 .toSeq.maxBy(_._2)
 }
 ...
}

And there is an immediate benefit: each function has one line less now, without any decrease in readability!

Temporal Values

One should be careful when using temporal values with time zones. In MessageAnalyticService calls like

ZonedDateTime.ofInstant(ofEpochSecond(epoch), ZoneId.systemDefault())

It would be OK if the system default zone id is UTC.

Supposing (and that should be the case) that the input JSON contains timestamps in UTC in "timestamp" fields.

If the zone id of the host where the code is executed is different from UTC, the converted ZonedDateTime will contain time zone offset and thus the values returned from these private temporal defs will be relative to the zone of the host, but not to the original timestamp zone. This may bring surprising results, especially difficult to debug. So I'd suggest replacing ZoneId.systemDefault() with ZoneId.of("UTC").

Regular Expressions

def byWord contains weird regular expression args to .replaceAll calls: no need to use groups, | or too many escape chars:

.replaceAll("'", "")
.replaceAll("[.?!\\\\;]", " ")

(ugly escapes only for the ugly backslashes).

Strings Processing

The same def byWord might produce a very huge string:

a.messages
 .flatMap(m => m.content)
 .mkString(" ") // <-- here!
 .toLowerCase

This will be a single string representing a concatenation of all the content strings in all the messages. And it will be split by space anyway further in the chaining. I think that this sequence will be more efficient if the replacements were made on each content string, then split into separate words. This could look like:

private def extractWords(content: Option[String]): Seq[String] =
 content.map(c =>
 c.replaceAll("'", "")
 .replaceAll("[.?!\\\\;]", " ")
 .toLowerCase
 .split(" ").toSeq
 ).getOrElse(Nil)
def byWord(a: Messages): Seq[(String, Int)] = {
 a.flatMap(m => extractWords(m.content))
 .groupBy(w => w)
 .map(e => e._1 -> e._2.length)
 .toSeq
 .sortWith(_._2 > _._2)
}

Repetitive Code

In ApplicationMain, these calls are repetitive:

 val words = byWord(r).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(words, "by_word.csv")

They can be generalized into a function that writes into CSV:

private def writeAsCsv[T](messages: Messages,
 analyticsFunction: Messages => Seq[(T, Int)],
 fileName: String): Unit = {
 val contents = analyticsFunction(messages).map(p => s"${p._1}; ${p._2}").mkString("\n")
 CsvWriter.writeToFile(contents, fileName)
}

And the matcher sequence will be transformed into:

case Right(r) =>
 writeAsCsv(r, byWord, "by_word.csv")
 writeAsCsv(r, byDay, "by_day.csv")
 writeAsCsv(r, byWeekDay, "by_week_day.csv")
 writeAsCsv(r, byYearMonth, "by_year_month.csv")
 writeAsCsv(r, byHourOfDay, "by_hour_of_day.csv")
 // etc

and can even be transformed further into a loop.

`Either` vs `Try`

Finally, a really minor thing.

Although this is a valid usage of matching by Either, I'd remind about a simpler Try matcher:

def parseString2(json: String): Messages = ???
Try(parseString2(jsonString)) match {
 case Success(r) =>
 writeAsCsv(r, byWord, "by_word.csv")
 // etc
 case Failure(ex) => println(ex.getMessage)
}

It looks more suitable here: no specificity or benefits of Either/Left/Right is seen in this code.

Hi, thank you very much for the valuable feedback! For the case class and Either vs Try, it's because how I parse the JSON with Circe. See github.com/truongio/messie/blob/…. Maybe there is an easier way. Isn't Either right biased in newer Scala, so Try is just an Either that has throwable on Left? I probably made a mistake on the time zone. It's hard to know. I got the messages from Facebook. My timezone is UTC+2.
Apparently the timestamps were in UTC+2, same as my computer. I guess that makes sense considering it's personal data from Facebook and they know where you are from.

Stack Exchange Network

Program to calculate some stats from Facebook messages

2 Answers 2

`Messages`

Temporal Values

Regular Expressions

Strings Processing

Repetitive Code

`Either` vs `Try`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Program to calculate some stats from Facebook messages

2 Answers 2

Messages

Temporal Values

Regular Expressions

Strings Processing

Repetitive Code

Either vs Try

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`Messages`

`Either` vs `Try`