How to convert HTML to plain text with Jsoup (Scala and Java)

By Alvin Alexander. Last updated: April 22, 2024

If you ever need to convert HTML to plain text using Scala or Java, I hope these Jsoup examples are helpful:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
// on Scala 2.13+, scala.jdk.CollectionConverters._ replaces JavaConverters
import scala.collection.JavaConverters._

object JsoupHtmlToPlainTextTest extends App {

    val html =
        """
          |<html>
          |  <head><title>Hello, world</title></head>
          |  <body>
          |    <h1>Hello, world</h1>
          |    <p>Hello, world.</p>
          |    <p>This is a test.</p>
          |  </body>
          |</html>
        """.stripMargin

    // Example 1: this works, but all output is on one line
    val doc: Document = Jsoup.parse(html)
    //val s: String = doc.text()     //include <head> and <body> text
    val s: String = doc.body.text()  //<body> text only
    //println(s)

    // Example 2: this works, output is on multiple lines.
    // Note: JsoupFormatter is not part of the Jsoup library; it's a separate
    // helper class that isn't shown here, so this example needs it on the classpath.
    val formatter = new JsoupFormatter
    val plainText = formatter.getPlainText(doc)
    //println(plainText)

    // Example 3: this works as a way to select the <body> only
    val body: String = doc.select("body").first.text()
    //println(body)

    // Example 4: works: gets text from paragraphs only
    // https://jsoup.org/cookbook/input/parse-body-fragment
    val doc4 = Jsoup.parseBodyFragment(html)
    val body4 = doc4.body()
    val paragraphs = body4.getElementsByTag("p")
    // Elements is a java.util.List, so convert it before using it in a for loop
    val scalaParagraphs = paragraphs.asScala
    for (paragraph <- scalaParagraphs) {
        println(paragraph.text())
    }

}

While this is just some test code that I’m working on to understand Jsoup, it shows four different ways to convert the given HTML into plain text. Hopefully the comments explain how each HTML to plain text conversion works, so I won’t write more about them here. I just wanted to share this code snippet a) so I can find it again, and b) in hopes it might help others who need to convert HTML to text using Jsoup.
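
Because the multi-line output in Example 2 comes from a JsoupFormatter class that isn't part of the snippet, here is a minimal sketch of one other way to get one block of text per line using only Jsoup's own select and eachText calls. This is not the JsoupFormatter implementation, just an approximation under a few assumptions: the names BlockTextSketch and blockLines are made up for illustration, the list of tags in the selector is my own choice, and the import assumes Scala 2.13 or newer.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._   // assumes Scala 2.13+

object BlockTextSketch {

    // Hypothetical helper (not part of Jsoup): returns the text of each
    // heading, paragraph, and list item inside <body>, in document order.
    // Caveat: if a selected element is nested inside another selected
    // element (e.g. a <p> inside an <li>), its text will appear twice.
    def blockLines(html: String): List[String] = {
        val doc = Jsoup.parse(html)
        doc.body()
            .select("h1, h2, h3, h4, h5, h6, p, li")
            .eachText()     // java.util.List[String]; elements with no text are skipped
            .asScala
            .toList
    }

    def main(args: Array[String]): Unit = {
        val html =
            """
              |<html>
              |  <head><title>Hello, world</title></head>
              |  <body>
              |    <h1>Hello, world</h1>
              |    <p>Hello, world.</p>
              |    <p>This is a test.</p>
              |  </body>
              |</html>
            """.stripMargin

        // one line per block element
        println(blockLines(html).mkString("\n"))
    }
}
```

With the sample HTML this prints the heading and the two paragraphs on three separate lines. The title is left out because the selector only runs against the body. If you need finer control over whitespace, links, or line breaks, you would likely need to walk the node tree rather than rely on a selector like this.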
