I wrote this partly to help read (and edit/summarize) my own work, but mainly to get through the mounds of papers I was asked to read as a graduate student.
This is a summarization algorithm based on what I learned in first or second grade. I tried to simply get the "main ideas" of each paragraph by taking its first sentence; I have since updated the program to allow for the first two sentences. I have also modified the program so that labels like Theorem/Definition/Corollary are not treated as sentences on their own. I am not sure how well it parses text copied from a PDF compared to, for instance, flat text.
Now that I'm actually sharing this, I'm thinking about things the summarizer should ignore, such as comments in LaTeX and math set between \[ and \].
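A rough sketch of that pre-processing step (the helper name stripLatex and both regexes are my own guesses, not something the program currently does): it drops unescaped % comments and \[ ... \] display math before the text reaches the summarizer.

```javascript
// Hypothetical pre-processing for LaTeX sources (not part of the program yet):
// strip \[ ... \] display math and unescaped % comments before summarizing.
function stripLatex(text) {
  return text
    .replace(/\\\[[\s\S]*?\\\]/g, ' ')  // remove display-math blocks, even multi-line ones
    .replace(/(^|[^\\])%.*$/gm, '$1');  // remove % comments, but keep escaped \%
}
```

Note that a regex approach like this is only an approximation; % inside verbatim environments, for example, would still be treated as a comment.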
<pre>
<!DOCTYPE html>
<html>
<head>
<title>Simple Text Summarizer with PDF Support</title>
<meta charset="UTF-8">
<meta name="description" content="A simple web-based summarizer that extracts the first sentence from each paragraph. Includes special handling for text copied from PDF files.">
<meta name="keywords" content="text summarizer, PDF summarizer, paragraph summarizer, first sentence summary, JavaScript tool, text analysis">
<meta name="author" content="Your Name">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<p id="intro">
This is a naive way of summarizing large blocks of text. It is based on the elementary school idea that the main idea of a text can be found in the first sentence of the first paragraph. So given a long text, it will output the first sentence of each paragraph as the summary.
</p>
<textarea name="indata" id="indata" rows="20" cols="50"></textarea><br>
<label>
<input type="checkbox" id="isPDF"> PDF-style input (lines joined into paragraphs)
</label><br>
<label for="sentenceCount">Sentences per paragraph:</label>
<select id="sentenceCount">
<option value="1">1</option>
<option value="2" selected>2</option>
</select><br>
<input type="button" onclick="summarize(document.getElementById('indata').value)" value="Summarize"><br><br>
<p id="outdata"></p>
</body>
</html>
</pre>
function summarize(text) {
const isPDF = document.getElementById("isPDF").checked;
const sentenceCount = parseInt(document.getElementById("sentenceCount").value);
let paras;
if (isPDF) {
const lines = text.split('\n');
paras = [];
let paragraph = '';
for (let i = 0; i < lines.length; i++) {
if (lines[i].trim() === '') {
if (paragraph.trim().length > 0) {
paras.push(paragraph.trim());
paragraph = '';
}
} else {
paragraph += lines[i].trim() + ' ';
}
}
if (paragraph.trim().length > 0) {
paras.push(paragraph.trim());
}
} else {
paras = text.split('\n');
}
const termRegex = /\b(lemma|theorem|conjecture|proposition|remark)\s*\d*\.?$/i;
let out = `<b>Summary (${sentenceCount} sentence${sentenceCount > 1 ? "s" : ""} per paragraph)</b><br><br>`;
for (let i = 0; i < paras.length; i++) {
if (paras[i].length > 1) {
let rawSentences = paras[i].split(/(?<=\.)\s+/);
let sentences = [];
for (let j = 0; j < rawSentences.length; j++) {
let current = rawSentences[j].trim();
if (termRegex.test(current) && j + 1 < rawSentences.length) {
// Merge with next sentence if it ends in formal term
current += ' ' + rawSentences[j + 1].trim();
j++; // Skip the next sentence
}
sentences.push(current);
}
let snippet = sentences.slice(0, sentenceCount).join(' ');
out += snippet + "<br><br>";
}
}
document.getElementById("outdata").innerHTML = out;
}
body {
font-family: sans-serif;
margin: 20px;
}
textarea {
width: 80%;
height: 150px;
margin-bottom: 10px;
padding: 10px;
}
input[type="button"] {
padding: 10px 20px;
background-color: #007bff;
color: white;
border: none;
cursor: pointer;
}
#outdata {
margin-top: 20px;
border: 1px solid #ccc;
padding: 10px;
}
1 Answer
Best tool for the job
First of all: use the best tool for the job. Running this in a browser may not be the best idea if the files can grow pretty large. It is much better to stream the files from the local filesystem; if you want, you can still use e.g. Node.js. From what I've seen of your code, it is an embarrassingly linear process, so streaming should be fine.
Then you can just read in line by line (with a different line extractor for each file type, if you haven't "normalized" it to plaintext), check if the sentences suit your requirements, and then add it to your summary file.
I'll however assume you've already got the text loaded.
The copy to memory issue
Lines
const lines = text.split('\n');
This makes me somewhat angry, as this will copy the entire file into memory. It is already bad enough that text contains everything, but there is definitely no need to copy the entire text to go from line to line.
For instance you could do:
function* linesFromText(text) {
const newlinePattern = /([^\r\n]*)(?:\r?\n|$)/gy;
let match;
while ((match = newlinePattern.exec(text))
&& (match[0].length || newlinePattern.lastIndex < text.length)) {
yield match[1];
}
}
// usage
for (const line of linesFromText(text)) {
// process each line
}
However, after that you still put all the text from the PDF in paragraphs, so that won't work.
I'd recommend another way of doing this, defining an interface which you can then implement for each content type:
class ParagraphSentenceRetriever {
/** @param {string[]} out */ // mutates `out`
async retrieveSentencesFromNextParagraph(out) {
throw new Error('abstract');
}
}
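For illustration, a hypothetical plain-text implementation of that interface could look like the following (the base class is repeated so the sketch is self-contained; the sentence regex is the one from the original code, and a PDF variant would differ only in how it finds paragraph boundaries):

```javascript
class ParagraphSentenceRetriever {
  /** @param {string[]} out */ // mutates `out`
  async retrieveSentencesFromNextParagraph(out) {
    throw new Error('abstract');
  }
}

// Hypothetical plain-text implementation of the interface.
class PlainTextRetriever extends ParagraphSentenceRetriever {
  constructor(text) {
    super();
    // paragraphs are separated by one or more blank lines
    this.paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
    this.index = 0;
  }
  /** @returns {Promise<boolean>} false when no paragraphs remain */
  async retrieveSentencesFromNextParagraph(out) {
    if (this.index >= this.paragraphs.length) return false;
    out.push(...this.paragraphs[this.index++].trim().split(/(?<=\.)\s+/));
    return true;
  }
}
```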
Sentences
Of course, just as with the lines, I'd prefer searching rather than splitting over a potentially large file. You first find the location of the line, then start looking for sentences within that line, using indices into the text. Only once you want to create your summary does it make sense to copy.
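A minimal sketch of that index-based idea (the function name is mine, and the sentence rule is simplified to a dot followed by whitespace or end of text): it records [start, end) offsets of the leading sentences of a paragraph, so only the selected sentences are ever copied out, via slice, at the end.

```javascript
// Return offsets of up to `count` sentences within text[paragraphStart, paragraphEnd).
function firstSentenceOffsets(text, paragraphStart, paragraphEnd, count) {
  const sentenceEnd = /\.(?=\s|$)/g;   // a sentence ends at '.' before whitespace/end
  sentenceEnd.lastIndex = paragraphStart;
  const offsets = [];
  let start = paragraphStart;
  let match;
  while (offsets.length < count
      && (match = sentenceEnd.exec(text))
      && match.index < paragraphEnd) {
    offsets.push([start, match.index + 1]); // include the closing dot
    start = match.index + 1;
  }
  return offsets;
}
```

Only when building the summary would you call text.slice(start, end) on the chosen offsets.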
Smaller remarks
const sentenceCount = parseInt(document.getElementById("sentenceCount").value);
Other developers - including your future self - don't know the purpose of your application, so "sentenceCount" immediately puts me on the wrong foot. The reader has to know that this is the requested number of sentences per paragraph; a name like requestedSentencesPerParagraph would carry that by itself.
let rawSentences = paras[i].split(/(?<=\.)\s+/);
I'd always prefix my regular expressions with a comment that tells what they do; as a reader you're otherwise only left with how they do it, e.g.
// splits sentences by a dot followed by at least one whitespace character.
Even though the "term" regex is somewhat more intuitive, I'd still comment it, as it may get more complex.
const termRegex = /\b(lemma|theorem|conjecture|proposition|remark)\s*\d*\.?$/i;
if (paras[i].length > 1) {
Always avoid getting too many scopes in your functions. Here you could simply have
if (paras[i].length <= 1) {
continue;
}
Of course, if you'd search for lines / paragraphs and sentences this issue would probably not pop up.
let snippet = sentences.slice(0, sentenceCount).join(' ');
Let's not. This creates another copy during slice. In this case a simple loop would be sufficient. If you'd use the method previously indicated, you could just build the snippet after searching and finding the lines, of course.
let snippet = '';
for (let i = 0; i < sentenceCount && i < sentences.length; i++) {
if (i > 0) snippet += ' ';
snippet += sentences[i];
}
const isPDF = document.getElementById("isPDF").checked;
Don't mix UI and functionality. You may want to be able to test your functionality without having to go through the UI.
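As a sketch of that separation (simplified: it skips the Theorem/Lemma merging, and the function names are mine), the pure function holds the logic while a thin wrapper reads from and writes to the DOM; the pure part can then be unit-tested in Node without a browser:

```javascript
// Pure logic: no DOM access, easy to test.
function summarizeText(text, { pdfStyle = false, sentencesPerParagraph = 2 } = {}) {
  const paragraphs = pdfStyle
    ? text.split(/\n\s*\n/).map(p => p.replace(/\s*\n\s*/g, ' ').trim())
    : text.split('\n');
  return paragraphs
    .filter(p => p.trim().length > 1)
    .map(p => p.split(/(?<=\.)\s+/).slice(0, sentencesPerParagraph).join(' '));
}

// UI wrapper: the only place that touches the document.
function summarizeFromUI() {
  const summary = summarizeText(document.getElementById('indata').value, {
    pdfStyle: document.getElementById('isPDF').checked,
    sentencesPerParagraph: parseInt(document.getElementById('sentenceCount').value, 10),
  });
  document.getElementById('outdata').innerHTML = summary.join('<br><br>');
}
```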
- "const lines = text.split('\n'); will copy the entire file into memory." - uh, no. The text string already is in memory. Splitting it into slices is hardly a problem. Your iterator version won't perform much better if the entire text is processed anyway. (Bergi, Aug 13 at 12:03)
- @Bergi That's an engine optimization; it's not in the contract of the string or the method. Safari seems to have (had?) some issues with substring handling. Copying it into paragraphs later on certainly will create a copy though, as it also changes the layout by inserting spaces etc. But OK, yeah, the engines in major browsers do keep references & indices into the original string. (Maarten Bodewes, Aug 13 at 16:04)
- "For instance you could do:" So you've substituted exec() for split()? No change. You can just use iterator helpers and operate on the original string. (guest271314, Aug 31 at 18:35)
- <html> inside <pre>? That's a similar "mistake" as in your previous question: codereview.stackexchange.com/q/297868/36647
- ... pandoc, which can take about any format and return plain text (or about anything else).