I have a web application, as part of it I regularly have to send out "overviews" of records in a database, which contain shortened excerpts of a description-text of that record.
What the code is supposed to do:
- Take an HTML string
- Remove the HTML brackets so that only text remains
- cut it down to
SHORT_DESCRIPTION_WORD_COUNT
words - if there are more than
SHORT_DESCRIPTION_WORD_COUNT
words, add "..." at the end
My current implementation looks like this:
import std/[strutils, strformat, htmlparser, xmltree, enumerate]
const SHORT_DESCRIPTION_WORD_COUNT: int = 40
proc truncate*(text: string): string =
let cleanedString = text.parseHtml().innerText
let splitString = cleanedString.split(" ")
if splitString.len() <= SHORT_DESCRIPTION_WORD_COUNT:
result = cleanedString
else:
result.add(splitString[0..SHORT_DESCRIPTION_WORD_COUNT-1].join(" "))
result.add("...")
Looking at some valgrind data, I determined that this proc takes up a sizeable chunk of time and would benefit from being speed-performance optimized a bit.
I'm mostly wondering how to do that though. I believe the intensive steps here are the cleaning up of the HTML and then splitting the string, I'm not sure how to avoid those.
1 Answer 1
After consulting with Yardanico in nim's discord for a bit, there is 1 significant improvement that can be made here and a couple minor ones:
Main Improvement: Use the split iterator
I use the split proc and assign the seq of strings into a variable to iterate over. That is wasteful, when you can use the split iterator instead.
Since you still need to track which word you're at to know when you hit SHORT_DESCRIPTION_WORD_COUNT
, you can use std/enumerate
, which is sugar over writing var i=0
and manually incrementing that within the loop.
proc truncate_breakfor_enumerate*(text: string): string =
let cleanedString = text.parseHtml().innerText
for i, str in enumerate(cleanedString.split(" ")):
result.add(" " & str)
if i >= SHORT_DESCRIPTION_WORD_COUNT:
result.add("...")
break
Minor Improvement: Pre-allocate the string
This matters mostly for scenarios in which you add longer strings. Having the string pre-allocated to roughly the correct size can reduce the number of memory allocations needed overall. That is, because if the string outgrows its size as you add substrings to it, it will need to be copied elsewhere. You can do this with newStringOfCap
.
This will only even start to matter if you start having a fairly sizeable amount of text (>5000 chars) and even then the improvements are somewhat marginal.
proc truncate_breakformemoryalloc*(text: string): string =
let cleanedString = text.parseHtml().innerText
result = newStringOfCap(400) # <-- The important bit
for i, str in enumerate(cleanedString.split(" ")):
result.add(" " & str)
if i >= SHORT_DESCRIPTION_WORD_COUNT:
result.add("...")
break
Running these over a benchmark with one of normal text and another more artificial one with regularly words above 50 characters, these are my results:
name ......................................... min time avg time std dv runs
truncate classic ............................. 0.046 ms 0.053 ms ±0.004 x1000
truncate_breakfor_enumerate................... 0.021 ms 0.023 ms ±0.001 x1000
truncate_breakfor_enumerate_memoryalloc ...... 0.021 ms 0.025 ms ±0.004 x1000
truncate classic long ........................ 0.050 ms 0.082 ms ±0.019 x1000
truncate_breakfor_enumerate long ............. 0.036 ms 0.040 ms ±0.007 x1000
truncate_breakfor_enumerate_memoryalloc long ..0.035 ms 0.038 ms ±0.004 x1000
Both of these versions should do perfectly fine for any purposes.