How to speed up truncating text to specific word count in nim?

Question 1

I have a web application, as part of it I regularly have to send out "overviews" of records in a database, which contain shortened excerpts of a description-text of that record.

What the code is supposed to do:

Take an HTML string
Remove the HTML brackets so that only text remains
cut it down to SHORT_DESCRIPTION_WORD_COUNT words
if there are more than SHORT_DESCRIPTION_WORD_COUNT words, add "..." at the end

My current implementation looks like this:

import std/[strutils, strformat, htmlparser, xmltree, enumerate]
const SHORT_DESCRIPTION_WORD_COUNT: int = 40
proc truncate*(text: string): string =
 let cleanedString = text.parseHtml().innerText
 let splitString = cleanedString.split(" ")
 
 if splitString.len() <= SHORT_DESCRIPTION_WORD_COUNT:
 result = cleanedString
 else:
 result.add(splitString[0..SHORT_DESCRIPTION_WORD_COUNT-1].join(" "))
 result.add("...")

Looking at some valgrind data, I determined that this proc takes up a sizeable chunk of time and would benefit from being speed-performance optimized a bit.

I'm mostly wondering how to do that though. I believe the intensive steps here are the cleaning up of the HTML and then splitting the string, I'm not sure how to avoid those.

Question 2

After consulting with Yardanico in nim's discord for a bit, there is 1 significant improvement that can be made here and a couple minor ones:

Main Improvement: Use the split iterator

I use the split proc and assign the seq of strings into a variable to iterate over. That is wasteful, when you can use the split iterator instead.

Since you still need to track which word you're at to know when you hit SHORT_DESCRIPTION_WORD_COUNT, you can use std/enumerate, which is sugar over writing var i=0 and manually incrementing that within the loop.

proc truncate_breakfor_enumerate*(text: string): string =
 let cleanedString = text.parseHtml().innerText
 for i, str in enumerate(cleanedString.split(" ")):
 result.add(" " & str)
 if i >= SHORT_DESCRIPTION_WORD_COUNT:
 result.add("...")
 break

Minor Improvement: Pre-allocate the string

This matters mostly for scenarios in which you add longer strings. Having the string pre-allocated to roughly the correct size can reduce the number of memory allocations needed overall. That is, because if the string outgrows its size as you add substrings to it, it will need to be copied elsewhere. You can do this with newStringOfCap.

This will only even start to matter if you start having a fairly sizeable amount of text (>5000 chars) and even then the improvements are somewhat marginal.

proc truncate_breakformemoryalloc*(text: string): string =
 let cleanedString = text.parseHtml().innerText
 result = newStringOfCap(400) # <-- The important bit
 for i, str in enumerate(cleanedString.split(" ")):
 result.add(" " & str)
 if i >= SHORT_DESCRIPTION_WORD_COUNT:
 result.add("...")
 break

Running these over a benchmark with one of normal text and another more artificial one with regularly words above 50 characters, these are my results:

name ......................................... min time avg time std dv runs
truncate classic ............................. 0.046 ms 0.053 ms ±0.004 x1000
truncate_breakfor_enumerate................... 0.021 ms 0.023 ms ±0.001 x1000
truncate_breakfor_enumerate_memoryalloc ...... 0.021 ms 0.025 ms ±0.004 x1000
truncate classic long ........................ 0.050 ms 0.082 ms ±0.019 x1000
truncate_breakfor_enumerate long ............. 0.036 ms 0.040 ms ±0.007 x1000
truncate_breakfor_enumerate_memoryalloc long ..0.035 ms 0.038 ms ±0.004 x1000

Both of these versions should do perfectly fine for any purposes.

Philipp Doerner Philipp Doerner 5455 silver badges19 bronze badges · Answer 1 · 2022-07-02 15:23:53Z

After consulting with Yardanico in nim's discord for a bit, there is 1 significant improvement that can be made here and a couple minor ones:

Main Improvement: Use the split iterator

I use the split proc and assign the seq of strings into a variable to iterate over. That is wasteful, when you can use the split iterator instead.

Since you still need to track which word you're at to know when you hit SHORT_DESCRIPTION_WORD_COUNT, you can use std/enumerate, which is sugar over writing var i=0 and manually incrementing that within the loop.

proc truncate_breakfor_enumerate*(text: string): string =
 let cleanedString = text.parseHtml().innerText
 for i, str in enumerate(cleanedString.split(" ")):
 result.add(" " & str)
 if i >= SHORT_DESCRIPTION_WORD_COUNT:
 result.add("...")
 break

Minor Improvement: Pre-allocate the string

This matters mostly for scenarios in which you add longer strings. Having the string pre-allocated to roughly the correct size can reduce the number of memory allocations needed overall. That is, because if the string outgrows its size as you add substrings to it, it will need to be copied elsewhere. You can do this with newStringOfCap.

This will only even start to matter if you start having a fairly sizeable amount of text (>5000 chars) and even then the improvements are somewhat marginal.

proc truncate_breakformemoryalloc*(text: string): string =
 let cleanedString = text.parseHtml().innerText
 result = newStringOfCap(400) # <-- The important bit
 for i, str in enumerate(cleanedString.split(" ")):
 result.add(" " & str)
 if i >= SHORT_DESCRIPTION_WORD_COUNT:
 result.add("...")
 break

Running these over a benchmark with one of normal text and another more artificial one with regularly words above 50 characters, these are my results:

name ......................................... min time avg time std dv runs
truncate classic ............................. 0.046 ms 0.053 ms ±0.004 x1000
truncate_breakfor_enumerate................... 0.021 ms 0.023 ms ±0.001 x1000
truncate_breakfor_enumerate_memoryalloc ...... 0.021 ms 0.025 ms ±0.004 x1000
truncate classic long ........................ 0.050 ms 0.082 ms ±0.019 x1000
truncate_breakfor_enumerate long ............. 0.036 ms 0.040 ms ±0.007 x1000
truncate_breakfor_enumerate_memoryalloc long ..0.035 ms 0.038 ms ±0.004 x1000

Both of these versions should do perfectly fine for any purposes.

Stack Exchange Network

How to speed up truncating text to specific word count in nim?

1 Answer 1

Main Improvement: Use the split iterator

Minor Improvement: Pre-allocate the string

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to speed up truncating text to specific word count in nim?

1 Answer 1

Main Improvement: Use the split iterator

Minor Improvement: Pre-allocate the string

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions