I want to replace UTF-8 HTML entities in HTML sources with the real characters. I have an "entities" replacement table, which is traversed by the code below. When I run this code, CPU usage goes up to 100%.

Could you please help me rewrite the first loop in a better way? I understand that strings in Lua are immutable, so I think many copies of the data variable are being created, and this could be the reason.

local entities = {
  {["char"]="!", ["utf"]="&#33;"},
  {["char"]='"', ["utf"]="&#34;"},
  {["char"]="#", ["utf"]="&#35;"},
  {["char"]="$", ["utf"]="&#36;"},
  {["char"]="%", ["utf"]="&#37;"},
  {["char"]="&", ["utf"]="&#38;"},
  {["char"]="'", ["utf"]="&#39;"},
  -- +312 rows more
}
local function clear_text(data)
  for _, e in ipairs(entities) do
    data = string.gsub(data, e.utf, e.char)
  end
  return data
end
-- this is just for testing ... replacement in many html sources
for i=1,200 do
  local data = some_html_page_source()
  clear_text(data)
end
asked Jul 29, 2016 at 7:25

3 Answers

EDIT: misread question, so rewrote it with the same principle.

According to this answer, you can use str:gsub(pattern, function) to perform a custom replacement on all matches of pattern inside str.

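A pattern such as &#%d+; should match every numeric entity, calling the function once per match. (The simpler &#.+; is greedy: it would match everything from the first &# to the last ; in the string.)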

All that is left to do in the callback function is to find the matching human-readable char and return it as the replacement value. To this end, it would be faster if entities were keyed by the utf strings, with their respective char as the value, so you don't have to iterate over entities for every match.
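The callback approach described above could look roughly like the following. This is a minimal sketch, assuming the entities are decimal numeric references such as &#33;. The table name entity_to_char is hypothetical and lists only a few sample rows.

```lua
-- Hypothetical lookup table keyed by the entity string (sample rows only).
local entity_to_char = {
  ["&#33;"] = "!",
  ["&#34;"] = '"',
  ["&#35;"] = "#",
  ["&#36;"] = "$",
  ["&#37;"] = "%",
  ["&#38;"] = "&",
  ["&#39;"] = "'",
}

local function clear_text(data)
  -- One pass over the string; the callback runs once per matched entity.
  -- Returning nil keeps the original match, so unknown entities survive.
  -- The parentheses drop gsub's second result (the match count).
  return (data:gsub("&#%d+;", function(entity)
    return entity_to_char[entity]
  end))
end

print(clear_text("a &#33; b &#38; c &#999;"))
```

Compared with the loop in the question, this scans each page once instead of once per table row, and a callback's return value is used literally, so a replacement such as "%" needs no escaping.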

Another edit: according to the Lua documentation on gsub, the replacement argument (the third parameter of string.gsub, or the second when called as str:gsub) can also be a table. In that case the lookup is done automatically: each match is used as a key and is replaced with the corresponding value from the table. That would be the cleanest solution once you restructure entities.
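As a rough sketch of the table form, assuming the question's list-shaped entities table holds decimal numeric references such as &#33;, the list can be converted once into a keyed table and passed straight to gsub. The name lookup is hypothetical, and only a few sample rows are shown.

```lua
-- A few sample rows in the shape used in the question.
local entities = {
  {["char"] = "!", ["utf"] = "&#33;"},
  {["char"] = '"', ["utf"] = "&#34;"},
  {["char"] = "%", ["utf"] = "&#37;"},
  {["char"] = "&", ["utf"] = "&#38;"},
}

-- Build the keyed table once, outside the per-page loop.
local lookup = {}
for _, e in ipairs(entities) do
  lookup[e.utf] = e.char
end

local function clear_text(data)
  -- gsub uses each whole match as a key into lookup; a missing key
  -- leaves the match unchanged. Parentheses drop the match count.
  return (data:gsub("&#%d+;", lookup))
end

print(clear_text("&#34;x&#34; is 100&#37; done"))
```

Building lookup once keeps the per-page cost at a single gsub call, regardless of how many rows the table has.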

answered Jul 29, 2016 at 7:35
3 Comments

But that's backwards. You want the utf value to be the key. And those take multiple characters.
Indeed. I completely misread the question. This does make it trickier, but you could still use the same principle.
Give me a moment to read up on lua's pattern matching, since it is not standard regex.
0
-- Lua 5.3 required (uses utf8.char)
local html_entities = {
  nbsp = " ",
  lt = "<",
  gt = ">",
  amp = "&",
  euro = "€",
  copy = "©",
  Gamma = "Γ",
  Delta = "Δ",
  prod = "∏",
  sum = "∑",
  forall = "∀",
  exist = "∃",
  empty = "∅",
  nabla = "∇",
  isin = "∈",
  notin = "∉",
  -- + many more rows
}
local str = [[&exist; &euro; &empty; &Delta; &#8364; &#x20AC;]]
-- Captures: an optional "#" prefix and the entity body up to the next ";"
str = str:gsub("&(#?)(.-);",
  function(prefix, name)
    if prefix ~= "" then
      -- Numeric entity: "0"..name gives "08364" (decimal) or "0x20AC" (hex)
      return utf8.char(tonumber("0"..name))
    else
      -- Named entity: a nil result leaves the original text unchanged
      return html_entities[name]
    end
  end
)
print(str)
answered Jul 29, 2016 at 7:55

There's another way of replacing the sequence of characters.

local function clear_text(data)
  return (string.gsub(
    data,
    [=[[!"#$%%&']]=], -- all your entries go here, between [=[ and ]=]; % must be written as %% inside a set
    function(c)
      return "&#" .. string.byte(c) .. ";" -- replace with char code
    end
  ))
end
-- this is just for testing ... replacement in many html sources
for i=1,200 do
  local data = "!#!#!#!#!#!";
  print(clear_text(data))
end
answered Jul 30, 2016 at 14:41
