Re: Stripping HTML tags

Florian Berger wrote:
> ...
> Chris Marrin wrote:
>> You could just use luaexpat and then extract out what you need. This
>> is especially easy with the Lua Object Model feature, which simply
>> returns the HTML as a hierarchy of tables. Expat is very good at
>> grokking all the twisty bits of HTML, so this could help get past all
>> that...
>
> How well does LuaExpat work if HTML is not clean or valid?

My experience with expat (NOT used with LuaExpat) is that it makes a valiant effort to deal with a few things, but for the most part invalid HTML generates an error and aborts the parse. I think there is a way to get expat to continue if there is a validity error. For instance, I think you can get it to handle the case where an EndElement has the wrong name. But for the most part invalid HTML, like invalid C, is hard to fix and make any sense of. And I don't know how easy it would be to get LuaExpat to be tolerant of errors.
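
To make the "generates an error and aborts" behaviour concrete, here is a minimal sketch. It uses Python's standard `xml.parsers.expat` binding rather than LuaExpat, on the assumption that the underlying expat library behaves the same way under either wrapper. The helper name `parse_report` and the two sample strings are made up for illustration.

```python
import xml.parsers.expat


def parse_report(markup):
    """Feed markup to expat; return (ok, events, error_message)."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Record element boundaries so we can see how far the parser got.
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    try:
        parser.Parse(markup, True)  # True: this is the final chunk of input
    except xml.parsers.expat.ExpatError as err:
        # expat stops at the first well-formedness error it meets.
        return False, events, xml.parsers.expat.ErrorString(err.code)
    return True, events, None


# Hypothetical inputs: well-formed markup, and typical "tag soup"
# where <br> is never closed.
clean = "<html><body><p>Hello</p></body></html>"
sloppy = "<html><body><p>Hello<br></body></html>"

for label, markup in (("clean", clean), ("sloppy", sloppy)):
    ok, events, message = parse_report(markup)
    print(label, "parsed" if ok else "failed: " + message, len(events), "events")
```

The handlers still fire for everything before the bad tag, so a caller can keep the partial results, but this sketch does not try to resume parsing after the error.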
My general rule is "always use valid HTML" :-)
--
chris marrin
chris@marrin.com
"As a general rule, don't solve puzzles
that open portals to Hell"
