I needed a super simple parser to get HTML attributes and their values. I didn't want to load a big library or anything, so I made this.
I realize I am making assumptions here, mainly:
- Attribute values will be surrounded with either single or double quotes.
- The input string will end with > or />.
Besides those assumptions, are there any other glaring issues here?
    Dictionary<string, string> HTMLAttributes(string tag)
    {
        Dictionary<string, string> attr = new Dictionary<string, string>();
        MatchCollection matches =
            Regex.Matches(tag, @"([^\t\n\f \/>""'=]+)(?:=)('.*?'|"".*?"")(?:\s|\/>|\>)");
        foreach (Match match in matches)
        {
            attr.Add(match.Groups[1].Value,
                match.Groups[2].Value.Substring(1, match.Groups[2].Value.Length - 2)
            );
        }
        return attr;
    }
Running:

    HTMLAttributes("<body class=\" something \" hello='world' />");

Returns:

    {
        class: " something ",
        hello: "world"
    }
asked Apr 22, 2014 at 15:27
- "I didn't want to load a big library" Are you sure that's more work than an unreadable line-long regex that you're not sure actually works correctly? (svick, Apr 22, 2014 at 15:51)
- My concern wasn't my work load but my web app having to load a big library for 1 simple operation. (iambriansreed, Apr 22, 2014 at 16:00)
- It's not if that function is something that's actually quite complicated like HTML parsing. You can't parse HTML with regex. (svick, Apr 22, 2014 at 16:09)
- If that's such a big concern you should probably be looking at something like C++ instead of any of the .NET languages. (user33306, Apr 22, 2014 at 16:11)
- Maintaining libraries is often more difficult than maintaining a small piece of code. When the platform changes, and the library has not been updated, you are stuck. The argument can go the other way as well. When parsing requirements change, one can simply update the library and be done. In my experience, the problems happen more than the benefits, so I agree with OP in this case, but for maintenance reasons, not performance reasons. (rocketsarefast, Apr 27, 2015 at 14:14)
1 Answer
- Whitespace around = is allowed; your regex won't handle that.
- Characters in HTML can be encoded, and some characters (like &) have to be encoded. For example, name="AT&amp;T" should return that the value of name is AT&T.
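
The two points above could be addressed along these lines. This is only a minimal sketch, not a full HTML parser: the class name `HtmlAttributeSketch` and method name `Parse` are made up for illustration, it still assumes every value is quoted, and it keeps the first occurrence of a repeated attribute rather than throwing as `Dictionary.Add` would. Entity decoding relies on `System.Net.WebUtility.HtmlDecode`, which is part of the framework, so no extra library is loaded.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlAttributeSketch
{
    // Attribute name, optional whitespace around '=', then a single- or
    // double-quoted value. Unquoted values are deliberately not handled.
    static readonly Regex AttributePattern = new Regex(
        @"([^\t\n\f />""'=]+)\s*=\s*(?:'([^']*)'|""([^""]*)"")");

    public static Dictionary<string, string> Parse(string tag)
    {
        // HTML attribute names are case-insensitive, so compare them that way.
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in AttributePattern.Matches(tag))
        {
            string name = match.Groups[1].Value;

            // Group 2 holds a single-quoted value, group 3 a double-quoted one;
            // only one of them takes part in any given match.
            string raw = match.Groups[2].Success
                ? match.Groups[2].Value
                : match.Groups[3].Value;

            // Assumption: the first occurrence of a duplicated attribute wins.
            if (!attributes.ContainsKey(name))
            {
                // Decode character references such as &amp; into plain text.
                attributes.Add(name, WebUtility.HtmlDecode(raw));
            }
        }

        return attributes;
    }
}
```

Called as `HtmlAttributeSketch.Parse("<body class = \" something \" name=\"AT&amp;T\" />")`, this should yield `class` mapped to `" something "` and `name` mapped to `AT&T`. Capturing the value without its quotes also removes the need for the `Substring` call in the original.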
answered Apr 22, 2014 at 16:00