\$\begingroup\$

I needed a very simple parser to extract HTML attributes and their values from a single tag. I didn't want to load a big library for this, so I wrote the function below.

I realize I am making assumptions here, mainly:

  • Attribute values will be surrounded with either single or double quotes.
  • The input string will end with > or />.

Besides those assumptions, are there any other glaring issues here?

Dictionary<string, string> HTMLAttributes(string tag)
{
    Dictionary<string, string> attr = new Dictionary<string, string>();
    MatchCollection matches =
        Regex.Matches(tag, @"([^\t\n\f \/>""'=]+)(?:=)('.*?'|"".*?"")(?:\s|\/>|\>)");
    foreach (Match match in matches)
    {
        attr.Add(match.Groups[1].Value,
            match.Groups[2].Value.Substring(1, match.Groups[2].Value.Length - 2)
        );
    }
    return attr;
}

Running:

HTMLAttributes("<body class=\" something \" hello='world' />");

Returns:

{
 class: " something ",
 hello: "world"
}

Sample:

http://ideone.com/ZBaBhB

unor
asked Apr 22, 2014 at 15:27
\$\endgroup\$
  • "I didn't want to load a big library" Are you sure that's more work than an unreadable line-long regex that you're not sure actually works correctly? Commented Apr 22, 2014 at 15:51
  • My concern wasn't my work load but my web app having to load a big library for 1 simple operation. Commented Apr 22, 2014 at 16:00
  • It's not if that function is something that's actually quite complicated like HTML parsing. You can't parse HTML with regex. Commented Apr 22, 2014 at 16:09
  • If that's such a big concern you should probably be looking at something like c++ instead of any of the .net languages. Commented Apr 22, 2014 at 16:11
  • Maintaining libraries is often more difficult than maintaining a small piece of code. When the platform changes, and the library has not been updated, you are stuck. The argument can go the other way as well. When parsing requirements change, one can simply update the library and be done. In my experience, the problems happen more than the benefits, so I agree with OP in this case, but for maintenance reasons, not performance reasons. Commented Apr 27, 2015 at 14:14

1 Answer

\$\begingroup\$
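  1. Whitespace around = is allowed in HTML, but your regex doesn't handle it. For example, class = "x" will not match.
  2. Attribute values can contain character references, and some characters (like &) have to be encoded. For example, name="AT&amp;T" should yield AT&T as the value of name, but your function returns the raw text AT&amp;T.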
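
Both points above could be addressed along the following lines. This is only a minimal sketch, not a full HTML parser: the class name HtmlAttributeSketch, the method name Parse, and the simplified pattern are assumptions made for illustration and are not part of the question's code. It still assumes quoted values, as the question does, and it drops the trailing whitespace or > check from the original pattern. WebUtility.HtmlDecode from System.Net is used to decode character references.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlAttributeSketch
{
    // Hypothetical variant of the pattern from the question.
    // \s* on both sides of '=' tolerates optional whitespace.
    // A quoted value runs up to the next matching quote character.
    static readonly Regex AttributePattern = new Regex(
        @"([^\s/>""'=]+)\s*=\s*(?:'([^']*)'|""([^""]*)"")");

    public static Dictionary<string, string> Parse(string tag)
    {
        var attributes = new Dictionary<string, string>();
        foreach (Match match in AttributePattern.Matches(tag))
        {
            // Group 2 holds a single-quoted value, group 3 a double-quoted one;
            // only one of them participates in any given match.
            string raw = match.Groups[2].Success
                ? match.Groups[2].Value
                : match.Groups[3].Value;

            // Decode character references such as &amp; into &.
            // Like the original, Add throws if an attribute name repeats.
            attributes.Add(match.Groups[1].Value, WebUtility.HtmlDecode(raw));
        }
        return attributes;
    }
}
```

With this sketch, HtmlAttributeSketch.Parse("<a name = \"AT&amp;T\" hello='world'>") gives name the value AT&T and hello the value world. Capturing the value without its quotes also removes the need for the Substring call.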
answered Apr 22, 2014 at 16:00
\$\endgroup\$
