Is this a safe way to parse out HTML tag attributes?

Asked 11 years, 5 months ago


\$\begingroup\$

I needed a very simple parser to get HTML attributes and their values. I didn't want to load a big library just for this, so I wrote the function below.

I realize I am making assumptions here, mainly:

Attribute values will be surrounded with either single or double quotes.
The input string will end with > or />.

Besides those assumptions, are there any other glaring issues here?

Dictionary<string, string> HTMLAttributes(string tag)
{
    Dictionary<string, string> attr = new Dictionary<string, string>();
    MatchCollection matches =
        Regex.Matches(tag, @"([^\t\n\f \/>""'=]+)(?:=)('.*?'|"".*?"")(?:\s|\/>|\>)");
    foreach (Match match in matches)
    {
        attr.Add(match.Groups[1].Value,
            match.Groups[2].Value.Substring(1, match.Groups[2].Value.Length - 2)
        );
    }
    return attr;
}

Running:

HTMLAttributes("<body class=\" something \" hello='world' />");

Returns:

{
 class: " something ",
 hello: "world"
}

Sample:

http://ideone.com/ZBaBhB

edited Apr 24, 2014 at 11:35 by unor

asked Apr 22, 2014 at 15:27 by iambriansreed

\$\endgroup\$

Comments:

"I didn't want to load a big library." Are you sure that's more work than an unreadable line-long regex that you're not sure actually works correctly?
– svick, commented Apr 22, 2014 at 15:51

My concern wasn't my workload but my web app having to load a big library for one simple operation.
– iambriansreed, commented Apr 22, 2014 at 16:00

It's not if that function is something that's actually quite complicated like HTML parsing. You can't parse HTML with regex.
– svick, commented Apr 22, 2014 at 16:09

If that's such a big concern, you should probably be looking at something like C++ instead of any of the .NET languages.
– user33306, commented Apr 22, 2014 at 16:11

Maintaining libraries is often more difficult than maintaining a small piece of code. When the platform changes and the library has not been updated, you are stuck. The argument can go the other way as well: when parsing requirements change, one can simply update the library and be done. In my experience, the problems happen more often than the benefits, so I agree with the OP in this case, but for maintenance reasons, not performance reasons.
– rocketsarefast, commented Apr 27, 2015 at 14:14

1 Answer

\$\begingroup\$

Whitespace around = is allowed; your regex won't handle that.
Characters in HTML can be encoded, and some characters (like &) have to be encoded. For example, name="AT&amp;T" should return that the value of name is AT&T.
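
Both points could be handled along the following lines. This is only a rough sketch under the question's own assumption that every value is quoted, not a real HTML parser: the class name HtmlAttributeSketch, the method name Parse, and the adjusted pattern are made up for illustration. It allows whitespace on either side of = and decodes character references with System.Net.WebUtility.HtmlDecode. It also captures the value without its quotes, so no Substring call is needed, and it keeps the first occurrence of a repeated name so that Dictionary.Add cannot throw.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlAttributeSketch
{
    // Illustrative pattern (an assumption, not the asker's original):
    // group 1 = attribute name, then optional whitespace around '=',
    // group 2 = single-quoted value, group 3 = double-quoted value.
    static readonly Regex AttributePattern = new Regex(
        @"([^\t\n\f \/>""'=]+)\s*=\s*(?:'([^']*)'|""([^""]*)"")");

    public static Dictionary<string, string> Parse(string tag)
    {
        // Compare names case-insensitively, since HTML attribute names are.
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in AttributePattern.Matches(tag))
        {
            string name = match.Groups[1].Value;

            // Exactly one of the two value groups takes part in a match.
            string raw = match.Groups[2].Success
                ? match.Groups[2].Value
                : match.Groups[3].Value;

            // Keep the first occurrence of a repeated name instead of throwing.
            if (!attributes.ContainsKey(name))
            {
                // Turn character references such as &amp; back into characters.
                attributes.Add(name, WebUtility.HtmlDecode(raw));
            }
        }

        return attributes;
    }

    static void Main()
    {
        var attrs = Parse("<body class = \" something \" name=\"AT&amp;T\" hello='world' />");
        foreach (var pair in attrs)
        {
            Console.WriteLine("{0}: [{1}]", pair.Key, pair.Value);
        }
    }
}
```

Unquoted values (class=foo) and attributes without a value (disabled) are still skipped by this pattern, so it remains a shortcut for tags whose shape you control.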

answered Apr 22, 2014 at 16:00 by svick

\$\endgroup\$
