I have a byte array in a C# program.
I want to determine as quickly as possible whether its content is XML. If it is, I also want to return the parsed XML.
So far I have this method:
protected bool TryGetXElement(byte[] body, out XElement el)
{
    el = null;
    // if there is no data, this is not xml :)
    if (body == null || body.Length == 0)
    {
        return false;
    }
    try
    {
        // Load the data into a memory stream
        using (var ms = new MemoryStream(body))
        {
            using (var sr = new StreamReader(ms))
            {
                // if the first character is not <, this can't be xml
                var firstChar = (char)sr.Peek();
                if (firstChar != '<')
                {
                    return false;
                }
                else
                {
                    // ultimately, we try to parse the XML
                    el = XElement.Load(sr);
                    return true;
                }
            }
        }
    }
    catch
    {
        // if it failed, we suppose this is not XML
        return false;
    }
}
Is there any room for improvement?
2 Answers
Why not verify that a root element exists rather than checking for a '<'?
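public bool GetRootNode(XmlReader reader)
{
    // Initialised to false: an unassigned local would not compile, and
    // "no element found" is the correct result for an empty document.
    bool isValid = false;
    try
    {
        // Advance node by node until the first element (the root) is found
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                isValid = true;
                break;
            }
        }
    }
    catch (XmlException)
    {
        // Rethrow as-is so the original stack trace and line info are preserved
        throw;
    }
    return isValid;
}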
You could also use the chain of responsibility pattern to determine which parser you want to use.
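As a rough illustration of that idea, here is a minimal sketch, not a definitive implementation. The names IBodyParser, XmlBodyParser, TextBodyParser and ParserChain are hypothetical and do not come from the question, and the chain is simplified to an ordered list of handlers rather than handlers linked to each other. Each parser either claims the payload or declines it, and the first one that succeeds wins.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical interface: a handler returns true if it could parse the body.
public interface IBodyParser
{
    bool TryParse(byte[] body, out object result);
}

// Handles payloads that are well-formed XML.
public sealed class XmlBodyParser : IBodyParser
{
    public bool TryParse(byte[] body, out object result)
    {
        result = null;
        if (body == null || body.Length == 0)
        {
            return false;
        }
        try
        {
            using (var stream = new MemoryStream(body))
            {
                result = XElement.Load(stream);
                return true;
            }
        }
        catch (XmlException)
        {
            // Not well-formed XML: decline so the next parser can try
            return false;
        }
    }
}

// Fallback handler (an assumption for this sketch): treats any non-null body as UTF-8 text.
public sealed class TextBodyParser : IBodyParser
{
    public bool TryParse(byte[] body, out object result)
    {
        result = body == null ? null : Encoding.UTF8.GetString(body);
        return result != null;
    }
}

public static class ParserChain
{
    // Tries each parser in order and returns the first successful result, or null.
    public static object Parse(byte[] body, IEnumerable<IBodyParser> parsers)
    {
        foreach (var parser in parsers)
        {
            object result;
            if (parser.TryParse(body, out result))
            {
                return result;
            }
        }
        return null;
    }
}
```

With the parsers ordered as XmlBodyParser then TextBodyParser, an XML payload comes back as an XElement and anything else falls through to the text handler. The order of the list decides which parser gets the first chance.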
EDIT
Not sure if this will work, but here is an example of using the above code:
protected bool TryGetXElement(byte[] body, out XElement el)
{
    el = null;
    // if there is no data, this is not xml :)
    if (body == null || body.Length == 0)
    {
        return false;
    }
    try
    {
        // Load the data into a memory stream
        using (var ms = new MemoryStream(body))
        {
            var settings = new XmlReaderSettings { CheckCharacters = true };
            // The XmlReader reads the stream directly, so no StreamReader is needed
            using (XmlReader reader = XmlReader.Create(ms, settings))
            {
                if (!GetRootNode(reader))
                {
                    return false;
                }
                else
                {
                    // The reader is now positioned on the root element, so load
                    // from the same reader instead of the already-consumed stream
                    el = XElement.Load(reader);
                    return true;
                }
            }
        }
    }
    catch
    {
        // if it failed, we suppose this is not XML
        return false;
    }
}
- I cannot see why this would be an enhancement. Your code relies on exception catching (as mine ultimately does), while mine checks one char first. I believe checking a single char is very cheap compared to catching an exception, and it can rule out a great number of cases (the probability of having a < char in a non-XML string is very low). – Steve B, Aug 27, 2012 at 16:37
- You could have a '<' in a string of HTML. That is not XML. Does your XML have to fit a particular schema? How are you handling invalid characters that cannot appear in an XML document? Do you have to handle those cases? – NexAddo, Aug 27, 2012 at 16:43
- I don't have a schema in the sense of XSD files. Behind the scenes I have a bunch of "parsers" that will analyse the returned XElement. Each parser can handle a well-known XML structure. I also have parsers for non-XML documents, but that's another story. – Steve B, Aug 27, 2012 at 16:47
Just a few small readability notes:
1. (Struck out by the author:) (body == null || body.Length == 0) could be String.IsNullOrEmpty(body)
2. Longer variable names than sr, el and ms could be a little bit easier to read. "Without proper names, we are constantly decoding and reconstructing the information that should be apparent from reading the code alone." (From codesparkle's former answer.)
3. The innermost else could be omitted:

    if (firstChar != '<')
    {
        return false;
    }
    // ultimately, we try to parse the XML
    el = XElement.Load(sr);
    return true;
- 1. No, body is a byte array, not a string. 2. I agree with you to some extent. I think I should rename el to foundXElement or something like that. Other very local variables like ms or sr can stay as they are. It's a consistent naming convention I use in very small methods. As always, naming conventions are subject to the developer's feelings. 3. Technically yes, it can be omitted. But I think it's more readable when the two paths are visually on the same level. A matter of preference again. Thanks for your feedback anyway. – Steve B, Sep 27, 2012 at 7:51
- [...] <. Moreover, as it's computer-generated XML, I'm confident it will never start with a space.