Checking and returning Xml from a byte array

Question 1

I have a byte array in a C# program.

I want to determine as quickly as possible whether the content is XML. If it is, I also want to return the XML itself.

So far I have this method:

    protected bool TryGetXElement(byte[] body, out XElement el)
    {
        el = null;
        // if there is no data, this is not xml :)
        if (body == null || body.Length == 0)
        {
            return false;
        }
        try
        {
            // Load the data into a memory stream
            using (var ms = new MemoryStream(body))
            {
                using (var sr = new StreamReader(ms))
                {
                    // if the first character is not <, this can't be xml
                    var firstChar = (char)sr.Peek();
                    if (firstChar != '<')
                    {
                        return false;
                    }
                    else
                    {
                        // ultimately, we try to parse the XML
                        el = XElement.Load(sr);
                        return true;
                    }
                }
            }
        }
        catch
        {
            // if it failed, we suppose this is not XML
            return false;
        }
    }

Is there any room for improvement?

Question 2

Why do you need to do this? Can't you figure out whether it's XML some other way?

Question 3

The bytes come from a message stream that I cannot control. I have to route each message to one of my parsers, depending on the kind of data. Messages are sometimes binary, sometimes XML.

Question 4

Does the structure of the messages you are receiving have any metadata that could clue you in as to their contents (e.g., a MIME header)?

Question 5

Is it possible for the XML to start with a space?

Question 6

@minibill: the XML standard states it must start with a <. Moreover, as it's machine-generated XML, I'm confident it will never start with a space.
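The '<' test discussed in these comments can also be run on the raw bytes, before any stream or reader is allocated. The following is a minimal sketch rather than the asker's code: the names XmlSniffer and LooksLikeXml are hypothetical, and it assumes the payload is UTF-8 or another ASCII-compatible encoding (UTF-16 would need a different check). It skips an optional UTF-8 byte order mark and, to cover the leading-space case raised above, any leading whitespace.

```csharp
using System;

public static class XmlSniffer
{
    // Hypothetical helper: true when the first significant byte is '<'.
    // Assumes UTF-8 or another ASCII-compatible encoding.
    public static bool LooksLikeXml(byte[] body)
    {
        if (body == null || body.Length == 0)
        {
            return false;
        }

        int i = 0;

        // Skip a UTF-8 byte order mark (EF BB BF) if present.
        if (body.Length >= 3 && body[0] == 0xEF && body[1] == 0xBB && body[2] == 0xBF)
        {
            i = 3;
        }

        // Skip XML whitespace: space, tab, line feed, carriage return.
        while (i < body.Length &&
               (body[i] == 0x20 || body[i] == 0x09 || body[i] == 0x0A || body[i] == 0x0D))
        {
            i++;
        }

        return i < body.Length && body[i] == (byte)'<';
    }
}
```

This is only a cheap filter. A payload that passes it still has to go through XElement.Load, which remains the real test of whether the content is XML.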

Question 7

Why not verify that a root node exists rather than checking for a '<'?

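    // Returns true as soon as the reader reaches an element node (the root).
    public bool GetRootNode(XmlReader reader)
    {
        // Must be initialised: C# rejects returning an unassigned local.
        bool isValid = false;
        try
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    isValid = true;
                    break;
                }
            }
        }
        catch (XmlException)
        {
            // Rethrow as-is so the caller keeps the original stack trace.
            throw;
        }
        return isValid;
    }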

You could also use the chain of responsibility pattern to determine which parser you want to use.
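As a rough illustration of that suggestion, here is a minimal sketch of a simplified, list-driven variant of chain of responsibility: each handler either claims the message or declines, and the router tries them in order. Everything named here (IMessageHandler, XmlMessageHandler, BinaryMessageHandler, MessageRouter) is hypothetical and not taken from the question. The string results stand in for whatever the real parsers would return.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical contract: a handler either claims the message or declines it.
public interface IMessageHandler
{
    bool TryHandle(byte[] body, out string result);
}

public sealed class XmlMessageHandler : IMessageHandler
{
    public bool TryHandle(byte[] body, out string result)
    {
        result = null;
        // Cheap rejection, same idea as the '<' check in the question.
        // Assumption: no byte order mark precedes the first '<'.
        if (body == null || body.Length == 0 || body[0] != (byte)'<')
        {
            return false;
        }
        try
        {
            using (var stream = new MemoryStream(body))
            {
                result = "xml:" + XElement.Load(stream).Name.LocalName;
                return true;
            }
        }
        catch (XmlException)
        {
            // Not well-formed XML: decline so the next handler can try.
            return false;
        }
    }
}

public sealed class BinaryMessageHandler : IMessageHandler
{
    public bool TryHandle(byte[] body, out string result)
    {
        // Fallback handler: accepts anything.
        result = "binary:" + (body == null ? 0 : body.Length) + " bytes";
        return true;
    }
}

public static class MessageRouter
{
    // Tries each handler in order and returns the first accepted result.
    public static string Route(byte[] body, IEnumerable<IMessageHandler> chain)
    {
        foreach (var handler in chain)
        {
            string result;
            if (handler.TryHandle(body, out result))
            {
                return result;
            }
        }
        throw new InvalidOperationException("No handler accepted the message.");
    }
}
```

With the XML handler placed before the binary fallback, adding support for another message type would mean adding one more handler to the list, without touching the existing ones.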

EDIT

Not sure if this will work, but here is an example of using the above code:

    protected bool TryGetXElement(byte[] body, out XElement el)
    {
        el = null;
        // if there is no data, this is not xml :)
        if (body == null || body.Length == 0)
        {
            return false;
        }
        try
        {
            // Load the data into a memory stream
            using (var ms = new MemoryStream(body))
            {
                using (var sr = new StreamReader(ms))
                {
                    var settings = new XmlReaderSettings { CheckCharacters = true };
                    using (XmlReader reader = XmlReader.Create(ms, settings))
                    {
                        if (!GetRootNode(reader))
                        {
                            return false;
                        }
                        else
                        {
                            // the XmlReader has advanced the stream, so rewind
                            // before parsing the whole document
                            ms.Position = 0;
                            // ultimately, we try to parse the XML
                            el = XElement.Load(sr);
                            return true;
                        }
                    }
                }
            }
        }
        catch
        {
            // if it failed, we suppose this is not XML
            return false;
        }
    }

Question 8

I cannot see why this would be an improvement. Your code relies on exception catching (as mine ultimately does), while mine checks one character first. I believe checking a single character is very cheap compared to catching an exception, and it filters out a great number of cases (the probability of a non-XML payload starting with a < character is very low).

Question 9

You could have a '<' in a string of HTML. That is not XML. Does your XML have to fit a particular schema? How are you handling invalid characters that cannot appear in an XML document? Do you have to handle those cases?
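To make the HTML point concrete, here is a small sketch (the names WellFormedCheck and IsWellFormedXml are hypothetical). HTML such as `<p>line one<br>line two</p>` starts with '<' and so passes the first-character test, but the unclosed `<br>` makes it fail XML parsing. The XHTML-style `<br/>` variant parses.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class WellFormedCheck
{
    // Hypothetical helper: true only when the text parses as a well-formed XML element.
    // Assumes text is not null (XElement.Parse throws ArgumentNullException for null).
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            XElement.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Mismatched or unclosed tags, illegal characters, missing root, etc.
            return false;
        }
    }
}
```

So the '<' test can only rule payloads out. Anything that passes it still depends on the parser, and on its exception, for the final verdict.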

Question 10

I don't have a schema in the sense of XSD files. Behind the scenes I have a bunch of "parsers" that analyse the returned XElement. Each parser can handle one well-known XML structure. I also have parsers for non-XML documents, but that's another story.

Question 11

Just a few small readability notes:

1. ~~(body == null || body.Length == 0) could be String.IsNullOrEmpty(body)~~
2. Longer variable names than sr, el and ms could be a little bit easier to read. "Without proper names, we are constantly decoding and reconstructing the information that should be apparent from reading the code alone." (From codesparkle's former answer.)
3. The innermost else could be omitted:

    if (firstChar != '<')
    {
        return false;
    }
    // ultimately, we try to parse the XML
    el = XElement.Load(sr);
    return true;

Question 12

1. No, body is a byte array, not a string. 2. I agree with you to some extent. I think I should rename el to foundXElement or something like that. Other very local variables like ms or sr can stay as they are. It's a consistent naming convention I use in very small methods. As always, naming conventions are subject to the developer's feelings. 3. Technically yes, it can be omitted. But I think it's more readable when the two paths are visually on the same level. A matter of preference again. Thanks for your feedback anyway.

NexAddo · Answer 1 · 2012-08-27 16:19:28Z

score 1 · Answer 2 · 2012-09-26 19:28:23Z

