Determine if a file is either in JSON or XML format

Question 1

The purpose of this function is to determine if report_input_file is either in JSON format or XML format, this is what I came up with, just want to know if this is best practice / pythonic way of achieving this?

def parse_report_file(self, report_input_file):
 """ Checks if input report file is of type json or xml """
 parsed_report_file = None
 try:
 parsed_report_file = self._parse_json_file(report_input_file)
 self.is_json_report_file = True
 except Exception, e:
 parsed_report_file = self._parse_xml_file(report_input_file)
 self.is_json_report_file = False
 return parsed_report_file

Question 2

As far as I know, it's enough to check if(s.startswith('<') where s is the first in your file. An XML document entity (in common parlance, an XML document) must start with <.

Question 3

@Dex'ter that's a good point, but I will I guess always need the try-catch for cases where the report_input_file is on neither of these two formats.

Question 4

@Dex'ter Can XML files not have leading whitespace?

Question 5

@JonB is the purpose to check if they're intended to be JSON/XML, or if they're valid JSON/XML? The two have to potential to diverge wildly. I also imagine that you'd be able to identify, for at least some files, that they appear to be invalid JSON as opposed to invalid XML, but that might be more than what you want.

Question 6

There could also be a couple of bytes for a BOM at the start of file.

Question 7

Well, if you only need to distinguish between JSON and XML formats I'd keep it simple. If the encoding of that string is known to you (or is ASCII or UTF), then looking at the very first char in the file should be enough. Your code looks pythonic to me, but I'd rather do it this way:

def parse_report_file(report_input_file):
 with open(report_input_file) as unknown_file:
 c = unknown_file.read(1)
 if c != '<':
 return 'Is JSON'
 return 'Is XML'

While it is legal for JSON data structures to begin with null, true, false you can avoid those situations if you already know a bit about your data structures.

If you also need to distinguish between JSON, XML and other random file, you should simply strip any whitespace before you look at the "first" char.

Question 8

The problem with this is that it identifies invalid XML files as JSON.

Question 9

@Dancrumb But if the invalid XML is parsed as JSON, it will still be flagged as invalid.

Question 10

@Dancrumb An invalid XML that is identified as JSON isn't a big deal though. Invalid is invalid and will get rejected when parsing the JSON. If there's a problem with this, it's that it ignores whitespace, and I think that's a pretty minor problem that would be beyond the scope of this review anyway.

Question 11

Right, but the JSON parser's error message will refer to the invalidity of the file wrt the JSON spec, not the XML spec. This makes things harder to debug for users of this function.

Question 12

It's not true that "invalid is invalid". Parsers report the reason for the invalidity. Applying a parser implies a set of expectations around the format of the file and allows the parser to report when those expectations were missed. If you apply the wrong parser to a file, you'll get reports about irrelevant expectations, making debugging unnecessarily harder.

Question 13

Go the way you did already. Taking a cheaper approach like testing the first (non whitespace) character might work on input that always succeeds on parsing. But as soon as you encounter an invalid file you're stuck. You probably won't get the correct error messages.

Implementing a parser for XML and / or JSON with regex only will not work. You will have several edge cases where you achieve false positives or false negatives (e.g. XML recognized as JSON and vice versa). You're parser won't be as efficient as the libraries you're using to handle these formats. You will achieve less performance AND more bugs. Just don't do it!

About the python style guide thing I can't judge as I'm no python guru.

Question 14

Listen to this advice, don't Implement your own xml/html/json parser in regex unless you are a masochist.

Question 15

As Dexter said, you could keep it simple and just look at the first character. The flaw being noted with that is it might return "Is JSON" if it's malformed XML and vice-versa, so if you wanted, you could return 1 of three options. So to expand a bit and give you something to build upon. You could use regular expressions to match the first and last characters (and build from here for anything in between).

import 're'
def parse_report_file(report_input_file):
 with open(report_input_file, 'r') as unknown_file:
 # Remove tabs, spaces, and new lines when reading
 data = re.sub(r'\s+', '', unknown_file.read())
 if (re.match(r'^<.+>$', data)):
 return 'Is XML'
 if (re.match(r'^({|[).+(}|])$', data)):
 return 'Is JSON'
 return 'Is INVALID'

The second regex isn't matching the same opening braces with closing braces, so you could do that there. You could also read the entire file in without removing the spaces and then ignore any number of beginning and ending spaces when doing additional regex matches, but I did it this way for a little more readability.

Anyway, just another example. Definitely not the most performant but showing you some options.

Question 16

Regex are not for parsing... Going the way you suggest just leads to the same problems which the "simpler" or at least more light weight suggestion suffers from. Reimplemeting the whole parser for JSON is one thing. But XML? This just leads to awful maintainability. And your performance won't be as good as you think at all. There is a reason why pre-built, highly optimized parsing libraries do exist at all.

Question 17

I commented that the performance would not be good, but that I was showing a regex option. And though the name of the function begins with parse, I'm not sure the op's question was to completely determine whether the entire file was valid. I'm not going to write an entire parser in my answer, but showing the possibility to match some things with regex. I already noted in the answer both things you bring up in the comment.

Question 18

I just wanted to express another clear warning to not go that path as it will lead to the problems you already stated. Not having a full parser is (in my opinion) a bug. You don't know when it fails. But when it does you're f***ed. I didn't want to offend you!

Question 19

Not offended at all. I just think that the op is not looking for a full parsing solution, as judging by the other answers/comments, it will fail up if invalid.

Question 20

I'll assume that you want to actually check these are valid xml or json documents (or neither).

I see no issue with your method structure, apart from catching Exception which is probably too general. You either want to be catching something more specific like ValueError or custom error for your types.

I'm not a fan of the other suggestions, just checking the first character in the file, nor using regex to strip whitespace then look for a certain variety of brackets. There are good modules in the standard library available that will parse JSON or XML, and do the work for you, for example json and ElementTree.

Here's a python 2 / 3 compatible version of how you could fit those modules into your existing class.

I've done it fairly simply. Some people prefer not to do IO or slow things in the object __init__ but if your class is really an abstract representation of a json/xml document then it is the perfect situation to use @property for getter-only attributes and set them in the parse_report_file method.

To take this further of course it needs some unit tests, but I've done a quick test by adding some things to the entry point. It's expecting 3 files, here's the contents if you want to try.

real.xml

<?xml version='1.0' encoding='utf-8' ?>
<root>
 <some-node thing='stuff'>some text</some-node>
</root>

real.json

{ 
 "key" : "value" 
}

blank.txt

This file is really empty but it could just be anything that is not valid xml or json.

Code

from __future__ import print_function
import xml.etree.cElementTree as ET
import json
class ReportDocument(object):
 """
 ReportDocument is an abstraction of a JSON, XML or neither file type.
 """
 def __init__(self, report_input_file=None, *args, **kwargs):
 self._is_json_report_file=False
 self._is_xml_report_file=False
 self._parsed_report_file=None
 if report_input_file is not None:
 self.parse_report_file(report_input_file)
 @property
 def is_json_report_file(self):
 return self._is_json_report_file
 @property
 def is_xml_report_file(self):
 return self._is_xml_report_file
 def _parse_json_file(self,report_input_file):
 """
 Given a text file, returns a JSON object or None
 """
 with open(report_input_file,'rb') as in_file:
 return json.load(in_file)
 def _parse_xml_file(self,report_input_file):
 """
 Given a text file, returns an XML element or None
 """
 with open(report_input_file,'rb') as in_file:
 return ET.parse(in_file)
 def parse_report_file(self, report_input_file):
 """ 
 Checks if input report file is of type json or xml 
 returns a json object or an xml element or None
 """
 try:
 self._parsed_report_file = self._parse_xml_file(report_input_file)
 self._is_xml_report_file = True
 except ET.ParseError:
 try:
 self._parsed_report_file = self._parse_json_file(report_input_file)
 self._is_json_report_file = True
 except ValueError: # or in Python3 the specific error json.decoder.JSONDecodeError
 pass
if __name__ == '__main__':
 files_to_test = ['real.xml', 'real.json', 'blank.txt']
 for file_type in files_to_test:
 test_class = ReportDocument(file_type)
 print("{file_type}: \n\t is_json:{is_json_report_file} \n\t is_xml:{is_xml_report_file}".format(
 file_type=file_type, 
 is_json_report_file=test_class.is_json_report_file,
 is_xml_report_file = test_class.is_xml_report_file)
 )

score 7 · Answer 1 · 2016-08-05 13:37:29Z

7

\$\begingroup\$

Well, if you only need to distinguish between JSON and XML formats I'd keep it simple. If the encoding of that string is known to you (or is ASCII or UTF), then looking at the very first char in the file should be enough. Your code looks pythonic to me, but I'd rather do it this way:

def parse_report_file(report_input_file):
 with open(report_input_file) as unknown_file:
 c = unknown_file.read(1)
 if c != '<':
 return 'Is JSON'
 return 'Is XML'

While it is legal for JSON data structures to begin with null, true, false you can avoid those situations if you already know a bit about your data structures.

If you also need to distinguish between JSON, XML and other random file, you should simply strip any whitespace before you look at the "first" char.

Share

edited Aug 9, 2016 at 7:08

answered Aug 5, 2016 at 13:37

Grajdeanu Alex's user avatar

Grajdeanu Alex Grajdeanu Alex

9,3164 gold badges32 silver badges71 bronze badges

\$\endgroup\$

7

\$\begingroup\$ The problem with this is that it identifies invalid XML files as JSON. \$\endgroup\$

Dancrumb
– Dancrumb

2016年08月05日 15:34:58 +00:00
Commented Aug 5, 2016 at 15:34
\$\begingroup\$ @Dancrumb But if the invalid XML is parsed as JSON, it will still be flagged as invalid. \$\endgroup\$

MTCoster
– MTCoster

2016年08月05日 15:41:09 +00:00
Commented Aug 5, 2016 at 15:41
1

\$\begingroup\$ @Dancrumb An invalid XML that is identified as JSON isn't a big deal though. Invalid is invalid and will get rejected when parsing the JSON. If there's a problem with this, it's that it ignores whitespace, and I think that's a pretty minor problem that would be beyond the scope of this review anyway. \$\endgroup\$

corsiKa
– corsiKa

2016年08月05日 15:42:54 +00:00
Commented Aug 5, 2016 at 15:42
\$\begingroup\$ Right, but the JSON parser's error message will refer to the invalidity of the file wrt the JSON spec, not the XML spec. This makes things harder to debug for users of this function. \$\endgroup\$

Dancrumb
– Dancrumb

2016年08月05日 15:43:46 +00:00
Commented Aug 5, 2016 at 15:43
\$\begingroup\$ It's not true that "invalid is invalid". Parsers report the reason for the invalidity. Applying a parser implies a set of expectations around the format of the file and allows the parser to report when those expectations were missed. If you apply the wrong parser to a file, you'll get reports about irrelevant expectations, making debugging unnecessarily harder. \$\endgroup\$

Dancrumb
– Dancrumb

2016年08月05日 15:45:32 +00:00
Commented Aug 5, 2016 at 15:45

| Show 2 more comments

bash0r bash0r 1411 bronze badge · Answer 2 · 2016-08-06 01:51:44Z

Go the way you did already. Taking a cheaper approach like testing the first (non whitespace) character might work on input that always succeeds on parsing. But as soon as you encounter an invalid file you're stuck. You probably won't get the correct error messages.

Implementing a parser for XML and / or JSON with regex only will not work. You will have several edge cases where you achieve false positives or false negatives (e.g. XML recognized as JSON and vice versa). You're parser won't be as efficient as the libraries you're using to handle these formats. You will achieve less performance AND more bugs. Just don't do it!

About the python style guide thing I can't judge as I'm no python guru.

Listen to this advice, don't Implement your own xml/html/json parser in regex unless you are a masochist.

Nate Nate 1112 bronze badges · Answer 3 · 2016-08-05 19:30:20Z

As Dexter said, you could keep it simple and just look at the first character. The flaw being noted with that is it might return "Is JSON" if it's malformed XML and vice-versa, so if you wanted, you could return 1 of three options. So to expand a bit and give you something to build upon. You could use regular expressions to match the first and last characters (and build from here for anything in between).

import 're'
def parse_report_file(report_input_file):
 with open(report_input_file, 'r') as unknown_file:
 # Remove tabs, spaces, and new lines when reading
 data = re.sub(r'\s+', '', unknown_file.read())
 if (re.match(r'^<.+>$', data)):
 return 'Is XML'
 if (re.match(r'^({|[).+(}|])$', data)):
 return 'Is JSON'
 return 'Is INVALID'

The second regex isn't matching the same opening braces with closing braces, so you could do that there. You could also read the entire file in without removing the spaces and then ignore any number of beginning and ending spaces when doing additional regex matches, but I did it this way for a little more readability.

Anyway, just another example. Definitely not the most performant but showing you some options.

Regex are not for parsing... Going the way you suggest just leads to the same problems which the "simpler" or at least more light weight suggestion suffers from. Reimplemeting the whole parser for JSON is one thing. But XML? This just leads to awful maintainability. And your performance won't be as good as you think at all. There is a reason why pre-built, highly optimized parsing libraries do exist at all.
I commented that the performance would not be good, but that I was showing a regex option. And though the name of the function begins with parse, I'm not sure the op's question was to completely determine whether the entire file was valid. I'm not going to write an entire parser in my answer, but showing the possibility to match some things with regex. I already noted in the answer both things you bring up in the comment.
I just wanted to express another clear warning to not go that path as it will lead to the problems you already stated. Not having a full parser is (in my opinion) a bug. You don't know when it fails. But when it does you're f***ed. I didn't want to offend you!
Not offended at all. I just think that the op is not looking for a full parsing solution, as judging by the other answers/comments, it will fail up if invalid.

Davos Davos 1135 bronze badges · Answer 4 · 2018-04-23 14:11:32Z

I'll assume that you want to actually check these are valid xml or json documents (or neither).

I see no issue with your method structure, apart from catching Exception which is probably too general. You either want to be catching something more specific like ValueError or custom error for your types.

I'm not a fan of the other suggestions, just checking the first character in the file, nor using regex to strip whitespace then look for a certain variety of brackets. There are good modules in the standard library available that will parse JSON or XML, and do the work for you, for example json and ElementTree.

Here's a python 2 / 3 compatible version of how you could fit those modules into your existing class.

I've done it fairly simply. Some people prefer not to do IO or slow things in the object __init__ but if your class is really an abstract representation of a json/xml document then it is the perfect situation to use @property for getter-only attributes and set them in the parse_report_file method.

To take this further of course it needs some unit tests, but I've done a quick test by adding some things to the entry point. It's expecting 3 files, here's the contents if you want to try.

real.xml

<?xml version='1.0' encoding='utf-8' ?>
<root>
 <some-node thing='stuff'>some text</some-node>
</root>

real.json

{ 
 "key" : "value" 
}

blank.txt

This file is really empty but it could just be anything that is not valid xml or json.

Code

from __future__ import print_function
import xml.etree.cElementTree as ET
import json
class ReportDocument(object):
 """
 ReportDocument is an abstraction of a JSON, XML or neither file type.
 """
 def __init__(self, report_input_file=None, *args, **kwargs):
 self._is_json_report_file=False
 self._is_xml_report_file=False
 self._parsed_report_file=None
 if report_input_file is not None:
 self.parse_report_file(report_input_file)
 @property
 def is_json_report_file(self):
 return self._is_json_report_file
 @property
 def is_xml_report_file(self):
 return self._is_xml_report_file
 def _parse_json_file(self,report_input_file):
 """
 Given a text file, returns a JSON object or None
 """
 with open(report_input_file,'rb') as in_file:
 return json.load(in_file)
 def _parse_xml_file(self,report_input_file):
 """
 Given a text file, returns an XML element or None
 """
 with open(report_input_file,'rb') as in_file:
 return ET.parse(in_file)
 def parse_report_file(self, report_input_file):
 """ 
 Checks if input report file is of type json or xml 
 returns a json object or an xml element or None
 """
 try:
 self._parsed_report_file = self._parse_xml_file(report_input_file)
 self._is_xml_report_file = True
 except ET.ParseError:
 try:
 self._parsed_report_file = self._parse_json_file(report_input_file)
 self._is_json_report_file = True
 except ValueError: # or in Python3 the specific error json.decoder.JSONDecodeError
 pass
if __name__ == '__main__':
 files_to_test = ['real.xml', 'real.json', 'blank.txt']
 for file_type in files_to_test:
 test_class = ReportDocument(file_type)
 print("{file_type}: \n\t is_json:{is_json_report_file} \n\t is_xml:{is_xml_report_file}".format(
 file_type=file_type, 
 is_json_report_file=test_class.is_json_report_file,
 is_xml_report_file = test_class.is_xml_report_file)
 )

Stack Exchange Network

Determine if a file is either in JSON or XML format

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Determine if a file is either in JSON or XML format

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions