The purpose of this function is to determine if report_input_file
is either in JSON format or XML format, this is what I came up with, just want to know if this is best practice / pythonic way of achieving this?
def parse_report_file(self, report_input_file):
""" Checks if input report file is of type json or xml """
parsed_report_file = None
try:
parsed_report_file = self._parse_json_file(report_input_file)
self.is_json_report_file = True
except Exception, e:
parsed_report_file = self._parse_xml_file(report_input_file)
self.is_json_report_file = False
return parsed_report_file
4 Answers 4
Well, if you only need to distinguish between JSON
and XML
formats I'd keep it simple. If the encoding of that string is known to you (or is ASCII or UTF), then looking at the very first char in the file should be enough. Your code looks pythonic to me, but I'd rather do it this way:
def parse_report_file(report_input_file):
with open(report_input_file) as unknown_file:
c = unknown_file.read(1)
if c != '<':
return 'Is JSON'
return 'Is XML'
While it is legal for JSON
data structures to begin with null
, true
, false
you can avoid those situations if you already know a bit about your data structures.
If you also need to distinguish between JSON
, XML
and other random file, you should simply strip any whitespace before you look at the "first" char.
-
\$\begingroup\$ The problem with this is that it identifies invalid XML files as JSON. \$\endgroup\$Dancrumb– Dancrumb2016年08月05日 15:34:58 +00:00Commented Aug 5, 2016 at 15:34
-
\$\begingroup\$ @Dancrumb But if the invalid XML is parsed as JSON, it will still be flagged as invalid. \$\endgroup\$MTCoster– MTCoster2016年08月05日 15:41:09 +00:00Commented Aug 5, 2016 at 15:41
-
1\$\begingroup\$ @Dancrumb An invalid XML that is identified as JSON isn't a big deal though. Invalid is invalid and will get rejected when parsing the JSON. If there's a problem with this, it's that it ignores whitespace, and I think that's a pretty minor problem that would be beyond the scope of this review anyway. \$\endgroup\$corsiKa– corsiKa2016年08月05日 15:42:54 +00:00Commented Aug 5, 2016 at 15:42
-
\$\begingroup\$ Right, but the JSON parser's error message will refer to the invalidity of the file wrt the JSON spec, not the XML spec. This makes things harder to debug for users of this function. \$\endgroup\$Dancrumb– Dancrumb2016年08月05日 15:43:46 +00:00Commented Aug 5, 2016 at 15:43
-
\$\begingroup\$ It's not true that "invalid is invalid". Parsers report the reason for the invalidity. Applying a parser implies a set of expectations around the format of the file and allows the parser to report when those expectations were missed. If you apply the wrong parser to a file, you'll get reports about irrelevant expectations, making debugging unnecessarily harder. \$\endgroup\$Dancrumb– Dancrumb2016年08月05日 15:45:32 +00:00Commented Aug 5, 2016 at 15:45
Go the way you did already. Taking a cheaper approach like testing the first (non whitespace) character might work on input that always succeeds on parsing. But as soon as you encounter an invalid file you're stuck. You probably won't get the correct error messages.
Implementing a parser for XML and / or JSON with regex only will not work. You will have several edge cases where you achieve false positives or false negatives (e.g. XML recognized as JSON and vice versa). You're parser won't be as efficient as the libraries you're using to handle these formats. You will achieve less performance AND more bugs. Just don't do it!
About the python style guide thing I can't judge as I'm no python guru.
-
1\$\begingroup\$ Listen to this advice, don't Implement your own xml/html/json parser in regex unless you are a masochist. \$\endgroup\$Davos– Davos2018年04月23日 06:45:43 +00:00Commented Apr 23, 2018 at 6:45
As Dexter said, you could keep it simple and just look at the first character. The flaw being noted with that is it might return "Is JSON" if it's malformed XML and vice-versa, so if you wanted, you could return 1 of three options. So to expand a bit and give you something to build upon. You could use regular expressions to match the first and last characters (and build from here for anything in between).
import 're'
def parse_report_file(report_input_file):
with open(report_input_file, 'r') as unknown_file:
# Remove tabs, spaces, and new lines when reading
data = re.sub(r'\s+', '', unknown_file.read())
if (re.match(r'^<.+>$', data)):
return 'Is XML'
if (re.match(r'^({|[).+(}|])$', data)):
return 'Is JSON'
return 'Is INVALID'
The second regex isn't matching the same opening braces with closing braces, so you could do that there. You could also read the entire file in without removing the spaces and then ignore any number of beginning and ending spaces when doing additional regex matches, but I did it this way for a little more readability.
Anyway, just another example. Definitely not the most performant but showing you some options.
-
\$\begingroup\$ Regex are not for parsing... Going the way you suggest just leads to the same problems which the "simpler" or at least more light weight suggestion suffers from. Reimplemeting the whole parser for JSON is one thing. But XML? This just leads to awful maintainability. And your performance won't be as good as you think at all. There is a reason why pre-built, highly optimized parsing libraries do exist at all. \$\endgroup\$bash0r– bash0r2016年08月06日 01:42:42 +00:00Commented Aug 6, 2016 at 1:42
-
\$\begingroup\$ I commented that the performance would not be good, but that I was showing a regex option. And though the name of the function begins with parse, I'm not sure the op's question was to completely determine whether the entire file was valid. I'm not going to write an entire parser in my answer, but showing the possibility to match some things with regex. I already noted in the answer both things you bring up in the comment. \$\endgroup\$Nate– Nate2016年08月06日 01:52:32 +00:00Commented Aug 6, 2016 at 1:52
-
\$\begingroup\$ I just wanted to express another clear warning to not go that path as it will lead to the problems you already stated. Not having a full parser is (in my opinion) a bug. You don't know when it fails. But when it does you're f***ed. I didn't want to offend you! \$\endgroup\$bash0r– bash0r2016年08月06日 01:54:49 +00:00Commented Aug 6, 2016 at 1:54
-
\$\begingroup\$ Not offended at all. I just think that the op is not looking for a full parsing solution, as judging by the other answers/comments, it will fail up if invalid. \$\endgroup\$Nate– Nate2016年08月06日 02:01:21 +00:00Commented Aug 6, 2016 at 2:01
I'll assume that you want to actually check these are valid xml or json documents (or neither).
I see no issue with your method structure, apart from catching Exception
which is probably too general. You either want to be catching something more specific like ValueError or custom error for your types.
I'm not a fan of the other suggestions, just checking the first character in the file, nor using regex to strip whitespace then look for a certain variety of brackets. There are good modules in the standard library available that will parse JSON or XML, and do the work for you, for example json and ElementTree.
Here's a python 2 / 3 compatible version of how you could fit those modules into your existing class.
I've done it fairly simply. Some people prefer not to do IO or slow things in the object __init__
but if your class is really an abstract representation of a json/xml document then it is the perfect situation to use @property
for getter-only attributes and set them in the parse_report_file
method.
To take this further of course it needs some unit tests, but I've done a quick test by adding some things to the entry point. It's expecting 3 files, here's the contents if you want to try.
real.xml
<?xml version='1.0' encoding='utf-8' ?>
<root>
<some-node thing='stuff'>some text</some-node>
</root>
real.json
{
"key" : "value"
}
blank.txt
This file is really empty but it could just be anything that is not valid xml or json.
Code
from __future__ import print_function
import xml.etree.cElementTree as ET
import json
class ReportDocument(object):
"""
ReportDocument is an abstraction of a JSON, XML or neither file type.
"""
def __init__(self, report_input_file=None, *args, **kwargs):
self._is_json_report_file=False
self._is_xml_report_file=False
self._parsed_report_file=None
if report_input_file is not None:
self.parse_report_file(report_input_file)
@property
def is_json_report_file(self):
return self._is_json_report_file
@property
def is_xml_report_file(self):
return self._is_xml_report_file
def _parse_json_file(self,report_input_file):
"""
Given a text file, returns a JSON object or None
"""
with open(report_input_file,'rb') as in_file:
return json.load(in_file)
def _parse_xml_file(self,report_input_file):
"""
Given a text file, returns an XML element or None
"""
with open(report_input_file,'rb') as in_file:
return ET.parse(in_file)
def parse_report_file(self, report_input_file):
"""
Checks if input report file is of type json or xml
returns a json object or an xml element or None
"""
try:
self._parsed_report_file = self._parse_xml_file(report_input_file)
self._is_xml_report_file = True
except ET.ParseError:
try:
self._parsed_report_file = self._parse_json_file(report_input_file)
self._is_json_report_file = True
except ValueError: # or in Python3 the specific error json.decoder.JSONDecodeError
pass
if __name__ == '__main__':
files_to_test = ['real.xml', 'real.json', 'blank.txt']
for file_type in files_to_test:
test_class = ReportDocument(file_type)
print("{file_type}: \n\t is_json:{is_json_report_file} \n\t is_xml:{is_xml_report_file}".format(
file_type=file_type,
is_json_report_file=test_class.is_json_report_file,
is_xml_report_file = test_class.is_xml_report_file)
)
if(s.startswith('<')
wheres
is the first in your file. An XML document entity (in common parlance, an XML document) must start with<
. \$\endgroup\$try-catch
for cases where thereport_input_file
is on neither of these two formats. \$\endgroup\$