I scraped data from a local .html page and the below code is working. I am a newbie to scraping, just tried with a simple HTML page. It takes 10 sec to complete execution and print data. If I did anything wrong or need improvements, please let me know.
from bs4 import BeautifulSoup
import json
import pyodbc
import datetime
class JsonClass:
def __init__(self, Date, DocumentType, Procedure, NoOfPages):
self.Date = Date
self.DocumentType = DocumentType
self.Procedure = Procedure
self.NoOfPages = NoOfPages
def json_to_db(json_string):
conn = pyodbc.connect('Driver={SQL Server};' 'Server=xyz-PAVILION;' 'Database=jsondata;' 'Trusted_Connect=yes')
conn.autocommit = True
cursor = conn.cursor()
try:
cursor.execute('EXEC prcJsonInsertData @json = ?',
json_string) # Passing Json Data to DB via Stored Procedure(SP)
print('Data inserted')
except pyodbc.Error as error:
print('Error : %s' % error)
return False
except:
print('Operation Failed')
return False
conn.close()
return True
def json_serialize(dict_list):
with open('html_to_json.json', 'w') as file_out:
json.dump(dict_list, file_out, indent=4) # Serializing dict_list and writing in .json file
return json.dumps(dict_list)
def json_deserilaize(json_string):
with open('html_to_json.json', 'r') as file_out:
json_data = json.load(file_out) # Deserialization Data
json_class = [JsonClass(**i) for i in json_data] # Binding Json_data to Json_Class
print('********* After Deserialization *******************')
print('-------------------------------------')
for i in json_class:
print('Date : ' + i.Date)
print('DocumentType : ' + i.DocumentType)
print('Procedure : ' + i.Procedure)
print('NoOfPages : ' + i.NoOfPages)
print('-------------------------------------')
def html_data():
my_file = open("C:/Users/xyz/Downloads/sample.htm", 'r')
soup = BeautifulSoup(my_file, 'html.parser', from_encoding="UTF-8")
t_body = soup.find('tbody')
rows = t_body.find_all('tr')
dict_list = []
for row in rows:
column = row.find_all('td')
column = [x.text for x in column]
record = dict()
date_obj = datetime.datetime.strptime(column[1], '%d.%m.%Y')
record['Date'] = date_obj.date().isoformat()
record['DocumentType'] = column[2]
record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
record['NoOfPages'] = column[4]
dict_list.append(record)
json_string = json_serialize(dict_list) # Func - 1
if json_to_db(json_string): # Func - 2
json_deserilaize(json_string) # Func - 3
if __name__ == '__main__':
html_data()
-
1\$\begingroup\$ Please do not update the code in your question after receiving answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers . \$\endgroup\$Mast– Mast ♦2021年02月13日 19:20:54 +00:00Commented Feb 13, 2021 at 19:20
2 Answers 2
PEP8 names
All of these variables:
def __init__(self, Date, DocumentType, Procedure, NoOfPages):
self.Date = Date
self.DocumentType = DocumentType
self.Procedure = Procedure
self.NoOfPages = NoOfPages
should be lower_snake_case.
Data classes
Replace JsonClass
with a @dataclass
that uses an implicit __init__
.
Connection
First note that there's no point in separating out your connection parameters like
'Driver={SQL Server};' 'Server=xyz-PAVILION;'
since they will be implicitly concatenated to
'Driver={SQL Server};Server=xyz-PAVILION;'
Beyond that, the documentation states that kwargs are converted to a formatted conn string, so you're better off writing
pyodbc.connect(
Driver='{SQL Server}',
Server='xyz-PAVILION',
Database='jsondata',
Trusted_Connect='yes',
)
Context management
Use a with
on your connection and cursor objects; read
https://github.com/mkleehammer/pyodbc/wiki/Connection#context-manager
Exception interference
Don't convert exceptions to booleans, as in
except pyodbc.Error as error:
print('Error : %s' % error)
return False
except:
print('Operation Failed')
return False
conn.close()
return True
Deal with the exceptions, potentially wrapping them in your own exception types, and catching them at an upper level.
Double serialization
Don't do this:
with open('html_to_json.json', 'w') as file_out:
json.dump(dict_list, file_out, indent=4) # Serializing dict_list and writing in .json file
return json.dumps(dict_list)
Hold onto the result of dumps
and write that string to the file.
Print helpers
Move this code:
print('Date : ' + i.Date)
print('DocumentType : ' + i.DocumentType)
print('Procedure : ' + i.Procedure)
print('NoOfPages : ' + i.NoOfPages)
to a method of JsonClass
.
Hard-coded paths
my_file = open("C:/Users/xyz/Downloads/sample.htm", 'r')
should not be hard-coded. Set it as some kind of parameter - a command-line argument maybe.
Dict literals
record = dict()
date_obj = datetime.datetime.strptime(column[1], '%d.%m.%Y')
record['Date'] = date_obj.date().isoformat()
record['DocumentType'] = column[2]
record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
record['NoOfPages'] = column[4]
should not call dict()
, should not individually index keys, and should instead use a dict {}
literal.
-
\$\begingroup\$ Print Helpers is not working. I created a method def PrintData(i) in JsonClass and using a loop to pass i(parameter to PrintData() method) from the json_desrialisation method. for i in json_class: ....JsonClass.PrintData(i) \$\endgroup\$user15069057– user150690572021年02月13日 18:36:28 +00:00Commented Feb 13, 2021 at 18:36
-
\$\begingroup\$ It shouldn't be
def PrintData(i)
; it should bedef print_data(self)
\$\endgroup\$Reinderien– Reinderien2021年02月13日 18:38:26 +00:00Commented Feb 13, 2021 at 18:38 -
\$\begingroup\$ I done changes to the above code. Have a look but the helper method is not working(getting a " TypeError: print_data() missing 1 required positional argument: 'i' ") \$\endgroup\$user15069057– user150690572021年02月13日 19:17:00 +00:00Commented Feb 13, 2021 at 19:17
-
\$\begingroup\$ As Mast indicated, you should not be editing this question. If your new solution no longer works, consider posting on StackOverflow. \$\endgroup\$Reinderien– Reinderien2021年02月14日 17:36:58 +00:00Commented Feb 14, 2021 at 17:36
Looks like you are trying to get rid of the Unicode non-breaking space:
record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
Dealing with Unicode or character set conversions can be a headache, so I would suggest that you have a look at existing libraries to "normalize" the data. For example: unicodedata.normalize To convert those non-breaking spaces to regular spaces I would try this:
import unicodedata
record['Procedure'] = unicodedata.normalize("NFKD", column[3])
The benefit would be to "downgrade" a number of pesky characters you may encounter and not just this particular one.
Some background reading: Unicode equivalence