Web Scraping, Serializing and Deserializing in Python

Question 1

I scraped data from a local .html page and the below code is working. I am a newbie to scraping, just tried with a simple HTML page. It takes 10 sec to complete execution and print data. If I did anything wrong or need improvements, please let me know.

from bs4 import BeautifulSoup
import json
import pyodbc
import datetime
class JsonClass:
 def __init__(self, Date, DocumentType, Procedure, NoOfPages):
 self.Date = Date
 self.DocumentType = DocumentType
 self.Procedure = Procedure
 self.NoOfPages = NoOfPages
def json_to_db(json_string):
 conn = pyodbc.connect('Driver={SQL Server};' 'Server=xyz-PAVILION;' 'Database=jsondata;' 'Trusted_Connect=yes')
 conn.autocommit = True
 cursor = conn.cursor()
 try:
 cursor.execute('EXEC prcJsonInsertData @json = ?',
 json_string) # Passing Json Data to DB via Stored Procedure(SP)
 print('Data inserted')
 except pyodbc.Error as error:
 print('Error : %s' % error)
 return False
 except:
 print('Operation Failed')
 return False
 conn.close()
 return True
def json_serialize(dict_list):
 with open('html_to_json.json', 'w') as file_out:
 json.dump(dict_list, file_out, indent=4) # Serializing dict_list and writing in .json file
 return json.dumps(dict_list)
def json_deserilaize(json_string):
 with open('html_to_json.json', 'r') as file_out:
 json_data = json.load(file_out) # Deserialization Data
 json_class = [JsonClass(**i) for i in json_data] # Binding Json_data to Json_Class
 print('********* After Deserialization *******************')
 print('-------------------------------------')
 for i in json_class:
 print('Date : ' + i.Date)
 print('DocumentType : ' + i.DocumentType)
 print('Procedure : ' + i.Procedure)
 print('NoOfPages : ' + i.NoOfPages)
 print('-------------------------------------')
def html_data():
 my_file = open("C:/Users/xyz/Downloads/sample.htm", 'r')
 soup = BeautifulSoup(my_file, 'html.parser', from_encoding="UTF-8")
 t_body = soup.find('tbody')
 rows = t_body.find_all('tr')
 dict_list = []
 for row in rows:
 column = row.find_all('td')
 column = [x.text for x in column]
 record = dict()
 date_obj = datetime.datetime.strptime(column[1], '%d.%m.%Y')
 record['Date'] = date_obj.date().isoformat()
 record['DocumentType'] = column[2]
 record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
 record['NoOfPages'] = column[4]
 dict_list.append(record)
 json_string = json_serialize(dict_list) # Func - 1
 if json_to_db(json_string): # Func - 2
 json_deserilaize(json_string) # Func - 3
if __name__ == '__main__':
 html_data()

Question 2

Please do not update the code in your question after receiving answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers .

Question 3

PEP8 names

All of these variables:

def __init__(self, Date, DocumentType, Procedure, NoOfPages):
 self.Date = Date
 self.DocumentType = DocumentType
 self.Procedure = Procedure
 self.NoOfPages = NoOfPages

should be lower_snake_case.

Data classes

Replace JsonClass with a @dataclass that uses an implicit __init__.

Connection

First note that there's no point in separating out your connection parameters like

'Driver={SQL Server};' 'Server=xyz-PAVILION;'

since they will be implicitly concatenated to

'Driver={SQL Server};Server=xyz-PAVILION;'

Beyond that, the documentation states that kwargs are converted to a formatted conn string, so you're better off writing

pyodbc.connect(
 Driver='{SQL Server}',
 Server='xyz-PAVILION',
 Database='jsondata',
 Trusted_Connect='yes',
)

Context management

Use a with on your connection and cursor objects; read

https://github.com/mkleehammer/pyodbc/wiki/Connection#context-manager

Exception interference

Don't convert exceptions to booleans, as in

except pyodbc.Error as error:
 print('Error : %s' % error)
 return False
except:
 print('Operation Failed')
 return False
conn.close()
return True

Deal with the exceptions, potentially wrapping them in your own exception types, and catching them at an upper level.

Double serialization

Don't do this:

with open('html_to_json.json', 'w') as file_out:
 json.dump(dict_list, file_out, indent=4) # Serializing dict_list and writing in .json file
return json.dumps(dict_list)

Hold onto the result of dumps and write that string to the file.

Print helpers

Move this code:

 print('Date : ' + i.Date)
 print('DocumentType : ' + i.DocumentType)
 print('Procedure : ' + i.Procedure)
 print('NoOfPages : ' + i.NoOfPages)

to a method of JsonClass.

Hard-coded paths

my_file = open("C:/Users/xyz/Downloads/sample.htm", 'r')

should not be hard-coded. Set it as some kind of parameter - a command-line argument maybe.

Dict literals

 record = dict()
 date_obj = datetime.datetime.strptime(column[1], '%d.%m.%Y')
 record['Date'] = date_obj.date().isoformat()
 record['DocumentType'] = column[2]
 record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
 record['NoOfPages'] = column[4]

should not call dict(), should not individually index keys, and should instead use a dict {} literal.

Question 4

Print Helpers is not working. I created a method def PrintData(i) in JsonClass and using a loop to pass i(parameter to PrintData() method) from the json_desrialisation method. for i in json_class: ....JsonClass.PrintData(i)

Question 5

It shouldn't be def PrintData(i); it should be def print_data(self)

Question 6

I done changes to the above code. Have a look but the helper method is not working(getting a " TypeError: print_data() missing 1 required positional argument: 'i' ")

Question 7

As Mast indicated, you should not be editing this question. If your new solution no longer works, consider posting on StackOverflow.

Question 8

Looks like you are trying to get rid of the Unicode non-breaking space:

record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')

Dealing with Unicode or character set conversions can be a headache, so I would suggest that you have a look at existing libraries to "normalize" the data. For example: unicodedata.normalize To convert those non-breaking spaces to regular spaces I would try this:

import unicodedata
record['Procedure'] = unicodedata.normalize("NFKD", column[3])

The benefit would be to "downgrade" a number of pesky characters you may encounter and not just this particular one.

Some background reading: Unicode equivalence

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Answer 1 · 2021-02-13 00:11:48Z

PEP8 names

All of these variables:

def __init__(self, Date, DocumentType, Procedure, NoOfPages):
 self.Date = Date
 self.DocumentType = DocumentType
 self.Procedure = Procedure
 self.NoOfPages = NoOfPages

should be lower_snake_case.

Data classes

Replace JsonClass with a @dataclass that uses an implicit __init__.

Connection

First note that there's no point in separating out your connection parameters like

'Driver={SQL Server};' 'Server=xyz-PAVILION;'

since they will be implicitly concatenated to

'Driver={SQL Server};Server=xyz-PAVILION;'

Beyond that, the documentation states that kwargs are converted to a formatted conn string, so you're better off writing

pyodbc.connect(
 Driver='{SQL Server}',
 Server='xyz-PAVILION',
 Database='jsondata',
 Trusted_Connect='yes',
)

Context management

Use a with on your connection and cursor objects; read

https://github.com/mkleehammer/pyodbc/wiki/Connection#context-manager

Exception interference

Don't convert exceptions to booleans, as in

except pyodbc.Error as error:
 print('Error : %s' % error)
 return False
except:
 print('Operation Failed')
 return False
conn.close()
return True

Deal with the exceptions, potentially wrapping them in your own exception types, and catching them at an upper level.

Double serialization

Don't do this:

with open('html_to_json.json', 'w') as file_out:
 json.dump(dict_list, file_out, indent=4) # Serializing dict_list and writing in .json file
return json.dumps(dict_list)

Hold onto the result of dumps and write that string to the file.

Print helpers

Move this code:

 print('Date : ' + i.Date)
 print('DocumentType : ' + i.DocumentType)
 print('Procedure : ' + i.Procedure)
 print('NoOfPages : ' + i.NoOfPages)

to a method of JsonClass.

Hard-coded paths

my_file = open("C:/Users/xyz/Downloads/sample.htm", 'r')

should not be hard-coded. Set it as some kind of parameter - a command-line argument maybe.

Dict literals

 record = dict()
 date_obj = datetime.datetime.strptime(column[1], '%d.%m.%Y')
 record['Date'] = date_obj.date().isoformat()
 record['DocumentType'] = column[2]
 record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')
 record['NoOfPages'] = column[4]

should not call dict(), should not individually index keys, and should instead use a dict {} literal.

Print Helpers is not working. I created a method def PrintData(i) in JsonClass and using a loop to pass i(parameter to PrintData() method) from the json_desrialisation method. for i in json_class: ....JsonClass.PrintData(i)
It shouldn't be def PrintData(i); it should be def print_data(self)
I done changes to the above code. Have a look but the helper method is not working(getting a " TypeError: print_data() missing 1 required positional argument: 'i' ")
As Mast indicated, you should not be editing this question. If your new solution no longer works, consider posting on StackOverflow.

Kate Kate 8,0789 silver badges21 bronze badges · Answer 2 · 2021-02-13 21:07:07Z

Looks like you are trying to get rid of the Unicode non-breaking space:

record['Procedure'] = column[3].replace('\u00a0/\u00a0', '/').replace('\u00a0', '/')

Dealing with Unicode or character set conversions can be a headache, so I would suggest that you have a look at existing libraries to "normalize" the data. For example: unicodedata.normalize To convert those non-breaking spaces to regular spaces I would try this:

import unicodedata
record['Procedure'] = unicodedata.normalize("NFKD", column[3])

The benefit would be to "downgrade" a number of pesky characters you may encounter and not just this particular one.

Some background reading: Unicode equivalence

Stack Exchange Network

Web Scraping, Serializing and Deserializing in Python

2 Answers 2

PEP8 names

Data classes

Connection

Context management

Exception interference

Double serialization

Print helpers

Hard-coded paths

Dict literals

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Web Scraping, Serializing and Deserializing in Python

2 Answers 2

PEP8 names

Data classes

Connection

Context management

Exception interference

Double serialization

Print helpers

Hard-coded paths

Dict literals

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions