Parsing HTML table into Pandas DataFrame

Question 1

I have a text file containing an HTML table. The table is a bank statement, and I'd like to parse it into a pandas DataFrame. Is there a more graceful way to do it? I've only recently started learning Python, so there's a good chance you can give me some good advice.

from bs4 import BeautifulSoup
import pandas as pd
with open("sber2.txt", "r", encoding = "UTF8") as f:
    context = f.read()
    soup = BeautifulSoup(context, 'html.parser')
rows_dates = soup.find_all(attrs = {'data-bind':'momentDateText: date'})
rows_category = soup.find_all(attrs = {'data-bind' : 'text: categoryName'})
rows_comment = soup.find_all(attrs = {'data-bind' : 'text: comment'})
rows_money = soup.find_all(attrs = {'data-bind' : 'currencyText: nationalAmount'})
dic = {
    "dates" : [],
    "category" : [],
    "comment": [],
    "money" : []
    }
i = 0
while i < len(rows_dates):
    dic["dates"].append(rows_dates[i].text)
    dic["category"].append(rows_category[i].text)
    dic["comment"].append(rows_comment[i].text)
    dic["money"].append(rows_money[i].text)
    '''
    print(
    rows_dates[i].text, rows_category[i].text,
    rows_comment[i].text, rows_money[i].text)
    '''
    i += 1
df = pd.DataFrame(dic)
df.info()
print(df.head())

Output:

RangeIndex: 18 entries, 0 to 17
Data columns (total 4 columns):
category 18 non-null object
comment 18 non-null object
dates 18 non-null object
money 18 non-null object
dtypes: object(4)
memory usage: 656.0+ bytes
       category                       comment       dates    money
0  Supermarkets    PYATEROCHKA 1168 SAMARA RU  28.12.2017  -456,85
1  Supermarkets             KARUSEL SAMARA RU  26.12.2017  -710,78
2  Supermarkets    PYATEROCHKA 1168 SAMARA RU  24.12.2017  -800,24
3  Supermarkets  AUCHAN SAMARA IKEA SAMARA RU  19.12.2017  -154,38
4  Supermarkets    PYATEROCHKA 9481 SAMARA RU  16.12.2017  -188,80

Answer 1

zip() with a list comprehension to the rescue:

rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})
data = [
    {
        "dates": date.get_text(),
        "category": category.get_text(),
        "comment": comment.get_text(),
        "money": money.get_text()
    }
    for date, category, comment, money in zip(rows_dates, rows_category, rows_comment, rows_money)
]
df = pd.DataFrame(data)

Or, you can do it a bit differently, zipping the lists of texts and specifying the DataFrame headers via the columns argument:

rows_dates = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'momentDateText: date'})]
rows_category = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: categoryName'})]
rows_comment = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: comment'})]
rows_money = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})]
data = list(zip(rows_dates, rows_category, rows_comment, rows_money))
df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
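
For anyone without the statement file at hand, here is a minimal, self-contained sketch of the same zip() idea. Only the data-bind values come from the code above; the inline sample HTML, the use of td cells, and the names SAMPLE_HTML, BINDINGS and parse_statement are assumptions made up for illustration, so the real markup may well differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup: the real statement may use other tags,
# only the data-bind attribute values are taken from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td data-bind="momentDateText: date">28.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">PYATEROCHKA 1168 SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-456,85</td>
  </tr>
  <tr>
    <td data-bind="momentDateText: date">26.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">KARUSEL SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-710,78</td>
  </tr>
</table>
"""

# Column name -> data-bind value that marks the cells of that column.
BINDINGS = {
    "dates": "momentDateText: date",
    "category": "text: categoryName",
    "comment": "text: comment",
    "money": "currencyText: nationalAmount",
}


def parse_statement(html):
    """Collect the text of each bound cell and zip the columns into rows."""
    soup = BeautifulSoup(html, "html.parser")
    # One list of cell texts per column, in the order of BINDINGS.
    columns = [
        [tag.get_text(strip=True)
         for tag in soup.find_all(attrs={"data-bind": binding})]
        for binding in BINDINGS.values()
    ]
    # zip(*columns) turns the per-column lists into per-row tuples.
    rows = list(zip(*columns))
    return pd.DataFrame(rows, columns=list(BINDINGS))


df = parse_statement(SAMPLE_HTML)
print(df)
```

Keep in mind that zip() stops at the shortest input, so a column with missing cells would shorten the result silently; comparing the list lengths before building the DataFrame is a cheap safeguard.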

alecxe · Accepted Answer · 2017-12-31 17:56:01Z
