I have a text file containing an HTML table, which is a bank statement. I'd like to parse it into a pandas DataFrame. Is there a more graceful way to do it? I've only recently started learning Python, so there is a good chance you can give me some good advice.
from bs4 import BeautifulSoup
import pandas as pd

with open("sber2.txt", "r", encoding = "UTF8") as f:
    context = f.read()

soup = BeautifulSoup(context, 'html.parser')
rows_dates = soup.find_all(attrs = {'data-bind':'momentDateText: date'})
rows_category = soup.find_all(attrs = {'data-bind' : 'text: categoryName'})
rows_comment = soup.find_all(attrs = {'data-bind' : 'text: comment'})
rows_money = soup.find_all(attrs = {'data-bind' : 'currencyText: nationalAmount'})

dic = {
    "dates" : [],
    "category" : [],
    "comment": [],
    "money" : []
}

i = 0
while i < len(rows_dates):
    dic["dates"].append(rows_dates[i].text)
    dic["category"].append(rows_category[i].text)
    dic["comment"].append(rows_comment[i].text)
    dic["money"].append(rows_money[i].text)
    '''
    print(
        rows_dates[i].text, rows_category[i].text,
        rows_comment[i].text, rows_money[i].text)
    '''
    i += 1

df = pd.DataFrame(dic)
df.info()
print(df.head())
Output:
RangeIndex: 18 entries, 0 to 17
Data columns (total 4 columns):
category 18 non-null object
comment 18 non-null object
dates 18 non-null object
money 18 non-null object
dtypes: object(4)
memory usage: 656.0+ bytes
category comment dates money
0 Supermarkets PYATEROCHKA 1168 SAMARA RU 28.12.2017 -456,85
1 Supermarkets KARUSEL SAMARA RU 26.12.2017 -710,78
2 Supermarkets PYATEROCHKA 1168 SAMARA RU 24.12.2017 -800,24
3 Supermarkets AUCHAN SAMARA IKEA SAMARA RU 19.12.2017 -154,38
4 Supermarkets PYATEROCHKA 9481 SAMARA RU 16.12.2017 -188,80
1 Answer
zip() with a list comprehension to the rescue:
rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})
data = [
    {
        "dates": date.get_text(),
        "category": category.get_text(),
        "comment": comment.get_text(),
        "money": money.get_text()
    }
    for date, category, comment, money in zip(rows_dates, rows_category, rows_comment, rows_money)
]
df = pd.DataFrame(data)
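One caveat worth keeping in mind: zip() stops at the shortest input. If any entry in the statement lacked, say, a comment element, the rows would be truncated (and possibly misaligned) without any error, whereas the index-based loop in the question would raise an IndexError. Below is a minimal sketch of a length check before zipping. The plain lists stand in for the texts extracted by find_all(), the values are made up for illustration, and zip_checked is a hypothetical helper, not part of BeautifulSoup or pandas.

```python
# Hypothetical stand-ins for the texts extracted by find_all(); values are made up.
dates = ["28.12.2017", "26.12.2017", "24.12.2017"]
categories = ["Supermarkets", "Supermarkets", "Supermarkets"]
comments = ["PYATEROCHKA 1168 SAMARA RU", "KARUSEL SAMARA RU"]  # one comment missing
amounts = ["-456,85", "-710,78", "-800,24"]


def zip_checked(**columns):
    """Zip equally long columns into row dicts; raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Plain zip() silently drops the third row because `comments` has only two items.
silent = list(zip(dates, categories, comments, amounts))
print(len(silent))

# The checked variant reports the mismatch instead of truncating.
try:
    zip_checked(dates=dates, category=categories, comment=comments, money=amounts)
except ValueError as exc:
    print(exc)
```

On Python 3.10 and later, zip(..., strict=True) gives a similar guarantee without a helper.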
Or, you can do it a bit differently: zip the lists of texts and specify the DataFrame headers via the columns argument:
rows_dates = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'momentDateText: date'})]
rows_category = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: categoryName'})]
rows_comment = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: comment'})]
rows_money = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})]
data = list(zip(rows_dates, rows_category, rows_comment, rows_money))
df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
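The df.info() output in the question shows all four columns as object, i.e. plain strings. If dates and amounts are needed as real datetime and float values, one possible follow-up step is sketched below. It uses two sample rows copied from the question's output and assumes, as that output suggests, day-first dates in DD.MM.YYYY form and amounts with a decimal comma and no thousands separator; the real file may need extra cleaning.

```python
import pandas as pd

# Two sample rows copied from the output shown in the question.
data = [
    ("28.12.2017", "Supermarkets", "PYATEROCHKA 1168 SAMARA RU", "-456,85"),
    ("26.12.2017", "Supermarkets", "KARUSEL SAMARA RU", "-710,78"),
]
df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])

# Day-first dates such as 28.12.2017 are parsed with an explicit format.
df["dates"] = pd.to_datetime(df["dates"], format="%d.%m.%Y")

# Assumption: decimal comma, no thousands separator in the amounts.
df["money"] = df["money"].str.replace(",", ".", regex=False).astype(float)

print(df.dtypes)
```

With typed columns, operations such as df["money"].sum() or filtering by date range work directly.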