2
\$\begingroup\$

I have a text file (linked) that contains an HTML table. The table is a bank statement, and I'd like to parse it into a pandas DataFrame. Is there a more graceful way to do it? I started learning Python only recently, so there is a good chance you can give me some good advice.

from bs4 import BeautifulSoup
import pandas as pd

with open("sber2.txt", "r", encoding = "UTF8") as f:
    context = f.read()
    soup = BeautifulSoup(context, 'html.parser')

rows_dates = soup.find_all(attrs = {'data-bind':'momentDateText: date'})
rows_category = soup.find_all(attrs = {'data-bind' : 'text: categoryName'})
rows_comment = soup.find_all(attrs = {'data-bind' : 'text: comment'})
rows_money = soup.find_all(attrs = {'data-bind' : 'currencyText: nationalAmount'})

dic = {
    "dates" : [],
    "category" : [],
    "comment": [],
    "money" : []
    }

i = 0
while i < len(rows_dates):
    dic["dates"].append(rows_dates[i].text)
    dic["category"].append(rows_category[i].text)
    dic["comment"].append(rows_comment[i].text)
    dic["money"].append(rows_money[i].text)
    '''
    print(
        rows_dates[i].text, rows_category[i].text,
        rows_comment[i].text, rows_money[i].text)
    '''
    i += 1

df = pd.DataFrame(dic)
df.info()
print(df.head())

Output:

RangeIndex: 18 entries, 0 to 17
Data columns (total 4 columns):
category    18 non-null object
comment     18 non-null object
dates       18 non-null object
money       18 non-null object
dtypes: object(4)
memory usage: 656.0+ bytes
       category                       comment       dates    money
0  Supermarkets    PYATEROCHKA 1168 SAMARA RU  28.12.2017  -456,85
1  Supermarkets             KARUSEL SAMARA RU  26.12.2017  -710,78
2  Supermarkets    PYATEROCHKA 1168 SAMARA RU  24.12.2017  -800,24
3  Supermarkets  AUCHAN SAMARA IKEA SAMARA RU  19.12.2017  -154,38
4  Supermarkets    PYATEROCHKA 9481 SAMARA RU  16.12.2017  -188,80
asked Dec 31, 2017 at 17:27
\$\endgroup\$

1 Answer

1
\$\begingroup\$

zip() with a list comprehension to the rescue:

rows_dates = soup.find_all(attrs={'data-bind': 'momentDateText: date'})
rows_category = soup.find_all(attrs={'data-bind': 'text: categoryName'})
rows_comment = soup.find_all(attrs={'data-bind': 'text: comment'})
rows_money = soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})

# One dict per statement row; zip() walks the four result lists in parallel.
data = [
    {
        "dates": date.get_text(),
        "category": category.get_text(),
        "comment": comment.get_text(),
        "money": money.get_text()
    }
    for date, category, comment, money in zip(rows_dates, rows_category, rows_comment, rows_money)
]

# A list of dicts maps directly onto DataFrame rows; the keys become the column names.
df = pd.DataFrame(data)
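
One caveat worth keeping in mind: zip() stops at the shortest input. If one of the four find_all() results were shorter than the others, the trailing rows would be dropped silently, whereas the indexed loop in the question would raise an IndexError. Below is a minimal sketch of a guard, assuming plain lists of strings stand in for the parsed tags; the helper name zip_same_length is hypothetical and not part of any library.

```python
def zip_same_length(*columns):
    """Zip the columns into row tuples, failing loudly if their lengths differ."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


# Stand-in data (assumption); in the answer these would be the texts
# extracted from the find_all() results.
dates = ["28.12.2017", "26.12.2017"]
money = ["-456,85", "-710,78"]

rows = zip_same_length(dates, money)
```

On Python 3.10 and later, the built-in zip(..., strict=True) performs the same check without a helper.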

Alternatively, extract the texts first, zip the resulting lists into rows, and specify the DataFrame headers via the columns argument:

rows_dates = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'momentDateText: date'})]
rows_category = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: categoryName'})]
rows_comment = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'text: comment'})]
rows_money = [item.get_text() for item in soup.find_all(attrs={'data-bind': 'currencyText: nationalAmount'})]
data = list(zip(rows_dates, rows_category, rows_comment, rows_money))
df = pd.DataFrame(data, columns=["dates", "category", "comment", "money"])
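
To try the zip-and-columns approach without the original file, here is a self-contained sketch. The markup in SAMPLE_HTML is an assumption (the real statement's tag names and nesting are not shown in the post); only the data-bind values and the two sample rows are taken from the question. The BINDINGS mapping is a hypothetical convenience for keeping column names and selectors together.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: only the data-bind values come from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td data-bind="momentDateText: date">28.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">PYATEROCHKA 1168 SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-456,85</td>
  </tr>
  <tr>
    <td data-bind="momentDateText: date">26.12.2017</td>
    <td data-bind="text: categoryName">Supermarkets</td>
    <td data-bind="text: comment">KARUSEL SAMARA RU</td>
    <td data-bind="currencyText: nationalAmount">-710,78</td>
  </tr>
</table>
"""

# Column name -> data-bind value that marks the cells of that column.
BINDINGS = {
    "dates": "momentDateText: date",
    "category": "text: categoryName",
    "comment": "text: comment",
    "money": "currencyText: nationalAmount",
}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One list of cell texts per column, in the same order as BINDINGS.
texts = [
    [tag.get_text(strip=True) for tag in soup.find_all(attrs={"data-bind": binding})]
    for binding in BINDINGS.values()
]

# zip(*texts) turns the per-column lists into per-row tuples.
df = pd.DataFrame(list(zip(*texts)), columns=list(BINDINGS))
print(df)
```

Keeping the selectors in one mapping means that adding a fifth column is a one-line change, at the cost of a slightly less explicit script than four named variables.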
answered Dec 31, 2017 at 17:56
\$\endgroup\$
