So here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
I had to use a POST request because it's an ASP page, and I had to send the right form data: I'm looking in the College of Business for all tables from a specific semester. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected BeautifulSoup to be able to parse the text and return it nice and neat, with each column separated, so that I could put it into a dataframe afterwards, or perhaps save it to a CSV file. But I have no idea how to get rid of all of these tags and attributes. I tried using this code to do so, and it removed the ones specified, but td and tr didn't work:
for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        # class/id/name/style are attributes, but td and tr are tag names,
        # so deleting them as attributes has no effect
        del tag[attribute]
Then I tried a package called bleach, but when I passed 'tables' into it, it complained that the input must be text, so apparently I can't feed my table to it. Ideally, I'd like each row of the table split cleanly into separate columns in the output.
So I'm truly at a loss as to how to format this properly. Any help is much appreciated.
1 Answer
Give this a try. I suppose this is what you expected. Btw, if there is more than one table on that page and you want a different one, just tweak the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0", "") for items in list_item.select("td")]
              for list_item in tables.select("tr")]
for data in list_items:
    print(' '.join(data))
Partial results:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36
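If you want this in a dataframe or a CSV file, as mentioned in the question, here is a minimal sketch built on top of the list_items above, assuming pandas is installed (the file name and the lack of a header row are just example choices):
import pandas as pd

# the scraped rows have different lengths, so pandas pads the shorter ones with NaN
df = pd.DataFrame(list_items)
df.to_csv('instructor_evals.csv', index=False, header=False)
Alternatively, pandas can parse the tables straight from the HTML with pd.read_html(res.text), which returns a list of DataFrames, one per table on the page.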