\$\begingroup\$

I have a data frame that has several columns, one of which contains html objects (containing tables). I want a column of table arrays.

My problem is that this piece of code takes a long time to run. Is there any way I can optimize this? I tried list comprehension, which doesn't significantly improve run time.

Some suggested that I restructure the logic. Any suggestions how?

 df = htmldf
 countries = dict(countries_for_language('en'))
 countrylist = list(countries.values())
 arrayoftableswithcountry = []
 arrayofhtmltables = []
 for idx, row in df.iterrows():
     #print("We are now at row ", idx+1, "of", len(df), ".")
     inner_html = tostring(row['html'])
     soup = bs(inner_html, 'lxml')
     tableswithcountry = []
     outputr = []
     for idex, item in enumerate(soup.select('table')):
         #print("Extracting", idex+1, "of", len(soup.select('table')), ".")
         table = soup.select('table')[idex]
         rows = table.find_all('tr')
         output = []
         outputrows = []
         for row in rows:
             cols = row.find_all('td')
             cols = [item.text.strip() for item in cols]
             output.append([item for item in cols if item])
         if methodsname == 'revseg_geo':
             if '$' in str(output):
                 for country in countrylist:
                     if country in str(output):
                         tableswithcountry.append(output)
                         outputr.append(table)
     arrayoftableswithcountry.append(tableswithcountry)
     arrayofhtmltables.append(outputr)
 df['arrayoftables'] = arrayoftableswithcountry
 df['arrayofhtmltables'] = arrayofhtmltables
 print('Made array of tables.')
 df.drop(columns=['html'])
asked May 26, 2021 at 6:00
\$\endgroup\$
  • \$\begingroup\$ The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How do I ask a good question?. \$\endgroup\$ Commented May 26, 2021 at 6:34
  • \$\begingroup\$ Please show an example of the data you're processing. \$\endgroup\$ Commented May 26, 2021 at 21:09
  • \$\begingroup\$ Your question as posted is borderline off-topic. (For our "unclear what you're asking" reason.) Questions on Code Review must contain a description (in English) of what your code is doing. I wanted to edit your title to resolve BCdotWEB's comment, however you've not provided me with the information to do so. \$\endgroup\$ Commented May 26, 2021 at 21:14

2 Answers

\$\begingroup\$

sensible names

Thesearenotgreatidentifiers:

 arrayoftableswithcountry = []
 arrayofhtmltables = []

Painful though it is to read camelCase python_code, even that would be preferable to this exercise in picking out the various word boundaries. It's easier to do with German nouns than in English.

I'm just going to pretend the for idx, row ... loop is exdented four spaces. Maybe there was some copy-n-paste difficulty.

You didn't show me def tostring(..., nor an import. I will just assume that it computes a result "fast". Maybe you intended to tell us from lxml.etree import tostring?

extract helper

The for idx, row ... loop should definitely be invoking a helper function. If nothing else, it would let the code creep a few spaces closer to the left margin.

It would also mitigate the need for this sort of naming nonsense:

 for idx, row in ...
 for idex, item in ...
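
A possible shape for such a helper, as a sketch only: the names `extract_tables` and `TableTextExtractor` are made up here, and the standard library's `html.parser` stands in for the question's `bs(...).select('table')` / `find_all` calls so the example is self-contained.

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr>, per <table>.

    A stdlib stand-in for the question's BeautifulSoup calls.
    """
    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])          # new table: a list of rows
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])      # new row: a list of cell texts
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_td and text and self.tables and self.tables[-1]:
            self.tables[-1][-1].append(text)

def extract_tables(html):
    """The per-row helper: one HTML string in, nested cell lists out."""
    parser = TableTextExtractor()
    parser.feed(html)
    return parser.tables
```

With a helper like this, `extract_tables("<table><tr><td>a</td><td>b</td></tr></table>")` returns `[[['a', 'b']]]`, and the body of the `for idx, row ...` loop shrinks to a single call.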

algorithm

This is crazy:

 for idx, row in ...
 for idex, item in ...
 if '$' in str(output):
 for country in countrylist:
 if country in str(output): ...

this piece of code takes a long time to run

output is "big", so str(output) is expensive. And the in operator's cost is linear in the size of its input. And we're performing such operations within inner loops. Don't do that.

Maintain flags for '$' and for country being present in strings as they get added to output. Then we can cheaply consult the flag, without repeatedly scanning and re-scanning the giant output.
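
A minimal sketch of that idea, assuming `countrylist` is the list the question builds from `countries_for_language` (a small hard-coded stand-in here) and using a hypothetical `build_table` helper:

```python
countrylist = ["Germany", "France", "Japan"]  # stand-in for the real list

def build_table(raw_rows):
    """Build `output` the way the question does, but maintain the two
    flags as cells are added, instead of re-scanning str(output) later."""
    output = []
    has_dollar = False
    has_country = False
    for cells in raw_rows:  # each `cells` stands in for a row's <td> texts
        stripped = [c.strip() for c in cells if c.strip()]
        for cell in stripped:
            has_dollar = has_dollar or "$" in cell
            has_country = has_country or any(co in cell for co in countrylist)
        output.append(stripped)
    # One pass over the cells in total, instead of len(countrylist) + 1
    # scans of the stringified table.
    return output, has_dollar and has_country
```

`build_table([["Revenue $100", "Germany"]])` yields the cleaned rows plus `True`, so the caller can append to `tableswithcountry` on the flag alone.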

answered Apr 29, 2024 at 0:53
\$\endgroup\$
\$\begingroup\$

Avoid Repeated BeautifulSoup Calls: Instead of calling soup.select('table') on every loop iteration, call it once per document and reuse the result. Each select() call re-scans the entire parsed tree.

Use apply() Instead of iterrows(): Replace the loop over the DataFrame with pandas' apply() method. It still calls your function once per value, so it is not truly vectorized, but it avoids the per-row overhead of iterrows() and keeps the extraction logic in one function.

Minimize Operations Inside Loops: Simplify your list comprehensions and reduce the complexity inside your loops. For instance, avoid nested loops where possible and combine steps.

Here’s a quick code snippet to demonstrate using apply():

 def extract_data(html_content):
     soup = bs(html_content, 'lxml')
     tables = soup.select('table')
     # Process your tables here and return the necessary data
     return tables

 df['processed_data'] = df['html'].apply(extract_data)

These changes should help speed up your data processing significantly.
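
The first point (don't recompute `soup.select('table')` per iteration) is the general rule of hoisting a loop-invariant call out of the loop; a self-contained sketch, with a made-up `expensive_scan` standing in for the selector:

```python
def expensive_scan(doc):
    """Stand-in for soup.select('table'): same result on every call."""
    return [part for part in doc.split("|") if part.startswith("t")]

doc = "t1|x|t2|t3"

# The question's pattern: the whole-document scan runs on every iteration.
slow = [expensive_scan(doc)[i] for i, _ in enumerate(expensive_scan(doc))]

# Hoisted: scan once, then just iterate the cached result.
tables = expensive_scan(doc)
fast = list(tables)

assert slow == fast == ["t1", "t2", "t3"]
```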

toolic
answered Apr 29, 2024 at 11:44
\$\endgroup\$