\$\begingroup\$

I have a data frame that has several columns, one of which contains html objects (containing tables). I want a column of table arrays.

My problem is that this piece of code takes a long time to run. Is there any way I can optimize this? I tried list comprehension, which doesn't significantly improve run time.

Some suggested that I restructure the logic. Any suggestions how?

 df = htmldf
 countries = dict(countries_for_language('en'))
 countrylist = list(countries.values())
 arrayoftableswithcountry = []
 arrayofhtmltables = []
 for idx, row in df.iterrows():
     #print("We are now at row ", idx+1, "of", len(df), ".")
     inner_html = tostring(row['html'])
     soup = bs(inner_html, 'lxml')
     tableswithcountry = []
     outputr = []
     for idex, item in enumerate(soup.select('table')):
         #print("Extracting", idex+1, "of", len(soup.select('table')), ".")
         table = soup.select('table')[idex]
         rows = table.find_all('tr')
         output = []
         outputrows = []
         for row in rows:
             cols = row.find_all('td')
             cols = [item.text.strip() for item in cols]
             output.append([item for item in cols if item])
         if methodsname == 'revseg_geo':
             if '$' in str(output):
                 for country in countrylist:
                     if country in str(output):
                         tableswithcountry.append(output)
                         outputr.append(table)
     arrayoftableswithcountry.append(tableswithcountry)
     arrayofhtmltables.append(outputr)
 df['arrayoftables'] = arrayoftableswithcountry
 df['arrayofhtmltables'] = arrayofhtmltables
 print('Made array of tables.')
 df.drop(columns=['html'])
asked May 26, 2021 at 6:00
\$\endgroup\$
  • \$\begingroup\$ The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How do I ask a good question?. \$\endgroup\$ Commented May 26, 2021 at 6:34
  • \$\begingroup\$ Please show an example of the data you're processing. \$\endgroup\$ Commented May 26, 2021 at 21:09
  • \$\begingroup\$ Your question as posted is borderline off-topic. (For our "unclear what you're asking" reason.) Questions on Code Review must contain a description (in English) of what your code is doing. I wanted to edit your title to resolve BCdotWEB's comment, however you've not provided me with the information to do so. \$\endgroup\$ Commented May 26, 2021 at 21:14

2 Answers

\$\begingroup\$

sensible names

Thesearenotgreatidentifiers:

 arrayoftableswithcountry = []
 arrayofhtmltables = []

Painful though it is to read camelCase python_code, even that would be preferable to this exercise in picking out the various word boundaries. It's easier to do with German nouns than in English.

I'm just going to pretend the for idx, row ... loop is exdented four spaces. Maybe there was some copy-n-paste difficulty.

You didn't show me def tostring(..., nor an import. I will just assume that it computes a result "fast". Maybe you intended to tell us from lxml.etree import tostring?

extract helper

The for idx, row ... loop should definitely be invoking a helper function. If nothing else, it would let the code creep a few spaces closer to the left margin.

It would also mitigate the need for this sort of naming nonsense:

 for idx, row in ...
 for idex, item in ...
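
A possible shape for such a helper, as a sketch only: the names `extract_tables` and `TableTextExtractor` are made up here, and the standard library's `html.parser` stands in for the question's `bs(...).select('table')` / `find_all` calls so the example is self-contained.

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr>, per <table>.

    A stdlib stand-in for the question's BeautifulSoup calls.
    """
    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])          # new table: a list of rows
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])      # new row: a list of cell texts
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_td and text and self.tables and self.tables[-1]:
            self.tables[-1][-1].append(text)

def extract_tables(html):
    """The per-row helper: one HTML string in, nested cell lists out."""
    parser = TableTextExtractor()
    parser.feed(html)
    return parser.tables
```

With a helper like this, `extract_tables("<table><tr><td>a</td><td>b</td></tr></table>")` returns `[[['a', 'b']]]`, and the body of the `for idx, row ...` loop shrinks to a single call.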

algorithm

This is crazy:

 for idx, row in ...
 for idex, item in ...
 if '$' in str(output):
 for country in countrylist:
 if country in str(output): ...

this piece of code takes a long time to run

output is "big", so str(output) is expensive. And the in operator's cost is linear in the size of its input. And we're performing such operations within inner loops. Don't do that.

Maintain flags for '$' and for country being present in strings as they get added to output. Then we can cheaply consult the flag, without repeatedly scanning and re-scanning the giant output.
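
A minimal sketch of that idea, assuming `countrylist` is the list the question builds from `countries_for_language` (a small hard-coded stand-in here) and using a hypothetical `build_table` helper:

```python
countrylist = ["Germany", "France", "Japan"]  # stand-in for the real list

def build_table(raw_rows):
    """Build `output` the way the question does, but maintain the two
    flags as cells are added, instead of re-scanning str(output) later."""
    output = []
    has_dollar = False
    has_country = False
    for cells in raw_rows:  # each `cells` stands in for a row's <td> texts
        stripped = [c.strip() for c in cells if c.strip()]
        for cell in stripped:
            has_dollar = has_dollar or "$" in cell
            has_country = has_country or any(co in cell for co in countrylist)
        output.append(stripped)
    # One pass over the cells in total, instead of len(countrylist) + 1
    # scans of the stringified table.
    return output, has_dollar and has_country
```

`build_table([["Revenue $100", "Germany"]])` yields the cleaned rows plus `True`, so the caller can append to `tableswithcountry` on the flag alone.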

answered Apr 29, 2024 at 0:53
\$\endgroup\$
\$\begingroup\$

Avoid Repeated BeautifulSoup Calls: Instead of calling soup.select('table') on every loop iteration, call it once per document and reuse the result. Each select() call re-scans the entire parsed tree.

Use apply() Instead of iterrows(): Replace the loop over the DataFrame with pandas' apply() method. It still calls your function once per value, so it is not truly vectorized, but it avoids the per-row overhead of iterrows() and keeps the extraction logic in one function.

Minimize Operations Inside Loops: Simplify your list comprehensions and reduce the complexity inside your loops. For instance, avoid nested loops where possible and combine steps.

Here’s a quick code snippet to demonstrate using apply():

 def extract_data(html_content):
     soup = bs(html_content, 'lxml')
     tables = soup.select('table')
     # Process your tables here and return the necessary data
     return tables

 df['processed_data'] = df['html'].apply(extract_data)

These changes should help speed up your data processing significantly.
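
The first point (don't recompute `soup.select('table')` per iteration) is the general rule of hoisting a loop-invariant call out of the loop; a self-contained sketch, with a made-up `expensive_scan` standing in for the selector:

```python
def expensive_scan(doc):
    """Stand-in for soup.select('table'): same result on every call."""
    return [part for part in doc.split("|") if part.startswith("t")]

doc = "t1|x|t2|t3"

# The question's pattern: the whole-document scan runs on every iteration.
slow = [expensive_scan(doc)[i] for i, _ in enumerate(expensive_scan(doc))]

# Hoisted: scan once, then just iterate the cached result.
tables = expensive_scan(doc)
fast = list(tables)

assert slow == fast == ["t1", "t2", "t3"]
```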

toolic
answered Apr 29, 2024 at 11:44
\$\endgroup\$