
When scraping data and saving it to a file, which method is more efficient?

  1. open the file first, then scrape and write each row while the file is still open, or
  2. store the scraped data in a list of dictionaries and then write it to the file?

For example, the following two scripts scrape data from Yahoo Finance. In the first method, the file is opened first and each row is written to it as it is scraped.

import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'


def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html


# Opening the file first
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'")
    dictWriter.writeheader()

    content = get_page(URL)
    table_rows = content.find('tbody', first=True).find('tr')
    for row in table_rows:
        symbol = row.find('td')[0].text
        name = row.find('td')[1].text
        last_price = row.find('td')[2].text
        change = float(row.find('td')[3].text.lstrip('+'))
        percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
        data = {'Symbol': symbol,
                'Name': name,
                'lastPrice': last_price,
                'Change': change,
                'percentChange': percent_change}

        # Saving data
        dictWriter.writerow(data)

In the second method, the data is scraped into a list first, and the list is then written to the CSV file.

import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'


def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html


content = get_page(URL)
table_rows = content.find('tbody', first=True).find('tr')

records = []
for row in table_rows:
    symbol = row.find('td')[0].text
    name = row.find('td')[1].text
    last_price = row.find('td')[2].text
    change = float(row.find('td')[3].text.lstrip('+'))
    percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
    data = {'Symbol': symbol,
            'Name': name,
            'lastPrice': last_price,
            'Change': change,
            'percentChange': percent_change}
    # Saving data to list:
    records.append(data)

# Then opening the file:
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'")
    dictWriter.writeheader()
    for row in records:
        # Saving data to file:
        dictWriter.writerow(row)

Questions:

  1. Is the first method more efficient because it skips appending to a list and the extra for loop that writes the data to the file?

  2. Are there any taboos against doing other work (such as scraping) while the file is open inside a context manager, as in the first method?

asked Oct 5, 2022 at 16:21

2 Answers


Neither? Delete everything and write

import pandas as pd
df, = pd.read_html('https://finance.yahoo.com/lookup/')
df.to_csv('yahoo_finance.csv')

This produces perfectly reasonable output, and I do not believe that your case calls for the complexity of your current code. Pandas has sane internals when it comes to file I/O.
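
One small caveat worth noting: pandas.read_html needs an HTML parser backend such as lxml installed, and to_csv writes the DataFrame's row index as an extra first column unless told otherwise. A minimal variation under those assumptions:

import pandas as pd

# read_html returns a list of DataFrames; this page yields exactly one
df, = pd.read_html('https://finance.yahoo.com/lookup/')
df.to_csv('yahoo_finance.csv', index=False)  # index=False drops the row-number column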

answered Apr 5, 2023 at 0:14

In your first case (writing to file immediately):

  • File data is buffered and flushed periodically
  • If your crawl crashes mid-way, you'll end up with a partially-written file (see the flush sketch after this list)
  • Constant memory size (besides whatever you're downloading and scraping from)
  • One logical chunk
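
If the partial-file risk matters, the write-as-you-go loop can flush its buffer every so many rows, so a crash loses at most that many buffered rows. A minimal sketch of the idea, reusing the dictWriter, file, and table_rows names from the question's first script; FLUSH_EVERY and parse_row are hypothetical names used only for illustration:

FLUSH_EVERY = 50  # hypothetical batch size

for i, row in enumerate(table_rows, start=1):
    dictWriter.writerow(parse_row(row))  # parse_row: hypothetical helper that builds the row dict
    if i % FLUSH_EVERY == 0:
        file.flush()  # hand buffered rows to the OS, so a crash loses at most FLUSH_EVERY rows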

In your second case (save to list, then write to file all at once):

  • File data is buffered, but writing happens as rapidly as possible
  • If your crawl crashes mid-way, you'll save no results and won't touch the output file
  • Unknown memory size
  • Two logical chunks (crawling, then saving)

What's best depends on your particular use case. However, since you're writing your file using the same code either way, there's no overhead you avoid by pre-building your row list.

In general, I'd say there are more advantages to the first approach: giving you a single loop rather than two, and not needing an intermediate data object of unknown size.

The main advantage of the second approach is that you might be able to use lower-overhead approaches to writing the file:

dictWriter.writeheader()
dictWriter.writerows(records)

Also, some file serialization methods require you to have the entire data set ready before writing (e.g. json.dump, or if you used Pandas to structure and save your data), in which case the second method is pretty much your only choice.
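
For instance, a JSON dump of the same rows has to be handed the complete list in one call (a minimal sketch, assuming the records list built in the question's second script; yahoo_finance.json is just an illustrative filename):

import json

# json.dump serializes the whole object in one go, so the rows must already be collected
with open('yahoo_finance.json', 'w', encoding='utf8') as f:
    json.dump(records, f, indent=2)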

In summary, beyond mere preference, I'd say it depends on two main factors:

  • Is your intermediate data potentially large? If so, try to use method 1.
  • Does your file serialization method support iterative saving? If not, you'll need to use method 2.
answered Oct 5, 2022 at 20:23