When scraping and saving data to a file, which method is more efficient?

- Open the file first, then scrape and save the data while the file is open, or
- store the data in a list of dictionaries and then save it to the file?

For example, the following two scripts scrape data from Yahoo Finance. In the first method, the file is opened first, and the data is scraped and saved to the file while it is open.
import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'

def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html

# Opening the file first
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'"
                                )
    dictWriter.writeheader()

    content = get_page(URL)
    table_rows = content.find('tbody', first=True).find('tr')
    for row in table_rows:
        symbol = row.find('td')[0].text
        name = row.find('td')[1].text
        last_price = row.find('td')[2].text
        change = float(row.find('td')[3].text.lstrip('+'))
        percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
        data = {'Symbol': symbol,
                'Name': name,
                'lastPrice': last_price,
                'Change': change,
                'percentChange': percent_change}
        # Saving data
        dictWriter.writerow(data)
In the second method, the data is scraped and collected in a list, and then written to the CSV file.
import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'

def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html

content = get_page(URL)
table_rows = content.find('tbody', first=True).find('tr')

records = []
for row in table_rows:
    symbol = row.find('td')[0].text
    name = row.find('td')[1].text
    last_price = row.find('td')[2].text
    change = float(row.find('td')[3].text.lstrip('+'))
    percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
    data = {'Symbol': symbol,
            'Name': name,
            'lastPrice': last_price,
            'Change': change,
            'percentChange': percent_change}
    # Saving data to list:
    records.append(data)

# Then opening the file:
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'"
                                )
    dictWriter.writeheader()
    for row in records:
        # Saving data to file:
        dictWriter.writerow(row)
Questions:

- Is the first method more efficient because it skips appending to a list and the extra for loop needed to save the data to a file?
- Are there any taboos for performing operations on an open file within a context manager, as in the first method?
2 Answers
Neither? Delete everything and write
import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page;
# the trailing comma unpacks the single table found here.
df, = pd.read_html('https://finance.yahoo.com/lookup/')
df.to_csv('yahoo_finance.csv')
This produces perfectly reasonable output, and I do not believe that your case calls for the complexity of your current code. Pandas has sane internals when it comes to file I/O.
In your first case (writing to file immediately):
- File data is buffered and flushed periodically
- If your crawl crashes mid-way, you'll end up with a partially-written file
- Constant memory size (besides whatever you're downloading and scraping from)
- One logical chunk
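If partial output on a crash is acceptable but you want to bound how much buffered data a crash can cost you, you can flush after each row. A minimal sketch of that idea, using a stand-in scrape() generator and trimmed field names rather than the real scraping code:

import csv

def scrape():
    # Stand-in for the real scraping loop; yields one record at a time.
    yield {'Symbol': 'AAPL', 'Name': 'Apple Inc.'}
    yield {'Symbol': 'MSFT', 'Name': 'Microsoft Corporation'}

with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    writer = csv.DictWriter(file, fieldnames=['Symbol', 'Name'])
    writer.writeheader()
    for record in scrape():
        writer.writerow(record)
        file.flush()  # hand the buffered bytes to the OS after every row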
In your second case (save to list, then write to file all at once):
- File data is buffered, but writing happens as rapidly as possible
- If your crawl crashes mid-way, you'll save no results and won't touch the output file
- Unknown memory size
- Two logical chunks (crawling, then saving)
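If that all-or-nothing behaviour is the property you actually care about, you can make it explicit by writing to a temporary file and renaming it into place only once the crawl succeeds. A sketch of that pattern, assuming the records have already been collected and with trimmed field names:

import csv
import os
import tempfile

records = [{'Symbol': 'AAPL', 'Name': 'Apple Inc.'}]  # assume already scraped

# Write to a temp file in the target directory, then atomically swap it in,
# so yahoo_finance.csv is never observed in a half-written state.
fd, tmp_path = tempfile.mkstemp(dir='.', suffix='.csv')
with os.fdopen(fd, 'w', newline='', encoding='utf8') as file:
    writer = csv.DictWriter(file, fieldnames=['Symbol', 'Name'])
    writer.writeheader()
    writer.writerows(records)
os.replace(tmp_path, 'yahoo_finance.csv')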
What's best depends on your particular use case. However, since you're writing your file using the same code either way, there's no overhead you avoid by pre-building your row list.
In general, I'd say there are more advantages to the first approach: giving you a single loop rather than two, and not needing an intermediate data object of unknown size.
The main advantage of the second approach is that you might be able to use lower-overhead approaches to writing the file:
dictWriter.writeheader()
dictWriter.writerows(records)
Also, some file serialization methods require you to have the entire file ready to write up front (e.g., json.dump, or if you used Pandas to structure and save your data), in which case the second method is pretty much your only choice.
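To illustrate the json.dump case with a couple of made-up records: the call serializes the whole object in one go, so every record has to be collected first.

import json

records = [{'Symbol': 'AAPL', 'Change': 1.23},
           {'Symbol': 'MSFT', 'Change': -0.45}]

# json.dump writes the complete structure in a single call; there is no
# row-by-row equivalent of csv's writerow for building one JSON array.
with open('yahoo_finance.json', 'w', encoding='utf8') as file:
    json.dump(records, file, indent=2)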
In summary, beyond mere preference, I'd say it depends on two main factors:
- Is your intermediate data potentially large? If so, try to use method 1.
- Does your file serialization method support iterative saving? If not, you'll need to use method 2.