When scraping and saving data to a file, which method is more efficient?

- Open the file first, then scrape and save the data while the file is open, or
- store the data in a list of dictionaries and then save it to the file?

For example, the following two scripts scrape data from Yahoo Finance. In the first method, the file is opened first, and the data is scraped and saved to the file while it is open.
import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'

def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html

# Opening the file first
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'"
                                )
    dictWriter.writeheader()

    content = get_page(URL)
    table_rows = content.find('tbody', first=True).find('tr')
    for row in table_rows:
        symbol = row.find('td')[0].text
        name = row.find('td')[1].text
        last_price = row.find('td')[2].text
        change = float(row.find('td')[3].text.lstrip('+'))
        percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
        data = {'Symbol': symbol,
                'Name': name,
                'lastPrice': last_price,
                'Change': change,
                'percentChange': percent_change}
        # Saving data
        dictWriter.writerow(data)
In the second method, the data is scraped and collected in a list, and then written to the CSV file.
import csv
from requests_html import HTMLSession

URL = 'https://finance.yahoo.com/lookup/'

def get_page(url):
    session = HTMLSession()
    r = session.get(url)
    r.raise_for_status()
    return r.html

content = get_page(URL)
table_rows = content.find('tbody', first=True).find('tr')

records = []
for row in table_rows:
    symbol = row.find('td')[0].text
    name = row.find('td')[1].text
    last_price = row.find('td')[2].text
    change = float(row.find('td')[3].text.lstrip('+'))
    percent_change = float(row.find('td')[4].text.lstrip('+').rstrip('%'))
    data = {'Symbol': symbol,
            'Name': name,
            'lastPrice': last_price,
            'Change': change,
            'percentChange': percent_change}
    # Saving data to list:
    records.append(data)

# Then opening the file:
with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    dictWriter = csv.DictWriter(file,
                                fieldnames=['Symbol', 'Name', 'lastPrice', 'Change', 'percentChange'],
                                quoting=csv.QUOTE_MINIMAL,
                                quotechar="'"
                                )
    dictWriter.writeheader()
    for row in records:
        # Saving data to file:
        dictWriter.writerow(row)
Questions:

- Is the first method more efficient because it skips appending to a list and the extra for loop needed to save the data to a file?
- Are there any taboos for performing operations on an open file within a context manager, as in the first method?
2 Answers
Neither? Delete everything and write
import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page;
# the trailing comma unpacks the single table found here.
df, = pd.read_html('https://finance.yahoo.com/lookup/')
df.to_csv('yahoo_finance.csv')
This produces perfectly reasonable output, and I do not believe that your case calls for the complexity of your current code. Pandas has sane internals when it comes to file I/O.
In your first case (writing to file immediately):
- File data is buffered and flushed periodically
- If your crawl crashes mid-way, you'll end up with a partially-written file
- Constant memory size (besides whatever you're downloading and scraping from)
- One logical chunk
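If partial output on a crash is acceptable but you want to bound how much buffered data a crash can cost you, you can flush after each row. A minimal sketch of that idea, using a stand-in scrape() generator and trimmed field names rather than the real scraping code:

import csv

def scrape():
    # Stand-in for the real scraping loop; yields one record at a time.
    yield {'Symbol': 'AAPL', 'Name': 'Apple Inc.'}
    yield {'Symbol': 'MSFT', 'Name': 'Microsoft Corporation'}

with open('yahoo_finance.csv', 'w', newline='', encoding='utf8') as file:
    writer = csv.DictWriter(file, fieldnames=['Symbol', 'Name'])
    writer.writeheader()
    for record in scrape():
        writer.writerow(record)
        file.flush()  # hand the buffered bytes to the OS after every row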
In your second case (save to list, then write to file all at once):
- File data is buffered, but writing happens as rapidly as possible
- If your crawl crashes mid-way, you'll save no results and won't touch the output file
- Unknown memory size
- Two logical chunks (crawling, then saving)
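If that all-or-nothing behaviour is the property you actually care about, you can make it explicit by writing to a temporary file and renaming it into place only once the crawl succeeds. A sketch of that pattern, assuming the records have already been collected and with trimmed field names:

import csv
import os
import tempfile

records = [{'Symbol': 'AAPL', 'Name': 'Apple Inc.'}]  # assume already scraped

# Write to a temp file in the target directory, then atomically swap it in,
# so yahoo_finance.csv is never observed in a half-written state.
fd, tmp_path = tempfile.mkstemp(dir='.', suffix='.csv')
with os.fdopen(fd, 'w', newline='', encoding='utf8') as file:
    writer = csv.DictWriter(file, fieldnames=['Symbol', 'Name'])
    writer.writeheader()
    writer.writerows(records)
os.replace(tmp_path, 'yahoo_finance.csv')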
What's best depends on your particular use case. However, since you're writing your file using the same code either way, there's no overhead you avoid by pre-building your row list.
In general, I'd say there are more advantages to the first approach: giving you a single loop rather than two, and not needing an intermediate data object of unknown size.
The main advantage of the second approach is that you might be able to use lower-overhead approaches to writing the file:
dictWriter.writeheader()
dictWriter.writerows(records)
Also, some file serialization methods require you to have the entire file ready to write up front (e.g., json.dump, or if you used Pandas to structure and save your data), in which case the second method is pretty much your only choice.
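To illustrate the json.dump case with a couple of made-up records: the call serializes the whole object in one go, so every record has to be collected first.

import json

records = [{'Symbol': 'AAPL', 'Change': 1.23},
           {'Symbol': 'MSFT', 'Change': -0.45}]

# json.dump writes the complete structure in a single call; there is no
# row-by-row equivalent of csv's writerow for building one JSON array.
with open('yahoo_finance.json', 'w', encoding='utf8') as file:
    json.dump(records, file, indent=2)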
In summary, beyond mere preference, I'd say it depends on two main factors:
- Is your intermediate data potentially large? If so, try to use method 1.
- Does your file serialization method support iterative saving? If not, you'll need to use method 2.