Task:
Collect data from the site in the following format:
book; user; book_rating; comment_rating; publication_date; comment
A single book can have several pages of reviews at once.
Problem:
The site accepts only one request every 0.25 seconds, so asynchronous requests don't help.
Question:
Can data collection be accelerated?
Code structure:
Take a link to a book page from top_link.txt
→ get the links to that book's review pages (function score_link)
→ go through each review page in a loop (function score_user)
→ collect the data from every review (function score_user)
Code:
```python
import os

import requests
from fake_useragent import UserAgent
from selectolax.lexbor import LexborHTMLParser
from time import sleep
from tqdm import tqdm


# The function concatenates the current directory and the file name
def path_to_file(name):
    return os.path.join(os.path.dirname(__file__), name)


# Read links to sites from top_link.txt
with open(path_to_file('top_link.txt'), 'r', encoding="utf-8") as f:
    text = f.read()
book_id = [int(element.strip("'{}")) for element in text.split(", ")]
sites = [f"https://fantlab.ru/work{i}" for i in sorted(book_id)]

# Activate UserAgent
useragent = UserAgent()


# Get the html page and request response status.
def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    # Establish a permanent connection
    session = requests.Session()
    session.headers = headers
    adapter = requests.adapters.HTTPAdapter(pool_connections=100,
                                            pool_maxsize=100)
    session.mount('http://', adapter)
    resp = requests.get(url, headers=headers)
    html = resp.text
    return html, resp.status_code


# Get links to review pages
def score_link(html, url):
    tree = LexborHTMLParser(html)
    tree_users_list = tree.css_first(r'span.page-links')
    link_list = []
    # Users without this element have no reviews
    if tree_users_list is not None:
        tree_users = tree_users_list.css(r'a')
        for user in tree_users:
            # Link to comment page
            link = url + user.attributes['href']
            link_list.append(link)
        return link_list
    else:
        link_list.append(url)
        return link_list


# Get user feedback
def score_user(links):
    score_list = []
    # Follow links to review pages
    for url in links:
        html, status_code = get_html(url)
        tree = LexborHTMLParser(html)
        # Check server response
        if status_code == 200:
            score = tree.css("div.responses-list > div.response-item")
            if score is not None:
                # Go through reviews
                for user in score:
                    book_link = url.split('?')[0]
                    user_id = user.css_first(
                        r'p.response-autor-info>b>a').attributes['href']
                    book_rating = user.css_first(
                        r'div.clearfix>div.response-autor-mark>b>span').text()
                    comment_rating = user.css_first(
                        r'div.response-votetab>span:nth-of-type(2)').text()
                    data_score = user.css_first(
                        r'p.response-autor-info>span').attributes['content']
                    body_score = user.css_first(
                        r'div.response-body-home').text().replace('\n', ' ')
                    score_list.append(
                        f'{book_link};{user_id};{book_rating};{comment_rating};{data_score};{body_score}\n'
                    )
        elif status_code == 429:
            sleep(1)
            print('ERROE_429:', url)
        sleep(0.25)
    return score_list


with open(path_to_file("user.csv"), "a+", encoding='utf-8') as file:
    file.write(
        "book; user; book_rating; rating_rating; publication_date; comment \n"
    )
    for url in tqdm(sites):
        html, status_code = get_html(url)
        line = ''.join(score_user(score_link(html, url)))
        if line is not None:
            file.write(line)
sleep(0.5)
```
1 Answer
See my comment under your post for why your question can hardly be answered accurately right now.
What I like about your code:
- Your main loop is nice and short, well done!
- You have a couple of functions that do one thing -- good!
- You use libraries appropriately!
Here's some general advice.
- You import `os` to resolve the absolute path of `top_link.txt`, which is in the same directory as your script. Assuming the working directory is the one where your script resides, this is unnecessary. You can remove `path_to_file` and `import os`. This will speed up the startup of your script.
- You have function definitions and top-level code interleaved. Create a `main` function with an `if __name__ == "__main__"` guard. I often place that at the top of all the function definitions, below imports and constants. That way I see the entry point to your script easily.
- Since you ask for performance: you have a `sleep(0.5)` at the very bottom of the script, outside of the `for` loop above it. I cannot think of a reason why that would be necessary, except if you start your script by double-clicking and want to wait half a second between the main routine finishing and the window closing.
- You open `user.csv` in read+append mode. Then you write a header. A header in CSV only makes sense in the first line of the file, which would mean writing a new file; otherwise you duplicate the header. I would recommend checking whether the file exists with `pathlib.Path` (`os.path` works, but is not Pythonic). Also, you never actually read from the file, so `a` mode is fine.
- If you use `pathlib.Path`, you can simplify reading `top_link.txt` to one simple function call that comes with `Path`.
- The beginning of `main` is there to create `sites`. After this is generated, you don't need `text` and `book_id` anymore. That is a perfect candidate for its own function.
- `if line is not None:` — `line` can never be `None` here, since `"".join([])` is a `str` of length 0. So you probably want to check that you have a non-empty string, which you can write as just `if line:` (see the first snippet after this list).
- The same argument goes for `tree.css_first`. From the signature of that function it does not look like it could return `None`, but I'm not familiar enough with the library to be sure here.
- You have three different `sleep` times in your code, namely 0.25, 0.5, and 1, and you only explained where one of them comes from. These are "magic numbers", a well-known anti-pattern. Replace them with constants with good names.
- Your code had two ZWSPs (zero-width spaces) in it, which upset part of my IDE. I have removed them (the second snippet after this list shows one way to hunt such characters down).
- Now to the meat of the script: `score_user` has 5 levels of indentation. That is very hard to read, and often a sign of convoluted code. Especially with nested `if`s, you can reverse the logic of the conditions and `continue` the loop or `return` from the function early. That makes it clear what happens in those cases without having to read the body of the big loop. I refactored it so that any non-200 response is reported and backed off from first, followed by a `continue`; the rest of the loop body then only has to deal with the 200 case.
- Calculating the individual values of `score_list` looks fairly complex, and therefore deserves its own function.
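
To make the `str.join` point above concrete, here is a quick interpreter session; `str.join` always returns a string, never `None`, so a truthiness check is the idiomatic way to test for an empty result:

```python
>>> "".join([])               # joining an empty list still yields a str
''
>>> "".join([]) is not None   # so the original check is always True
True
>>> bool("".join([]))         # an empty string is falsy, so `if line:` works
False
```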
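
And for the ZWSP point: a small script finds such invisible characters faster than eyeballing does. This is only a minimal sketch; the file name `scraper.py` is a placeholder, not from the original post:

```python
from pathlib import Path

ZWSP = "\u200b"  # zero-width space

# Report the line and the column of the first zero-width space
# on each affected line of a source file.
source = Path("scraper.py").read_text(encoding="utf-8")
for lineno, line in enumerate(source.splitlines(), start=1):
    if ZWSP in line:
        print(f"line {lineno}, column {line.index(ZWSP) + 1}: zero-width space")
```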
If you apply these recommendations, the code looks like this.
```python
from pathlib import Path
from typing import List

import requests
from requests import adapters
from fake_useragent import UserAgent
from selectolax.lexbor import LexborHTMLParser
from time import sleep
from tqdm import tqdm

RATE_LIMIT_DELAY = 1
MINIMUM_DELAY_BETWEEN_REQUESTS = 0.25
USER_CSV_PATH = Path("user.csv")
TOP_LINK_PATH = Path("top_link.txt")
CSV_HEADER = "book; user; book_rating; rating_rating; publication_date; comment \n"

# Activate UserAgent
useragent = UserAgent()


def main():
    sites = generate_sites_from_file(TOP_LINK_PATH)
    if not USER_CSV_PATH.exists():
        create_file_with_header(USER_CSV_PATH)
    with open(USER_CSV_PATH, "a", encoding="utf-8") as file:
        for url in tqdm(sites):
            html, status_code = get_html(url)
            line = "".join(score_user(score_link(html, url)))
            if line:
                file.write(line)


# Get the html page and request response status.
def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    # Establish a permanent connection
    session = requests.Session()
    session.headers = headers
    adapter = adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
    session.mount("http://", adapter)
    resp = requests.get(url, headers=headers)
    html = resp.text
    return html, resp.status_code


# Get links to review pages
def score_link(html, url):
    tree = LexborHTMLParser(html)
    tree_users_list = tree.css_first(r"span.page-links")
    link_list = []
    # Users without this element have no reviews
    if tree_users_list is not None:
        tree_users = tree_users_list.css(r"a")
        for user in tree_users:
            # Link to comment page
            link = url + user.attributes["href"]
            link_list.append(link)
        return link_list
    else:
        link_list.append(url)
        return link_list


# Get user feedback
def score_user(links):
    score_list = []
    # Follow links to review pages
    for url in links:
        html, status_code = get_html(url)
        tree = LexborHTMLParser(html)
        # Check server response
        if status_code != 200:
            print(f"ERROR_{status_code}:{url}")
            sleep(RATE_LIMIT_DELAY)
            continue
        # now status_code must be 200
        score = tree.css("div.responses-list > div.response-item")
        if score is not None:
            # Go through reviews
            for user in score:
                score_list.append(generate_score_list_entry(url, user))
        sleep(MINIMUM_DELAY_BETWEEN_REQUESTS)
    return score_list


def generate_score_list_entry(url, user):
    book_link = url.split("?")[0]
    user_id = user.css_first(r"p.response-autor-info>b>a").attributes["href"]
    book_rating = user.css_first(r"div.clearfix>div.response-autor-mark>b>span").text()
    comment_rating = user.css_first(r"div.response-votetab>span:nth-of-type(2)").text()
    data_score = user.css_first(r"p.response-autor-info>span").attributes["content"]
    body_score = user.css_first(r"div.response-body-home").text().replace("\n", " ")
    value = f"{book_link};{user_id};{book_rating};{comment_rating};{data_score};{body_score}\n"
    return value


def create_file_with_header(path: Path) -> None:
    with open(path, "w") as fd:
        fd.write(CSV_HEADER)


def generate_sites_from_file(path: Path) -> List[str]:
    text = path.read_text()
    book_id = [int(element.strip("'{}")) for element in text.split(", ")]
    return [f"https://fantlab.ru/work{i}" for i in sorted(book_id)]


if __name__ == "__main__":
    main()
```
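
One thing this refactoring leaves unchanged from the original, but which is worth noting given the performance question: `get_html` creates a `requests.Session` and mounts a pooling adapter on it, yet then calls the module-level `requests.get`, so the session and its connection pool are never used. The adapter is also mounted for `http://`, while the scraped URLs use `https://`. Below is a minimal sketch of a variant that reuses one shared session, assuming the imports and `useragent` global from the script above; whether the connection reuse measurably speeds things up depends on the server:

```python
# One shared session so TCP connections are pooled and reused across requests.
SESSION = requests.Session()
SESSION.mount("https://", adapters.HTTPAdapter(pool_connections=100,
                                               pool_maxsize=100))


def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    resp = SESSION.get(url, headers=headers)
    return resp.text, resp.status_code
```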
- Thank you for the precise and quick answer, I picked up a lot of useful information. – Pom Mop, Aug 13, 2022 at 8:36
- `ERROE_429` should be `ERROR_429`, I assume. – AcK, Aug 13, 2022 at 9:53
- That's how it was written in the original post. I kept it as is. – sarema, Aug 13, 2022 at 9:59
- [...] `top_link.txt` is missing.