Task:
Collect data from the site in the following format:
book; user; book_rating; comment_rating; publication_date; comment
A single book can have several pages of reviews at once.
Problem:
The site accepts only one request every 0.25 seconds, so asynchronous requests don't help.
Question:
Can data collection be accelerated?
Code structure:
Take a link to a book page from top_link.txt
→ get the links to that book's review pages (function score_link)
→ go through each review page in a loop (function score_user)
→ collect the data from every review (function score_user)
Code:
```python
import os

import requests
from fake_useragent import UserAgent
from selectolax.lexbor import LexborHTMLParser
from time import sleep
from tqdm import tqdm


# The function concatenates the current directory and the file name
def path_to_file(name):
    return os.path.join(os.path.dirname(__file__), name)


# Read links to sites from top_link.txt
with open(path_to_file('top_link.txt'), 'r', encoding="utf-8") as f:
    text = f.read()
book_id = [int(element.strip("'{}")) for element in text.split(", ")]
sites = [f"https://fantlab.ru/work{i}" for i in sorted(book_id)]

# Activate UserAgent
useragent = UserAgent()


# Get the html page and request response status.
def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    # Establish a permanent connection
    session = requests.Session()
    session.headers = headers
    adapter = requests.adapters.HTTPAdapter(pool_connections=100,
                                            pool_maxsize=100)
    session.mount('http://', adapter)
    resp = requests.get(url, headers=headers)
    html = resp.text
    return html, resp.status_code


# Get links to review pages
def score_link(html, url):
    tree = LexborHTMLParser(html)
    tree_users_list = tree.css_first(r'span.page-links')
    link_list = []
    # Users without this element have no reviews
    if tree_users_list is not None:
        tree_users = tree_users_list.css(r'a')
        for user in tree_users:
            # Link to comment page
            link = url + user.attributes['href']
            link_list.append(link)
        return link_list
    else:
        link_list.append(url)
        return link_list


# Get user feedback
def score_user(links):
    score_list = []
    # Follow links to review pages
    for url in links:
        html, status_code = get_html(url)
        tree = LexborHTMLParser(html)
        # Check server response
        if status_code == 200:
            score = tree.css("div.responses-list > div.response-item")
            if score is not None:
                # Go through reviews
                for user in score:
                    book_link = url.split('?')[0]
                    user_id = user.css_first(
                        r'p.response-autor-info>b>a').attributes['href']
                    book_rating = user.css_first(
                        r'div.clearfix>div.response-autor-mark>b>span').text()
                    comment_rating = user.css_first(
                        r'div.response-votetab>span:nth-of-type(2)').text()
                    data_score = user.css_first(
                        r'p.response-autor-info>span').attributes['content']
                    body_score = user.css_first(
                        r'div.response-body-home').text().replace('\n', ' ')
                    score_list.append(
                        f'{book_link};{user_id};{book_rating};{comment_rating};{data_score};{body_score}\n'
                    )
        elif status_code == 429:
            sleep(1)
            print('ERROE_429:', url)
        sleep(0.25)
    return score_list


with open(path_to_file("user.csv"), "a+", encoding='utf-8') as file:
    file.write(
        "book; user; book_rating; rating_rating; publication_date; comment \n"
    )
    for url in tqdm(sites):
        html, status_code = get_html(url)
        line = ''.join(score_user(score_link(html, url)))
        if line is not None:
            file.write(line)
sleep(0.5)
```
1 Answer
See my comment under your post for why your question can hardly be answered accurately right now.
What I like about your code:
- Your main loop is nice and short, well done!
- You have a couple of functions that do one thing -- good!
- You use libraries appropriately!
Here's some general advice.
- You import `os` to resolve the absolute path of `top_link.txt`, which is in the same directory as your script. Assuming the working directory is the one where your script resides, this is unnecessary. You can remove `path_to_file` and `import os`. This will speed up the startup of your script.
- You have function definitions and top-level code interleaved. Create a `main` function with an `if __name__ == "__main__"` guard. I often place that at the top of all the function definitions, below imports and constants. That way I see the entry point to your script easily.
- Since you ask for performance: you have a `sleep(0.5)` at the very bottom of the script, outside of the `for` loop above it. I cannot think of a reason why that would be necessary, except if you start your script by double-clicking and want to wait half a second between the main routine finishing and the window closing.
- You open `user.csv` in read+append mode. Then you write a header. A header in CSV only makes sense in the first line of the file, which would mean writing a new file; otherwise you duplicate the header. I would recommend checking whether the file exists with `pathlib.Path` (`os.path` works, but is not Pythonic). Also, you never actually read from the file, so `a` mode is fine.
- If you use `pathlib.Path`, you can simplify reading `top_link.txt` to one simple function call that comes with `Path`.
- The beginning of `main` is there to create `sites`. After this is generated, you don't need `text` and `book_id` anymore. That is a perfect candidate for its own function.
- `if line is not None:` — `line` can never be `None` here, since `"".join([])` is a `str` of length 0. So you probably want to check that you have a non-empty string, which you can write as just `if line:` (see the first snippet after this list).
- The same argument goes for `tree.css_first`. From the signature of that function it does not look like it could return `None`, but I'm not familiar enough with the library to be sure here.
- You have three different `sleep` times in your code, namely 0.25, 0.5, and 1, and you only explained where one of them comes from. These are "magic numbers", a well-known anti-pattern. Replace them with constants with good names.
- Your code had two ZWSPs (zero-width spaces) in it, which upset part of my IDE. I have removed them (the second snippet after this list shows one way to hunt such characters down).
- Now to the meat of the script: `score_user` has 5 levels of indentation. That is very hard to read, and often a sign of convoluted code. Especially with nested `if`s, you can reverse the logic of the conditions and `continue` the loop or `return` from the function early. That makes it clear what happens in those cases without having to read the body of the big loop. I refactored it so that any non-200 response is reported and backed off from first, followed by a `continue`; the rest of the loop body then only has to deal with the 200 case.
- Calculating the individual values of `score_list` looks fairly complex, and therefore deserves its own function.
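
To make the `str.join` point above concrete, here is a quick interpreter session; `str.join` always returns a string, never `None`, so a truthiness check is the idiomatic way to test for an empty result:

```python
>>> "".join([])               # joining an empty list still yields a str
''
>>> "".join([]) is not None   # so the original check is always True
True
>>> bool("".join([]))         # an empty string is falsy, so `if line:` works
False
```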
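
And for the ZWSP point: a small script finds such invisible characters faster than eyeballing does. This is only a minimal sketch; the file name `scraper.py` is a placeholder, not from the original post:

```python
from pathlib import Path

ZWSP = "\u200b"  # zero-width space

# Report the line and the column of the first zero-width space
# on each affected line of a source file.
source = Path("scraper.py").read_text(encoding="utf-8")
for lineno, line in enumerate(source.splitlines(), start=1):
    if ZWSP in line:
        print(f"line {lineno}, column {line.index(ZWSP) + 1}: zero-width space")
```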
If you apply these recommendations, the code looks like this.
```python
from pathlib import Path
from typing import List

import requests
from requests import adapters
from fake_useragent import UserAgent
from selectolax.lexbor import LexborHTMLParser
from time import sleep
from tqdm import tqdm

RATE_LIMIT_DELAY = 1
MINIMUM_DELAY_BETWEEN_REQUESTS = 0.25
USER_CSV_PATH = Path("user.csv")
TOP_LINK_PATH = Path("top_link.txt")
CSV_HEADER = "book; user; book_rating; rating_rating; publication_date; comment \n"

# Activate UserAgent
useragent = UserAgent()


def main():
    sites = generate_sites_from_file(TOP_LINK_PATH)
    if not USER_CSV_PATH.exists():
        create_file_with_header(USER_CSV_PATH)
    with open(USER_CSV_PATH, "a", encoding="utf-8") as file:
        for url in tqdm(sites):
            html, status_code = get_html(url)
            line = "".join(score_user(score_link(html, url)))
            if line:
                file.write(line)


# Get the html page and request response status.
def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    # Establish a permanent connection
    session = requests.Session()
    session.headers = headers
    adapter = adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
    session.mount("http://", adapter)
    resp = requests.get(url, headers=headers)
    html = resp.text
    return html, resp.status_code


# Get links to review pages
def score_link(html, url):
    tree = LexborHTMLParser(html)
    tree_users_list = tree.css_first(r"span.page-links")
    link_list = []
    # Users without this element have no reviews
    if tree_users_list is not None:
        tree_users = tree_users_list.css(r"a")
        for user in tree_users:
            # Link to comment page
            link = url + user.attributes["href"]
            link_list.append(link)
        return link_list
    else:
        link_list.append(url)
        return link_list


# Get user feedback
def score_user(links):
    score_list = []
    # Follow links to review pages
    for url in links:
        html, status_code = get_html(url)
        tree = LexborHTMLParser(html)
        # Check server response
        if status_code != 200:
            print(f"ERROR_{status_code}:{url}")
            sleep(RATE_LIMIT_DELAY)
            continue
        # now status_code must be 200
        score = tree.css("div.responses-list > div.response-item")
        if score is not None:
            # Go through reviews
            for user in score:
                score_list.append(generate_score_list_entry(url, user))
        sleep(MINIMUM_DELAY_BETWEEN_REQUESTS)
    return score_list


def generate_score_list_entry(url, user):
    book_link = url.split("?")[0]
    user_id = user.css_first(r"p.response-autor-info>b>a").attributes["href"]
    book_rating = user.css_first(r"div.clearfix>div.response-autor-mark>b>span").text()
    comment_rating = user.css_first(r"div.response-votetab>span:nth-of-type(2)").text()
    data_score = user.css_first(r"p.response-autor-info>span").attributes["content"]
    body_score = user.css_first(r"div.response-body-home").text().replace("\n", " ")
    value = f"{book_link};{user_id};{book_rating};{comment_rating};{data_score};{body_score}\n"
    return value


def create_file_with_header(path: Path) -> None:
    with open(path, "w") as fd:
        fd.write(CSV_HEADER)


def generate_sites_from_file(path: Path) -> List[str]:
    text = path.read_text()
    book_id = [int(element.strip("'{}")) for element in text.split(", ")]
    return [f"https://fantlab.ru/work{i}" for i in sorted(book_id)]


if __name__ == "__main__":
    main()
```
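
One thing this refactoring leaves unchanged from the original, but which is worth noting given the performance question: `get_html` creates a `requests.Session` and mounts a pooling adapter on it, yet then calls the module-level `requests.get`, so the session and its connection pool are never used. The adapter is also mounted for `http://`, while the scraped URLs use `https://`. Below is a minimal sketch of a variant that reuses one shared session, assuming the imports and `useragent` global from the script above; whether the connection reuse measurably speeds things up depends on the server:

```python
# One shared session so TCP connections are pooled and reused across requests.
SESSION = requests.Session()
SESSION.mount("https://", adapters.HTTPAdapter(pool_connections=100,
                                               pool_maxsize=100))


def get_html(url):
    headers = {"Accept": "*/*", "User-Agent": useragent.random}
    resp = SESSION.get(url, headers=headers)
    return resp.text, resp.status_code
```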
- Thank you for the precise and quick answer, I picked up a lot of useful information. – Pom Mop, Aug 13, 2022 at 8:36
- `ERROE_429` should be `ERROR_429`, I assume. – AcK, Aug 13, 2022 at 9:53
- That's how it was written in the original post. I kept it as is. – sarema, Aug 13, 2022 at 9:59
- [...] `top_link.txt` is missing.