
I'm beginning with Python, so I wrote a program which is supposed to get the number of connected people on a forum (like this one: http://www.jeuxvideo.com/forums/0-51-0-1-0-1-0-blabla-18-25-ans.htm) and store that number with the datetime in a database file (SQLite3). Every forum has its own table.

My code is supposed to do this:

  1. Create an object for each forum we want to monitor, using the Forum class.
  2. Store these objects in a list so they can be used in a for loop.
  3. Fetch the web page (with requests) where the number of connected people is written in a span tag with the class "nb-connect-fofo", which looks like this: <span class="nb-connect-fofo">1799 connecté(s)</span>. I'm using BeautifulSoup to get the string and a regex to extract the number. This is done for every forum.
  4. Execute a SQLite3 query to store the datetime and the number in the table with the same name as the forum being retrieved.

Here's my code:

#!/usr/bin/python3
from bs4 import BeautifulSoup
from time import sleep
import sqlite3
import datetime
import requests
import re


class Forum:
    def __init__(self, forum, url_forum):  # Initialize each object with its name and URL
        self.forum = forum
        self.url_forum = url_forum
        pattern = '([0-9]{1,5})'
        self.pattern = re.compile(pattern)

    def add_to_database(self):  # Add the number of connected people and the datetime to the forum's own table
        connection = sqlite3.connect("database.db")
        c = connection.cursor()
        now = datetime.datetime.today()
        nb_co = self.recup_co()
        text = "INSERT INTO {0}(datetime, nb_co) VALUES('{1}', '{2}')".format(self.forum, now, nb_co)
        c.execute(text)
        connection.commit()
        connection.close()
        print(now, self.forum, str(nb_co))
        sleep(1)

    def recup_co(self):  # Retrieve the page and extract the number of connected people with a regex
        r = requests.get(self.url_forum)
        page_html = str(r.text)
        page = BeautifulSoup(page_html, 'html.parser')
        resultat = page.select(".nb-connect-fofo")
        nb_co = re.search(self.pattern, str(resultat))
        return nb_co.group(0)


def main():
    # All forums which are scanned are listed here
    dixhuit_vingtcinq = Forum("dixhuit_vingtcinq", "http://www.jeuxvideo.com/forums/0-51-0-1-0-1-0-blabla-18-25-ans.htm")
    moins_quinze = Forum("moins_quinze", "http://www.jeuxvideo.com/forums/0-15-0-1-0-1-0-blabla-moins-de-15-ans.htm")
    quinze_dixhuit = Forum("quinze_dixhuit", "http://www.jeuxvideo.com/forums/0-50-0-1-0-1-0-blabla-15-18-ans.htm")
    overwatch = Forum("overwatch", "http://www.jeuxvideo.com/forums/0-33972-0-1-0-1-0-overwatch.htm")
    # All forum objects are stored in a list so they can be looped over
    forums = [dixhuit_vingtcinq, moins_quinze, quinze_dixhuit, overwatch]
    while(True):
        for forum in forums:
            try:
                forum.add_to_database()
            except:
                print("An error occurred with the forum '{0}' at {1}".format(forum.forum, datetime.datetime.today()))
                sleep(5)
        sleep(60)

main()

I will use it later to make graphs and small statistics to improve my skills with Python. Maybe I will add more forums and expand my program to scrape the website and get every post on these forums (if I do, it will be much later).

So I'm asking you for some improvements/ideas. As a beginner, I have obviously made some mistakes that can be quite annoying, and I really want to improve.

Also, my code is running on one of my own servers. Would it be better to buy a cheap VPS for 2€ instead?

Thanks for reading, and thank you in advance.

PS: If there are mistakes in my post about the website, please tell me.

asked Jun 30, 2017 at 14:06

1 Answer


Code smells

  • your code is vulnerable to SQL injection attacks because you are using string formatting to put query parameters into the query. You need to properly parameterize your query with the help of the database driver:

    query = """
     INSERT INTO {table} (datetime, nb_co)
     VALUES(?, ?)
    """.format(table=self.forum)
    c.execute(query, (now, nb_co))
    

    Note that this way you also don't need to worry about Python-to-database type conversions and quotes inside parameters - it will all be handled by the database driver.
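
    As a quick standalone check of that point (the table and column names below are made up just for the demo), the driver adapts a datetime object and safely handles a quote inside a value:

        import datetime
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE demo (datetime TEXT, nb_co TEXT)")
        # The datetime is adapted to text and the embedded quote is escaped by the driver
        conn.execute("INSERT INTO demo (datetime, nb_co) VALUES (?, ?)",
                     (datetime.datetime.today(), "1'799"))
        print(conn.execute("SELECT * FROM demo").fetchall())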

Performance

  • instead of re-connecting to the database for every insert, think about connecting to the database once, processing all the data, and then closing the connection afterwards
  • the same idea applies to requests - you may initialize a Session() once and reuse it (both ideas are combined in the sketch after this list)
  • use lxml instead of html.parser as the underlying parser used by BeautifulSoup
  • you can use the SoupStrainer class to parse only the desired element, which will then allow you to simply get the text and split by whitespace instead of applying a regular expression:

    from bs4 import SoupStrainer  # import this alongside BeautifulSoup

    parse_only = SoupStrainer(class_="nb-connect-fofo")
    page = BeautifulSoup(page_html, 'lxml', parse_only=parse_only)
    return page.get_text().split()[0]
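
Putting the performance ideas together, here is a rough sketch of how the whole loop could look: one database connection and one requests.Session() for the entire run, plus the SoupStrainer-based parsing from above. It assumes lxml is installed and the tables already exist, and it drops the Forum class and error handling to keep the sketch short:

    import datetime
    import sqlite3
    from time import sleep

    import requests
    from bs4 import BeautifulSoup, SoupStrainer


    def recup_co(session, url):
        # Reuse the HTTP session and parse only the counter element
        page_html = session.get(url).text
        parse_only = SoupStrainer(class_="nb-connect-fofo")
        page = BeautifulSoup(page_html, 'lxml', parse_only=parse_only)
        return page.get_text().split()[0]


    def main():
        forums = {
            "dixhuit_vingtcinq": "http://www.jeuxvideo.com/forums/0-51-0-1-0-1-0-blabla-18-25-ans.htm",
            "moins_quinze": "http://www.jeuxvideo.com/forums/0-15-0-1-0-1-0-blabla-moins-de-15-ans.htm",
            "quinze_dixhuit": "http://www.jeuxvideo.com/forums/0-50-0-1-0-1-0-blabla-15-18-ans.htm",
            "overwatch": "http://www.jeuxvideo.com/forums/0-33972-0-1-0-1-0-overwatch.htm",
        }
        connection = sqlite3.connect("database.db")  # one connection for the whole run
        session = requests.Session()                 # one HTTP session, reused for every request
        try:
            while True:
                for name, url in forums.items():
                    nb_co = recup_co(session, url)
                    query = "INSERT INTO {table} (datetime, nb_co) VALUES (?, ?)".format(table=name)
                    connection.execute(query, (datetime.datetime.today(), nb_co))
                    connection.commit()
                sleep(60)
        finally:
            connection.close()


    main()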
    
answered Jun 30, 2017 at 17:16
