0

I am trying to scrape form field IDs using Beautiful Soup like this

 for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
 if link.has_key('id'):
 print link['id']

Lets us assume that it returns something like

username
email
password
passwordagain
terms
button_register

I would like to write this into Sqlite3 DB.

What I will be doing down the line in my application is... Use these form fields' IDs and try to do a POST may be. The problem is.. there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this...

Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on

What I am unsure here is... How should I design my column for this kind of purpose? Will it be OK if I just create a table with two columns - say

COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)

One thing to be remembered is... Down the line in my application I will need to do something like this...

Pseudocode

If Domain is "http://somedomain.com":
 For ever item in the COL2 (which is a list of form field ids):
 Assign some set of values to each of the form fields & then make a POST request

Can any one guide, please?

EDITed on 22/07/2011 - Is My Below Database Design Correct?

I have decided to have a solution like this. What do you guys think?

I will be having three tables like below

Table 1

Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT

Sample Data would be something like:

1 http://url1.com
2 http://url2.com
3 http://url3.com

Table 2

Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registeration link (as TEXT)
Form Fields (as Text)

Sample Data would be something like:

1 http://url1.com/register field1
1 http://url1.com/register field2
1 http://url1.com/register field3
2 http://url2.com/register field1
2 http://url2.com/register field2
2 http://url2.com/register field3
3 http://url3.com/register field1
3 http://url3.com/register field2
3 http://url3.com/register field3

Table 3

Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)

Sample Data would be something like:

1 Pass user1 pass1
2 Fail user2 pass2
3 Pass user3 pass3

Do you think this table design is good? Or are there any improvements that can be made?

agf
178k45 gold badges300 silver badges241 bronze badges
asked Jul 21, 2011 at 11:41

3 Answers 3

1

There is a normalization problem in your table.

Using 2 tables with

TABLE domains
int id primary key
text name
TABLE field_ids
int id primary key
int domain_id foreign key ref domains
text value

is a better solution.

answered Jul 21, 2011 at 12:10
Sign up to request clarification or add additional context in comments.

3 Comments

oh.. So you mean, in the first table domains I will have only the domain urls. While in the second table fields_ids I will have rows like domain1 - field1, domain1 - field2, domain - field3, .. domain1 - fieldn... likewise domain2 - field1, domain2 - field2, .. domain2... fieldn, etc. This sounds good too. Thanks!
@Bhavani-Kannan: You make a key for each domain. You pair these up in one database. Then you use those keys in the second database pair with each field, since it's way faster to look things up by a key than it is to look them up by just the domain. If you want to do it with just the domain name and no key, you only need one table, and you can take a look at my answer.
I have EDITed my question with what I have decided to do. So I am planning to do something like c.execute('CREATE TABLE IF NOT EXISTS base (ID INTEGER PRIMARY KEY, Domain TEXT, PR INTEGER)') c.execute('CREATE TABLE IF NOT EXISTS register (Domain INTEGER, Reglink TEXT, Fields TEXT, FOREIGN KEY(Domain) REFERENCES base(ID)') c.execute('CREATE TABLE IF NOT EXISTS regdata (Domain INTEGER, Status TEXT, Username TEXT, Password TEXT, FOREIGN KEY(Domain) REFERENCES base(ID)') Do you guys think this is an optimized way?
1

Proper database design would suggest you have a table of URLs, and a table of fields, each referenced to a URL record. But depending on what you want to do with them, you could pack lists into a single column. See the docs for how to go about that.

Is sqlite a requirement? It might not be the best way to store the data. E.g. if you need random-access lookups by URL, the shelve module might be a better bet. If you just need to record them and iterate over the sites, it might be simpler to store as CSV.

answered Jul 21, 2011 at 12:11

2 Comments

There is nothing like a requirement at all because I have no client. I am doing all this to simply the daily routines that I do in my daily life. So, it is nothing more than learning you can say. I will go through the links you have provided. Especially I wanted to use SQlite3 because I just spent couple of days learning the very basics of it. The Shelve Module is new to me. I will go through the documents. Thanks!
shelve always seems to be new to people. It's rather limited, but if all you need is a key-value store (like a dict on disk), it's easy to use.
0

Try this to get the ids:

ids = (link['id'] for link in
 BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')) 
 if link.has_key('id'))

And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It's the simplest solution, and perfectly adequate for a relatively small number of rows of data.

from itertools import izip, repeat
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
(domain text, linkid text)''')
domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
 values (?, ?)""", izip(repeat(domain_to_insert), ids))
conn.commit()
domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))
# this is just an example
def some_function_of_row(row):
 return row[1] + ' value'
fields = dict((row[1], some_function_of_row(row)) for row in c)
print fields
c.close()
answered Jul 21, 2011 at 12:13

6 Comments

Thanks for your reply. I am just trying to understand this code... generator object, izip are new to me. I am googling and learning as I try to execute & understand your code. Thanks!
You can just as easily use a for loop to get the ids, and another for loop to insert them (using for id in ids: c.execute("insert into domains values (?, ?)", (domain_to_insert, id)) if those are making things more complex and you want to focus on the database stuff.
hey, no.. nothing makes it complex. Like you already know, I am new to Python. Been just a week. So trying to gather & learn every little possible thing.. like using , instead of + in a print, etc. I just do not want a working piece of code but a robust piece of code. That's why trying to learn & understand your way of code as well. Really, thank you!
Sure I will. I just want to try out each of these suggestion provided here. Once I am finalized with a solution I will update the thread and also accept a solution. Thanks!
This solution consumes a lot of space because the domain is stored for each field id. See my solution how to normalize the database model.
|

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.