I am writing a program to extract data from http://www.gujarat.ngosindia.com/
I wrote the following code:
def split_line(text):
    words = text.split()
    i = 0
    details = ''
    while ((words[i] !='Contact')) and (i<len(words)):
        i=i+1
        if(words[i] == 'Contact:'):
            break
    while ((words[i] !='Purpose')) and (i<len(words)):
        if (words[i] == 'Purpose:'):
            break
        details = details+words[i]+' '
        i=i+1
    print(details)

def get_ngo_detail(ngo_url):
    html=urlopen(ngo_url).read()
    soup = BeautifulSoup(html)
    table = soup.find('table', {'class': 'border3'})
    td = soup.find('td', {'class': 'border'})
    split_line(td.text)

def get_ngo_names(gujrat_url):
    html = urlopen(gujrat_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('div',{'id':'mainbox'}):
        for text in link.find_all('a'):
            print(text.get_text())
            ngo_link = 'http://www.gujarat.ngosindia.com/'+text.get('href')
            get_ngo_detail(ngo_link)
            #NGO_name = text2.get_text())

a = get_ngo_names(BASE_URL)
print a
But when I run this script, I only get the NGO names and contact persons. I want the email, telephone number, website, purpose, and contact person.
- As a first step towards finding a solution, try throwing in a couple of print() calls to verify that the data is correct and what you expect in all instances. – Fredrik Pihl, Jan 28, 2014 at 11:36
- Or use pdb to step into the code. – Dmitriy Khaykin, Jan 28, 2014 at 15:21
2 Answers
Your split_line could be improved. Imagine you have this text:
s = """Add: 3rd Floor Khemha House
Drive in Road, Opp Drive in Cinema
Ahmedabad - 380 054
Gujarat
Tel: 91-79-7457611 , 79-7450378
Email: [email protected]
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises
Aim/Objective/Mission: To provide timely financing, management support and professional expertise ..."""
Now we can turn this into lines using s.split("\n") (split on each new line), giving a list where each item is a line:
lines = s.split("\n")
lines == ['Add: 3rd Floor Khemha House',
'Drive in Road, Opp Drive in Cinema',
...]
We can define a list of the elements we want to extract, and a dictionary to hold the results:
targets = ["Contact", "Purpose", "Email"]
results = {}
And work through each line, capturing the information we want:
for line in lines:
    l = line.split(":", 1)  # split on the first colon only, so values containing ":" stay whole
    if l[0] in targets:
        results[l[0]] = l[1]
This gives me:
results == {'Contact': ' Angha Mitra',
'Purpose': ' Economics and Finance, Micro-enterprises',
'Email': ' [email protected]'}
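The steps above can be put together in a single function. This is only a minimal sketch, assuming every field sits on its own line as in the sample text; `parse_details` and `sample` are hypothetical names, not part of the original code. It splits on the first colon only, so the URL after Website: keeps its own colon, and it strips the leading space from each value:

```python
def parse_details(text, targets):
    """Return a dict mapping each target label found in text to its value."""
    results = {}
    for line in text.splitlines():
        # partition() splits on the first colon only, so URLs stay intact.
        label, sep, value = line.partition(":")
        # sep is empty when the line has no colon at all (e.g. address lines).
        if sep and label.strip() in targets:
            results[label.strip()] = value.strip()
    return results


# Hypothetical sample, shortened from the text above.
sample = """Add: 3rd Floor Khemha House
Ahmedabad - 380 054
Tel: 91-79-7457611 , 79-7450378
Website: http://www.aavishkaar.org
Contact: Angha Mitra
Purpose: Economics and Finance, Micro-enterprises"""

details = parse_details(sample, ["Tel", "Website", "Contact", "Purpose"])
print(details)
```

In the question's code, this could be called with td.text in place of split_line. Note that a value spread over several lines, such as the address, would only yield its first line with this approach.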
Put some prints in to find out what you're getting and how to process it. Try to split the contents of the NGO's page better: you can give re.split a regular expression to split by (the plain string "split" method does not accept one), e.g. "[Contact]+[Email]+[telephone number]+[website]+[purpose]+[contact person]".
My regular expression is probably wrong (square brackets define character classes, not literal words), but this is the direction you should head in.
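A rough sketch of the regular-expression direction, using Python's re module. It assumes each label starts a line and is followed by a colon; the label list `LABELS` and the names `FIELD_RE` and `extract_fields` are hypothetical. It uses an alternation such as (Tel|Email|...) to match whole words, because a bracketed group like [Contact] matches single characters rather than the word:

```python
import re

# Hypothetical label list; adjust it to the labels that appear on the page.
LABELS = ["Tel", "Email", "Website", "Contact", "Purpose"]

# Matches "<label>:<value>" where the label starts a line.
# [ \t]* skips spaces after the colon without running onto the next line.
FIELD_RE = re.compile(r"^(%s):[ \t]*(.*)$" % "|".join(LABELS), re.MULTILINE)


def extract_fields(text):
    """Return a dict of label -> value for every labelled line in text."""
    return {label: value.strip() for label, value in FIELD_RE.findall(text)}


# Hypothetical sample in the same layout as the NGO detail text.
sample = (
    "Tel: 91-79-7457611\n"
    "Website: http://www.aavishkaar.org\n"
    "Contact: Angha Mitra\n"
    "Purpose: Economics and Finance"
)

fields = extract_fields(sample)
print(fields)
```

If the same label can occur more than once in the text, the dict keeps only the last value; collecting the findall() pairs in a list would preserve all of them.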