I'm new to Python and would like some advice or guidance moving forward. I'm trying to parse Wikipedia data into something uniform that I can put into a database. I've looked at wiki parsers, but from what I can see they are large and complex and don't get me much, as I don't need 99% of their functionality (I'm not editing or anything of that ilk). What I am doing is reading some information from template variables and trying to clean it up into usable information. I've already created a simple function to read a wiki template and return a dictionary of key/values. These key/values are what I'm reading and trying to parse into useful data.
For example, I'm trying to parse the Infobox settlement template to create the following information:
{'CITY': u'Portland',
'COUNTRY': u'United States of America',
'ESTABLISHED_DATE': u'',
'LATITUDE': 40.43388888888889,
'LONGITUDE': -84.98,
'REGION': u'Indiana',
'WIKI': u'Portland, Indiana'}
The only item not taken directly from the template is the WIKI entry; this is the title of the wiki page the template is from. The raw template that the above is produced from is:
{{Infobox settlement
|official_name = Portland, Indiana
|native_name =
|settlement_type = [[City]]
|nickname =
|motto =
|image_skyline = BlueBridge.jpg
|imagesize = 250px
|image_caption = Meridian (arch) Bridge in the fog
|image_flag =
|image_seal =
|image_map = Jay_County_Indiana_Incorporated_and_Unincorporated_areas_Portland_Highlighted.svg
|mapsize = 250px
|map_caption = Location in the state of [[Indiana]]
|image_map1 =
|mapsize1 =
|map_caption1 =
|coordinates_display = inline,title
|coordinates_region = US-IN
|subdivision_type = [[List of countries|Country]]
|subdivision_name = [[United States]]
|subdivision_type1 = [[Political divisions of the United States|State]]
|subdivision_name1 = [[Indiana]]
|subdivision_type2 = [[List of counties in Indiana|County]]
|subdivision_name2 = [[Jay County, Indiana|Jay]]
|government_type =
|leader_title = [[Mayor]]
|leader_name = Randy Geesaman ([[Democratic Party (United States)|D]])
|leader_title1 = <!-- for places with, say, both a mayor and a city manager -->
|leader_name1 =
|leader_title2 =
|leader_name2 =
|leader_title3 =
|leader_name3 =
|established_title = <!-- Settled -->
|established_date =
|established_title2 = <!-- Incorporated (town) -->
|established_date2 =
|established_title3 = <!-- Incorporated (city) -->
|established_date3 =
|area_magnitude = 1 E7
|area_total_sq_mi = 4.65
|area_land_sq_mi = 4.65
|area_water_sq_mi = 0.00
|area_water_percent = 0
|area_urban_sq_mi =
|area_metro_sq_mi =
|population_as_of = 2010
|population_note =
|population_total = 6223
|population_density_km2 = 604.7
|population_density_sq_mi = 1566.8
|population_metro =
|population_density_metro_km2 =
|population_density_metro_sq_mi =
|population_urban =
|timezone = [[North American Eastern Time Zone|EST]]
|utc_offset = -5
|timezone_DST = [[North American Eastern Time Zone|EDT]]
|utc_offset_DST = -4
|latd = 40 |latm = 26 |lats = 2 |latNS = N
|longd = 84 |longm = 58 |longs = 48 |longEW = W
|elevation_m = 277
|elevation_ft = 909
|postal_code_type = [[ZIP code]]
|postal_code = 47371
|website = http://www.thecityofportland.net
|area_code = [[Area code 260|260]]
|blank_name = [[Federal Information Processing Standard|FIPS code]]
|blank_info = 18-61236{{GR|2}}
|blank1_name = [[Geographic Names Information System|GNIS]] feature ID
|blank1_info = 0441471{{GR|3}}
|footnotes =
}}
To get the results, I first get the template information into a dictionary with this function. Its sole job is to find a template and to return a dictionary of key/value information for me to use:
"""Find a template"""
def __getTemplate(self, name, input=""):
    if (input == ""):
        input = self.rawPage
    startIndex = input.lower().find("{{" + name.lower()) + 2 + len(name)
    length = len(input)
    braces = 0
    result = ""
    for i in range(startIndex, length):
        c = input[i]
        if (c == "{"):
            braces += 1
        elif (c == "}" and braces > 0):
            braces -= 1
        elif (c == "["):
            braces += 1
        elif (c == "]" and braces > 0):
            braces -= 1
        elif (c == "<" and input[i+1] == "!"):
            braces += 1
        elif (c == ">" and braces > 0 and input[i-1] == "-"):
            braces -= 1
        elif (c == "}" and braces == 0):
            result = result.strip()
            parts = result.split("|")
            dict = {}
            counter = 0
            for part in parts:
                part = part.strip()
                kv = part.split("=")
                key = kv[0].strip()
                if (len(key) > 0):
                    val = ""
                    if (len(kv) > 1):
                        val = kv[1].strip().replace("%!%!%", "|").replace("%@%@%", "=")
                    else:
                        val = key;
                        key = counter
                        counter += 1
                    dict[key] = val
            return dict
        elif (c == "|" and braces > 0):
            c = "%!%!%"
        elif (c == "=" and braces > 0):
            c = "%@%@%"
        result += c
It seems to work well enough - it's returning what I'm expecting. Any suggestions are welcome, like I said I'm new to Python and I'm sure there are better ways to do what I'm doing.
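For comparison, the placeholder substitution (%!%!% and %@%@%) could be avoided by splitting only on separators that sit outside any nested {{ }}, [[ ]] or <!-- --> span. The following is a minimal sketch of that idea, not the function above: the names split_top_level and parse_fields are hypothetical, and it assumes the text passed in is the template body with the outer {{ and }} already removed.

```python
def split_top_level(text, sep="|"):
    """Split text on sep, ignoring separators nested in {{ }}, [[ ]] or <!-- -->."""
    parts, current = [], []
    depth, in_comment, i = 0, False, 0
    while i < len(text):
        if in_comment:
            if text.startswith("-->", i):
                in_comment = False
                current.append("-->")
                i += 3
                continue
        elif text.startswith("<!--", i):
            in_comment = True
            current.append("<!--")
            i += 4
            continue
        elif text.startswith(("{{", "[["), i):
            depth += 1
            current.append(text[i:i + 2])
            i += 2
            continue
        elif text.startswith(("}}", "]]"), i) and depth > 0:
            depth -= 1
            current.append(text[i:i + 2])
            i += 2
            continue
        elif text[i] == sep and depth == 0:
            # A separator at nesting depth zero ends the current piece.
            parts.append("".join(current))
            current = []
            i += 1
            continue
        current.append(text[i])
        i += 1
    parts.append("".join(current))
    return parts


def parse_fields(body):
    """Map 'name | k = v | ...' to a dict; unnamed fields get integer keys."""
    fields, positional = {}, 0
    for part in split_top_level(body, "|")[1:]:  # first piece is the template name
        pieces = split_top_level(part, "=")
        if len(pieces) > 1:
            key = pieces[0].strip()
            if key:
                # Re-join so a top-level "=" inside the value (e.g. a URL) survives.
                fields[key] = "=".join(pieces[1:]).strip()
        elif part.strip():
            fields[positional] = part.strip()
            positional += 1
    return fields


if __name__ == "__main__":
    sample = "Infobox settlement\n|subdivision_name2 = [[Jay County, Indiana|Jay]]\n|latd = 40 |latm = 26"
    print(parse_fields(sample))
```

Tracking depth on the two-character tokens rather than on single braces keeps a stray "[" or "}" in free text from throwing the count off, at the cost of a slightly longer loop.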
The result of this function is a dictionary that represents the template; the values are still a mess of free-form text. I pass the dictionary to the next function. This is the one I would like most of the advice on: it's a mess and needs to be cleaned up. The ifs and the replaces are all over the place. Before I go and clean that up, I was hoping to get any advice or suggestions on other, more Pythonic ways of doing things.
"""Parse the Infobox settlement template"""
def __parseInfoboxSettlement(self):
    values = self.__getTemplate("Infobox settlement")
    settlement = {}
    settlement['WIKI'] = self.title
    if values == None:
        return settlement
    # Get the settlement established date
    if 'established_date' in values:
        if 'established_date2' not in values:
            settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date'])
        else:
            if len(values['established_date']) > len(values['established_date2']):
                settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date'])
            else:
                settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date2'])
        if len(settlement['ESTABLISHED_DATE']) == 4:
            settlement['ESTABLISHED_YEAR'] = settlement['ESTABLISHED_DATE']
            settlement['ESTABLISHED_DATE'] = u""
        else:
            settlement['ESTABLISHED_YEAR'] = settlement['ESTABLISHED_DATE'].split("-")[0]
    # Get the settlement latitude
    try:
        deg = 0.0
        min = 0.0
        sec = 0.0
        if 'latd' in values:
            match = re.findall("([0-9]*)", values['latd'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'lat_d' in values:
            match = re.findall("([0-9]*)", values['lat_d'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'latm' in values:
            match = re.findall("([0-9]*)", values['latm'])[0]
            if len(match) > 0:
                min = float(match)
        if 'lat_m' in values:
            match = re.findall("([0-9]*)", values['lat_m'])[0]
            if len(match) > 0:
                min = float(match)
        if 'lats' in values:
            match = re.findall("([0-9]*)", values['lats'])[0]
            if len(match) > 0:
                sec = float(match)
        if 'lat_s' in values:
            match = re.findall("([0-9]*)", values['lat_s'])[0]
            if len(match) > 0:
                sec = float(match)
        lat = deg + min/60 + sec/3600
        if 'latNS' in values:
            if values['latNS'].lower() == "s":
                lat = 0 - lat
        if 'lat_NS' in values:
            if values['lat_NS'].lower() == "s":
                lat = 0 - lat
        settlement['LATITUDE'] = lat
    except:
        pass
    # get the settlement longitude
    try:
        deg = 0.0
        min = 0.0
        sec = 0.0
        if 'longd' in values:
            match = re.findall("([0-9]*)", values['longd'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'long_d' in values:
            match = re.findall("([0-9]*)", values['long_d'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'longm' in values:
            match = re.findall("([0-9]*)", values['longm'])[0]
            if len(match) > 0:
                min = float(match)
        if 'long_m' in values:
            match = re.findall("([0-9]*)", values['long_m'])[0]
            if len(match) > 0:
                min = float(match)
        if 'longs' in values:
            match = re.findall("([0-9]*)", values['longs'])[0]
            if len(match) > 0:
                sec = float(match)
        if 'long_s' in values:
            match = re.findall("([0-9]*)", values['long_s'])[0]
            if len(match) > 0:
                sec = float(match)
        long = deg + min/60 + sec/3600
        if 'longEW' in values:
            if values['longEW'].lower() == "w":
                long = 0 - long
        if 'long_EW' in values:
            if values['long_EW'].lower() == "w":
                long = 0 - long
        settlement['LONGITUDE'] = long
    except:
        pass
    # Figure out the country and region
    settlement['COUNTRY'] = u""
    settlement['REGION'] = u""
    count = 0
    num = u""
    while True:
        name = u""
        type = u""
        if 'subdivision_name' + num in values:
            name = values['subdivision_name' + num].replace("[[", "").replace("]]","")
            name = name.replace("{{flag|","").replace("{{","").replace("}}","").strip()
            name = name.replace(u"flagicon",u"").strip()
        if 'subdivision_type' + num in values:
            type = values['subdivision_type' + num].strip().lower()
        # Catch most issues
        if u"|" in name:
            parts = name.split("|")
            first = True
            for part in parts:
                if len(part) > 0 and u"name=" not in part and not first:
                    name = part
                first = False
        # Deal with name= things the above missed
        if u"|" in name:
            parts = name.split("|")
            for part in parts:
                if len(part) > 0 and u"name=" not in part:
                    name = part
        # Double name
        parts = name.split(" ")
        if len(parts) == 2:
            if parts[0].lower() == parts[1].lower():
                name = parts[0]
        if u"country" in type and u"historic" not in type:
            if u"united states" in name.lower() or name.lower() == u"usa":
                settlement['COUNTRY'] = u"United States of America"
            elif name.lower() == u"can":
                settlement['COUNTRY'] = u"Canada"
            else:
                settlement['COUNTRY'] = name
        elif (u"state" in type and u"counties" not in type and u"county" not in type and u"|region" not in type) or u"federal district" in type:
            # US State
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
        elif (u"canada|province" in type):
            # Canada
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
            settlement['REGION'] = settlement['REGION'].replace(u"QC", u"Québec")
        elif (u"|region" in type and settlement['REGION'] == u""):
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
        elif type != u"":
            self.__log("XXX subdivision_type: " + type)
        count += 1
        num = str(count)
        if type == u"":
            break
    # Cleanup the city name
    settlement['CITY'] = u""
    name = u""
    if 'official_name' in values:
        name = values['official_name'].replace(u"[[", u"").replace(u"]]",u"").replace(u"{{flag|",u"").replace(u"}}",u"").strip()
    if 'name' in values and name == u"":
        name = values['name'].replace(u"[[", u"").replace(u"]]",u"").replace(u"{{flag|",u"").replace(u"}}",u"").strip()
    if name != u"":
        name = name.replace(u"The City of ",u"").replace(u"City of ",u"").replace(u"Town of ",u"").replace(u"Ville de ", u"").replace(u" Township", "")
        if u"{{flagicon" in name:
            parts = name.split("}}")
            name = parts[1].strip()
        if u"<ref>" in name:
            parts = name.split("<ref>")
            if len(parts) > 0:
                name = parts[0].strip()
        if u"<br />" in name:
            parts = name.split("<br />")
            for part in parts:
                if u"img" not in part and len(part) > 1:
                    name = part
        if u"," in name:
            parts = name.split(",");
            if len(parts) > 1:
                if parts[1].strip().lower() in settlement['REGION'].lower():
                    if parts[0].strip() not in settlement['REGION'].lower():
                        settlement['CITY'] = parts[0].strip()
                elif parts[1].strip() not in settlement['REGION'].lower():
                    settlement['CITY'] = parts[1].strip()
        elif name.lower() not in settlement['REGION'].lower():
            settlement['CITY'] = name
    else:
        self.__log("XXX settlement_type: " + type)
    # Set the results
    self.locationData.append(settlement)
I know this code is a mess, and there are several blocks of code marked by a comment line explaining what they do. The rest is based on trial and error looking over 30-40 pages in the wiki. This works 99% of the time from what I have tried so far. I'm going through a list of 100-200 locations and just haven't hit many of the international locations yet.
Besides this code being a mess, is there anything I should be doing more Python-esque that I'm not? Besides cleaning things up, are there any other suggestions you have on how to go about doing what I'm doing?
- Just saying, a wiki parser is too complex, but a 200-line mega-function isn't? Sounds like you are reinventing the wheel here. Besides that, this would be more suited to Code Review, as there is no specific question. – Lattyware, Oct 8, 2012 at 22:37
- @Lattyware - The wiki parser wouldn't do what this function is doing anyway. All it would do is give me the dictionary of key/values that my first function gives me. It may save me some [[ and ]] replaces, but I don't think that's worth moving to a wiki parser. Maybe I'm wrong. Also, most wiki parsers I've found haven't been maintained in a long time. – Justin808, Oct 8, 2012 at 22:40
- You might be better off using an HTML/DOM parser and parsing the generated HTML rather than trying to parse the wikitext. – ernie, Oct 8, 2012 at 23:20
- @ernie - Wikipedia doesn't like you doing that. – Justin808, Oct 8, 2012 at 23:38
- They don't like you parsing their HTML, but they're okay with you parsing the wikitext? I'd imagine they don't like any scraping at all? – ernie, Oct 8, 2012 at 23:41
1 Answer
Wikipedia pages have this great comment line, <!-- Infobox begins -->, to tell you where infoboxes start and end. Use that to find the information in the infobox.
You end up with a string; let's call it infobox.
# List Comprehension over infobox to return values
info = [j.split("=") for j in [i for i in infobox.split('|')]][1:]
# And, here's your dict:
wikidict = {}
for i in info:
    try:
        # stripping here is best
        wikidict[i[0].strip()] = i[1].strip()
    except IndexError:
        pass  # if there's no data, there's no i[1], and an IndexError is thrown
That said, the template values are just that: template values. If you want the latitude, you don't need to code anything complex; the dictionary keys are already there.
latkeys = "latd latm lats latNS".split()
lat_info = [wikidict[i] for i in latkeys]
You could easily do a quick transformation over the lat_info to get things into the format you want.
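One possible shape for that transformation, as a sketch rather than a definitive implementation: the helper name dms_to_decimal is hypothetical, and it assumes every field is either empty or a plain number, which real infoboxes do not always guarantee.

```python
def dms_to_decimal(degrees, minutes=0, seconds=0, hemisphere="N"):
    """Convert degrees/minutes/seconds strings to signed decimal degrees."""
    # "x or 0" treats an empty template value as zero.
    value = float(degrees or 0) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative by convention.
    return -value if hemisphere.strip().upper() in ("S", "W") else value


# Values taken from the Portland, Indiana infobox in the question.
wikidict = {"latd": "40", "latm": "26", "lats": "2", "latNS": "N"}
latkeys = "latd latm lats latNS".split()
lat_info = [wikidict[i] for i in latkeys]
latitude = dms_to_decimal(*lat_info)
```

The same helper covers longitude when it is passed the longd, longm, longs and longEW values.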
You should also probably write a separate function that strips the [[x|y]] from certain elements and returns a tuple if you're interested in manipulating those. As it stands, your code is nearly impossible to read. You don't need to push strings through complex logic gates like the ones you have; keep the logic to a bare minimum. You know, the Keep It Simple, Stupid rule.
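A sketch of what such a function might look like, using the standard re module. The names WIKILINK, split_wikilink and strip_wikilinks are made up for illustration, and the pattern only handles simple, non-nested links.

```python
import re

# [[target]] or [[target|label]]; nested links are deliberately not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def split_wikilink(value):
    """Return (target, label) for the first wikilink in value.

    If value contains no link, both items are the stripped value itself.
    """
    match = WIKILINK.search(value)
    if match is None:
        text = value.strip()
        return text, text
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    return target, label


def strip_wikilinks(value):
    """Replace every wikilink in value with its visible label."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), value)
```

Returning both halves leaves the caller free to choose: the target ("Jay County, Indiana") is usually better for lookups, while the label ("Jay") is what a reader of the page sees.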
- The problem I'm running into is that latd may or may not contain a number. It might be in lat_d, for that matter. So once I find the d, m and s (depending on which exist or not), I do the math to get a latitude. It's just hard, as there is no rhyme or reason to some fields. – Justin808, Oct 8, 2012 at 23:37
- info = [j.split("=") for j in [i for i in infobox.split('|')]][1:] looks a lot simpler than what I have to read a dictionary, but how does it handle a = or a | being in the value of a key? I started off with KISS, but each time I saw an issue, another if or replace went in to fix it. – Justin808, Oct 8, 2012 at 23:38
- Here's the beauty of this solution. You don't actually need the stuff behind the | in the value. For example, if you see |subdivision_type2 = [[List of counties in Indiana|County]], the next line is |subdivision_name2 = [[Jay County, Indiana|Jay]], and you throw away the first data piece, and you return '[[Jay County, Indiana' for wikidict['subdivision_name2']. – kreativitea, Oct 9, 2012 at 0:12
- @Justin808 and in either case, the lats have lat at the beginning of the name, right? latkeys = [i for i in wikidict.keys() if all(('lat' in i, 'population' not in i))] (population has 'lat' in it) – kreativitea, Oct 9, 2012 at 0:19
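The latd versus lat_d concern raised in these comments can also be handled by trying each known spelling in turn and taking the first one that holds a number. This is only a sketch: first_number and latitude_from are hypothetical names, the key spellings are the ones the question mentions, and unlike the question's regex this one also accepts decimal values.

```python
import re


def first_number(values, *keys):
    """Return the first number found under any of keys, or 0.0 if none has one."""
    for key in keys:
        match = re.search(r"\d+(?:\.\d+)?", values.get(key, ""))
        if match:
            return float(match.group())
    return 0.0


def latitude_from(values):
    """Build a signed decimal latitude from whichever key spellings are present."""
    degrees = first_number(values, "latd", "lat_d")
    minutes = first_number(values, "latm", "lat_m")
    seconds = first_number(values, "lats", "lat_s")
    hemisphere = (values.get("latNS") or values.get("lat_NS") or "N").strip().upper()
    latitude = degrees + minutes / 60 + seconds / 3600
    return -latitude if hemisphere == "S" else latitude
```

Listing the accepted spellings explicitly avoids the false positives of a substring test on 'lat', such as the population keys mentioned above.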