\$\begingroup\$

I'm new to Python and would like some advice or guidance moving forward. I'm trying to parse Wikipedia data into something uniform that I can put into a database. I've looked at wiki parsers, but from what I can see they are large and complex and don't get me much, as I don't need 99% of their functionality (I'm not editing or anything of that ilk). What I am doing is reading some information from template variables and trying to clean it up into usable information. I've already created a simple function to read a wiki template and return a dictionary of key/values. These key/values are what I'm reading and trying to parse into useful data.

For example, I'm trying to parse the Infobox settlement template to create the following information:

{'CITY': u'Portland',
 'COUNTRY': u'United States of America',
 'ESTABLISHED_DATE': u'',
 'LATITUDE': 40.43388888888889,
 'LONGITUDE': -84.98,
 'REGION': u'Indiana',
 'WIKI': u'Portland, Indiana'}

The only item not taken directly from the template is the WIKI entry; this is the title of the wiki page the template is from. The raw template that the above is produced from is:

{{Infobox settlement
|official_name = Portland, Indiana
|native_name = 
|settlement_type = [[City]]
|nickname = 
|motto = 
|image_skyline = BlueBridge.jpg
|imagesize = 250px
|image_caption = Meridian (arch) Bridge in the fog
|image_flag = 
|image_seal = 
|image_map = Jay_County_Indiana_Incorporated_and_Unincorporated_areas_Portland_Highlighted.svg
|mapsize = 250px
|map_caption = Location in the state of [[Indiana]]
|image_map1 = 
|mapsize1 = 
|map_caption1 = 
|coordinates_display = inline,title
|coordinates_region = US-IN
|subdivision_type = [[List of countries|Country]]
|subdivision_name = [[United States]]
|subdivision_type1 = [[Political divisions of the United States|State]]
|subdivision_name1 = [[Indiana]]
|subdivision_type2 = [[List of counties in Indiana|County]]
|subdivision_name2 = [[Jay County, Indiana|Jay]]
|government_type = 
|leader_title = [[Mayor]]
|leader_name = Randy Geesaman ([[Democratic Party (United States)|D]])
|leader_title1 = <!-- for places with, say, both a mayor and a city manager -->
|leader_name1 = 
|leader_title2 = 
|leader_name2 = 
|leader_title3 = 
|leader_name3 = 
|established_title = <!-- Settled -->
|established_date = 
|established_title2 = <!-- Incorporated (town) -->
|established_date2 = 
|established_title3 = <!-- Incorporated (city) -->
|established_date3 = 
|area_magnitude = 1 E7
|area_total_sq_mi = 4.65
|area_land_sq_mi = 4.65
|area_water_sq_mi = 0.00
|area_water_percent = 0
|area_urban_sq_mi = 
|area_metro_sq_mi = 
|population_as_of = 2010
|population_note = 
|population_total = 6223
|population_density_km2 = 604.7
|population_density_sq_mi = 1566.8
|population_metro = 
|population_density_metro_km2 = 
|population_density_metro_sq_mi = 
|population_urban = 
|timezone = [[North American Eastern Time Zone|EST]]
|utc_offset = -5
|timezone_DST = [[North American Eastern Time Zone|EDT]]
|utc_offset_DST = -4
|latd = 40 |latm = 26 |lats = 2 |latNS = N
|longd = 84 |longm = 58 |longs = 48 |longEW = W
|elevation_m = 277
|elevation_ft = 909
|postal_code_type = [[ZIP code]]
|postal_code = 47371
|website = http://www.thecityofportland.net
|area_code = [[Area code 260|260]]
|blank_name = [[Federal Information Processing Standard|FIPS code]]
|blank_info = 18-61236{{GR|2}}
|blank1_name = [[Geographic Names Information System|GNIS]] feature ID
|blank1_info = 0441471{{GR|3}}
|footnotes = 
}}

To get the results, I first get the template information into a dictionary with this function. Its sole job is to find a template and to return a dictionary of key/value information for me to use:

"""Find a template"""
def __getTemplate(self, name, input=""):
    if (input == ""):
        input = self.rawPage
    startIndex = input.lower().find("{{" + name.lower()) + 2 + len(name)
    length = len(input)
    braces = 0
    result = ""
    for i in range(startIndex, length):
        c = input[i]
        if (c == "{"):
            braces += 1
        elif (c == "}" and braces > 0):
            braces -= 1
        elif (c == "["):
            braces += 1
        elif (c == "]" and braces > 0):
            braces -= 1
        elif (c == "<" and input[i+1] == "!"):
            braces += 1
        elif (c == ">" and braces > 0 and input[i-1] == "-"):
            braces -= 1
        elif (c == "}" and braces == 0):
            result = result.strip()
            parts = result.split("|")
            dict = {}
            counter = 0
            for part in parts:
                part = part.strip()
                kv = part.split("=")
                key = kv[0].strip()
                if (len(key) > 0):
                    val = ""
                    if (len(kv) > 1):
                        val = kv[1].strip().replace("%!%!%", "|").replace("%@%@%", "=")
                    else:
                        val = key;
                        key = counter
                        counter += 1
                    dict[key] = val
            return dict
        elif (c == "|" and braces > 0):
            c = "%!%!%"
        elif (c == "=" and braces > 0):
            c = "%@%@%"
        result += c

It seems to work well enough - it's returning what I'm expecting. Any suggestions are welcome; as I said, I'm new to Python and I'm sure there are better ways to do what I'm doing.

The result of this function is a dictionary that represents the template; the values are still a mess of free-form text. I pass the dictionary to the next function. This is the one I would like most of the advice on - it's a mess and needs to be cleaned up. The ifs and the replaces are all over the place. Before I go and clean that up, I was hoping to get any advice or suggestions on other, more Pythonic ways of doing things.

"""Parse the Infobox settlement template"""
def __parseInfoboxSettlement(self):
    values = self.__getTemplate("Infobox settlement")
    settlement = {}
    settlement['WIKI'] = self.title
    if values == None:
        return settlement
    # Get the settlement established date
    if 'established_date' in values:
        if 'established_date2' not in values:
            settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date'])
        else:
            if len(values['established_date']) > len(values['established_date2']):
                settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date'])
            else:
                settlement['ESTABLISHED_DATE'] = self.__parseDate(values['established_date2'])
        if len(settlement['ESTABLISHED_DATE']) == 4:
            settlement['ESTABLISHED_YEAR'] = settlement['ESTABLISHED_DATE']
            settlement['ESTABLISHED_DATE'] = u""
        else:
            settlement['ESTABLISHED_YEAR'] = settlement['ESTABLISHED_DATE'].split("-")[0]
    # Get the settlement latitude
    try:
        deg = 0.0
        min = 0.0
        sec = 0.0
        if 'latd' in values:
            match = re.findall("([0-9]*)", values['latd'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'lat_d' in values:
            match = re.findall("([0-9]*)", values['lat_d'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'latm' in values:
            match = re.findall("([0-9]*)", values['latm'])[0]
            if len(match) > 0:
                min = float(match)
        if 'lat_m' in values:
            match = re.findall("([0-9]*)", values['lat_m'])[0]
            if len(match) > 0:
                min = float(match)
        if 'lats' in values:
            match = re.findall("([0-9]*)", values['lats'])[0]
            if len(match) > 0:
                sec = float(match)
        if 'lat_s' in values:
            match = re.findall("([0-9]*)", values['lat_s'])[0]
            if len(match) > 0:
                sec = float(match)
        lat = deg + min/60 + sec/3600
        if 'latNS' in values:
            if values['latNS'].lower() == "s":
                lat = 0 - lat
        if 'lat_NS' in values:
            if values['lat_NS'].lower() == "s":
                lat = 0 - lat
        settlement['LATITUDE'] = lat
    except:
        pass
    # get the settlement longitude
    try:
        deg = 0.0
        min = 0.0
        sec = 0.0
        if 'longd' in values:
            match = re.findall("([0-9]*)", values['longd'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'long_d' in values:
            match = re.findall("([0-9]*)", values['long_d'])[0]
            if len(match) > 0:
                deg = float(match)
        if 'longm' in values:
            match = re.findall("([0-9]*)", values['longm'])[0]
            if len(match) > 0:
                min = float(match)
        if 'long_m' in values:
            match = re.findall("([0-9]*)", values['long_m'])[0]
            if len(match) > 0:
                min = float(match)
        if 'longs' in values:
            match = re.findall("([0-9]*)", values['longs'])[0]
            if len(match) > 0:
                sec = float(match)
        if 'long_s' in values:
            match = re.findall("([0-9]*)", values['long_s'])[0]
            if len(match) > 0:
                sec = float(match)
        long = deg + min/60 + sec/3600
        if 'longEW' in values:
            if values['longEW'].lower() == "w":
                long = 0 - long
        if 'long_EW' in values:
            if values['long_EW'].lower() == "w":
                long = 0 - long
        settlement['LONGITUDE'] = long
    except:
        pass
    # Figure out the country and region
    settlement['COUNTRY'] = u""
    settlement['REGION'] = u""
    count = 0
    num = u""
    while True:
        name = u""
        type = u""
        if 'subdivision_name' + num in values:
            name = values['subdivision_name' + num].replace("[[", "").replace("]]","")
            name = name.replace("{{flag|","").replace("{{","").replace("}}","").strip()
            name = name.replace(u"flagicon",u"").strip()
        if 'subdivision_type' + num in values:
            type = values['subdivision_type' + num].strip().lower()
        # Catch most issues
        if u"|" in name:
            parts = name.split("|")
            first = True
            for part in parts:
                if len(part) > 0 and u"name=" not in part and not first:
                    name = part
                first = False
        # Nead with name= things the above missed
        if u"|" in name:
            parts = name.split("|")
            for part in parts:
                if len(part) > 0 and u"name=" not in part:
                    name = part
        # Double name
        parts = name.split(" ")
        if len(parts) == 2:
            if parts[0].lower() == parts[1].lower():
                name = parts[0]
        if u"country" in type and u"historic" not in type:
            if u"united states" in name.lower() or name.lower() == u"usa":
                settlement['COUNTRY'] = u"United States of America"
            elif name.lower() == u"can":
                settlement['COUNTRY'] = u"Canada"
            else:
                settlement['COUNTRY'] = name
        elif (u"state" in type and u"counties" not in type and u"county" not in type and u"|region" not in type) or u"federal district" in type:
            # US State
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
        elif (u"canada|province" in type):
            # Canada
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
            settlement['REGION'] = settlement['REGION'].replace(u"QC", u"Québec")
        elif (u"|region" in type and settlement['REGION'] == u""):
            settlement['REGION'] = name.replace("(region)","").replace("(state)","").strip()
        elif type != u"":
            self.__log("XXX subdivision_type: " + type)
        count += 1
        num = str(count)
        if type == u"":
            break
    # Cleanup the city name
    settlement['CITY'] = u""
    name = u""
    if 'official_name' in values:
        name = values['official_name'].replace(u"[[", u"").replace(u"]]",u"").replace(u"{{flag|",u"").replace(u"}}",u"").strip()
    if 'name' in values and name == u"":
        name = values['name'].replace(u"[[", u"").replace(u"]]",u"").replace(u"{{flag|",u"").replace(u"}}",u"").strip()
    if name != u"":
        name = name.replace(u"The City of ",u"").replace(u"City of ",u"").replace(u"Town of ",u"").replace(u"Ville de ", u"").replace(u" Township", "")
        if u"{{flagicon" in name:
            parts = name.split("}}")
            name = parts[1].strip()
        if u"<ref>" in name:
            parts = name.split("<ref>")
            if len(parts) > 0:
                name = parts[0].strip()
        if u"<br />" in name:
            parts = name.split("<br />")
            for part in parts:
                if u"img" not in part and len(part) > 1:
                    name = part
        if u"," in name:
            parts = name.split(",");
            if len(parts) > 1:
                if parts[1].strip().lower() in settlement['REGION'].lower():
                    if parts[0].strip() not in settlement['REGION'].lower():
                        settlement['CITY'] = parts[0].strip()
                elif parts[1].strip() not in settlement['REGION'].lower():
                    settlement['CITY'] = parts[1].strip()
        elif name.lower() not in settlement['REGION'].lower():
            settlement['CITY'] = name
    else:
        self.__log("XXX settlement_type: " + type)
    # Set the results
    self.locationData.append(settlement)

I know this code is a mess, and there are several blocks of code marked by a comment line explaining what they do. The rest is based on trial and error looking over 30-40 pages in the wiki. This works 99% of the time from what I have tried so far. I'm going through a list of 100-200 locations and just haven't hit many of the international locations yet.

Besides this code being a mess, is there anything I should be doing more Python-esque that I'm not? Besides cleaning things up, are there any other suggestions you have on how to go about doing what I'm doing?

Jamal
asked Oct 8, 2012 at 22:30
\$\endgroup\$
6
  • \$\begingroup\$ Just saying, a wiki parser is too complex, but a 200 line mega-function isn't? Sounds like you are reinventing the wheel here. Besides that, this would be more suited to Code Review, as there is no specific question. \$\endgroup\$ Commented Oct 8, 2012 at 22:37
  • \$\begingroup\$ @Lattyware - The wiki parser wouldn't do what this function is doing anyway. All it would do is give me a dictionary of key/values that my first function already gives me. It may save me some [[ and ]] replaces, but I don't think that's worth moving to a wiki parser. Maybe I'm wrong. Also, most wiki parsers I've found haven't been maintained in a long time. \$\endgroup\$ Commented Oct 8, 2012 at 22:40
  • \$\begingroup\$ You might be better off using an HTML/DOM parser, and parsing the generated HTML rather than trying to parse the wikitext \$\endgroup\$ Commented Oct 8, 2012 at 23:20
  • \$\begingroup\$ @ernie - Wikipedia doesn't like you doing that. \$\endgroup\$ Commented Oct 8, 2012 at 23:38
  • \$\begingroup\$ They don't like you parsing their HTML, but they're okay with you parsing the wikitext? I'd imagine they don't like any scraping at all? \$\endgroup\$ Commented Oct 8, 2012 at 23:41

1 Answer

2
\$\begingroup\$

Many Wikipedia pages have a handy comment line, <!-- Infobox begins -->, that tells you where infoboxes start and end. Use that to find the information in the infobox.

You end up with a string; let's call it infobox.

# List Comprehension over infobox to return values
info = [j.split("=") for j in [i for i in infobox.split('|')]][1:]
# And, here's your dict:
wikidict = {}
for i in info:
    try:
        # stripping here is best
        wikidict[i[0].strip()] = i[1].strip()
    except IndexError:
        pass # if there's no data, there's no i[1], and an IndexError is thrown
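
The plain split above cuts values at every | and =, so a value such as [[Jay County, Indiana|Jay]] loses everything after the pipe. If you want to keep such values whole, here is a minimal sketch of a depth-aware split. It assumes infobox holds the template body without the outer {{ and }}, it assumes Python 3, and the names split_top_level and parse_fields are made up for this illustration. HTML comments and unbalanced brackets are not handled.

```python
def split_top_level(text, sep="|"):
    """Split text on sep, ignoring separators nested in {{...}} or [[...]]."""
    parts, current, depth = [], [], 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            # Entering a nested template or link
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            # Leaving a nested template or link
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            # A separator at the top level ends the current field
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_fields(infobox):
    """Return {key: value} for the named parameters of a template body."""
    fields = {}
    for part in split_top_level(infobox)[1:]:  # skip the template name
        key, eq, value = part.partition("=")   # split at the first "=" only
        if eq:
            fields[key.strip()] = value.strip()
    return fields
```

With this, parse_fields(infobox)['subdivision_name2'] keeps the full [[Jay County, Indiana|Jay]] text, and a URL with = in its query string survives because only the first = in each field is treated as the key/value separator.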

That said, the template values are just that: template values. If you want the latitude, you don't need to code anything complex; the dictionary keys are already there.

latkeys = "latd latm lats latNS".split()
lat_info = [wikidict[i] for i in latkeys]

You could easily do a quick transformation over the lat_info to get things into the format you want.
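
As a rough sketch of that transformation, the degrees/minutes/seconds fields can be folded into one decimal value with a single helper used for both latitude and longitude. The names first_number and dms_to_decimal are hypothetical, and the code assumes Python 3 and a plain dict of strings. Unlike the code in the question, it takes the first key variant that contains a number (latd before lat_d) and accepts decimals, so treat those as assumptions to adjust.

```python
import re


def first_number(fields, *keys):
    """Return the first number found under any of keys, or 0.0 if none."""
    for key in keys:
        match = re.search(r"\d+(?:\.\d+)?", fields.get(key, ""))
        if match:
            return float(match.group())
    return 0.0


def dms_to_decimal(fields, prefix, suffix, negative):
    """Convert degree/minute/second fields to signed decimal degrees.

    prefix is 'lat' or 'long', suffix is 'NS' or 'EW', and negative is the
    hemisphere letter ('s' or 'w') that makes the result negative.
    """
    deg = first_number(fields, prefix + "d", prefix + "_d")
    minutes = first_number(fields, prefix + "m", prefix + "_m")
    seconds = first_number(fields, prefix + "s", prefix + "_s")
    value = deg + minutes / 60 + seconds / 3600
    # Accept both spellings of the hemisphere key, e.g. latNS and lat_NS
    hemisphere = fields.get(prefix + suffix) or fields.get(prefix + "_" + suffix) or ""
    return -value if hemisphere.strip().lower() == negative else value
```

For the Portland template in the question, dms_to_decimal(wikidict, "lat", "NS", "s") and dms_to_decimal(wikidict, "long", "EW", "w") should come out at roughly 40.4339 and -84.98, matching the LATITUDE and LONGITUDE values shown there.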

You should also probably write a separate function that strips the [[x|y]] markup from certain elements and returns the pieces as a tuple if you're interested in manipulating them. As it stands, your code is nearly impossible to read. You don't need to push strings through complex logic gates like the ones you have; keep the logic to a bare minimum. You know, the Keep It Simple, Stupid rule.
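
One possible shape for such a link-stripping function, offered as a sketch rather than a complete wikitext handler: the names WIKILINK, split_wikilink and strip_wikilinks are invented here, and the pattern only covers simple, non-nested [[target]] and [[target|label]] links.

```python
import re

# Matches [[target]] or [[target|label]] without nested brackets
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def split_wikilink(value):
    """Return (target, label) for the first wikilink in value.

    For [[target]] the label equals the target; if value contains no
    link at all, both items are the stripped value.
    """
    match = WIKILINK.search(value)
    if not match:
        text = value.strip()
        return text, text
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    return target, label


def strip_wikilinks(value):
    """Replace every wikilink in value with its visible label."""
    return WIKILINK.sub(lambda m: (m.group(2) or m.group(1)).strip(), value).strip()
```

Here split_wikilink("[[Jay County, Indiana|Jay]]") gives the pair ("Jay County, Indiana", "Jay"), so the caller can pick the page title or the display text, while strip_wikilinks turns a mixed value like the leader_name field into plain text.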

answered Oct 8, 2012 at 23:16
\$\endgroup\$
4
  • \$\begingroup\$ The problem I'm running into is that latd may or may not contain a number. It might be in lat_d for that matter. So once I find the d, m and s (depending on which exist), I do the math to get a latitude. It's just hard, as there is no rhyme or reason to some fields. \$\endgroup\$ Commented Oct 8, 2012 at 23:37
  • \$\begingroup\$ info = [j.split("=") for j in [i for i in infobox.split('|')]][1:] looks a lot simpler than what I have to read a dictionary, but how does it handle a = or a | being in the value of a key? I started off with KISS, but each time I saw an issue, another if or replace went in to fix it. \$\endgroup\$ Commented Oct 8, 2012 at 23:38
  • \$\begingroup\$ Here's the beauty of this solution. You don't actually need the stuff behind the | in the value. For example, if you see |subdivision_type2 = [[List of counties in Indiana|County]], the next line is |subdivision_name2 = [[Jay County, Indiana|Jay]], and you throw away the first data piece, and you return '[[Jay County, Indiana' for wikidict['subdivision_name2'] \$\endgroup\$ Commented Oct 9, 2012 at 0:12
  • \$\begingroup\$ @Justin808 and in either case, the lats have lat at the beginning of the name, right? latkeys = [i for i in wikidict.keys() if all(('lat' in i, 'population' not in i))] (population has 'lat' in it) \$\endgroup\$ Commented Oct 9, 2012 at 0:19
