8
\$\begingroup\$

This is the repost of the following question as suggested by @HoboProber .

Again, if you don't know what is Schelling's model of segregation, you can read it here.

The Schelling model of segregation is an agent-based model that illustrates how individual tendencies regarding neighbors can lead to segregation. In the Schelling model, agents occupy cells of rectangular space. A cell can be occupied by a single agent only. Agents belong to one of two groups and are able to relocate according to the fraction of friends (i.e., agents of their own group) within a neighborhood around their location. The model's basic assumption is as follows: an agent, located in the center of a neighborhood where the fraction of friends f is less than a predefined tolerance threshold F (i.e., f < F), will try to relocate to a neighborhood for which the fraction of friends is at least f (i.e., f ≥ F)

You can get the shapefile here if you want to try it.

I have implemented most of the changes suggested and also did split up my code in 4 different classes. Here is my complete code.

geo_schelling_populate.py

import numpy as np
from shapely.geometry import Point
import geopandas as gp
import pandas as pd
class geo_schelling_populate:
 """ Generate the coordinates in a polygon (In this case a map of the state)
 on the basis of the given spacing and then randomly assign coordiantes 
 to different races and as empty houses. It also takes ratio and 
 empty_ratio in consideration. 
 Parameters
 ----------
 shapefile : str
 It is a string pointing to a geospatial vector data file which 
 has information for an american state's geometry data.
 spacing : float
 It defines the distance between the points in coordinate units.
 empty_ratio : float
 What percent of houses need to be kept empty.
 ratio : float
 What is the real ratio between the majority and minority in 
 reality in a given state.
 ratio : int
 Number of races. Currently the model is tested on 2 races, but 
 maybe in future support fot more races will be added.
 Attributes
 ----------
 shapefile : str
 Same as in parameter section
 spacing : float
 Same as in parameter section
 empty_ratio : float
 Same as in parameter section
 demographic_ratio : float
 Same as ratio in the parameter section
 races : int
 Same as in parameter section
 shape_file : geopandas.GeoDataFrame
 Pandas DataFrame with a geometry column generated from the .shp
 file
 df_allhouses : pandas.DataFrame
 Pandas DataFrame which contains all the different coordinates 
 generated inside a specific state.
 df_emptyhouses : pandas.DataFrame
 Pandas DataFrame which contains all the coordiantes which are 
 emptyhouses.
 df_agenthouses : pandas.DataFrame
 Pandas DataFrame which contains all the coordinates associated 
 with a race.
 Methods
 -------
 _generate_points()
 Private function! It generates the coordinates inside a given 
 shape. Returns a dataframe with all the coordiantes generated.
 populate()
 It populates the coordinates with either a race or denotes it 
 as an empty house. Returns a tuple with pandas dataframe of 
 empty houses and agent houses. Again agents are races in this 
 case.
 """
 def __init__(
 self,
 shapefile,
 spacing,
 empty_ratio,
 ratio,
 races=2,
 ):
 self.shapefile = shapefile
 self.spacing = spacing
 self.empty_ratio = empty_ratio
 self.demographic_ratio = ratio
 self.races = races
 self.shape_file = \
 gp.read_file(r"%s"%shapefile).explode().reset_index().drop(columns=['level_0'
 , 'level_1'])
 self.df_allhouses = pd.DataFrame([])
 self.df_emptyhouses = pd.DataFrame([])
 self.df_agenthouses = pd.DataFrame([])
 def _generate_points(self, polygon):
 """It returns a DataFrame with all the coordiantes inside a certain
 shape passed in as an parameter.
 Parameters
 ----------
 polygon : shapely.geometry.Polygon
 A polygon object which contains the geometry of a county in 
 a state.
 Returns
 -------
 A pandas DataFrame with all the coordiantes generated inside the 
 polygon object.
 """
 (minx, miny, maxx, maxy) = polygon.bounds
 x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)),
 self.spacing)
 y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)),
 self.spacing)
 grid = np.column_stack((np.meshgrid(x_coords,
 y_coords)[0].flatten(),
 np.meshgrid(x_coords,
 y_coords)[1].flatten()))
 df_points = pd.DataFrame.from_records(grid, columns=['X', 'Y'])
 df_points = df_points[df_points[['X', 'Y']].apply(lambda x: \
 Point(x[0], x[1]).within(polygon),
 axis=1)]
 return df_points.round(2)
 def populate(self):
 """ Populates the coordinates by assigning them a certain race or by 
 assigning them as empty houses.
 Parameters
 ----------
 No parameters
 Returns
 -------
 A tuple which consist two pandas DataFrames. One contains
 the coordiantes who have empty houses and the other contains
 the coordiante with a race assigned to them.
 """
 pd.set_option('mode.chained_assignment', None)
 self.df_allhouses = \
 pd.concat(iter(self.shape_file.geometry.apply(lambda x: \
 self._generate_points(x))), ignore_index=True)
 empty_ratio_seperator = round(self.empty_ratio
 * len(self.df_allhouses))
 self.df_allhouses['Agent/Empty'] = \
 np.random.permutation(np.asarray(['Empty']
 * empty_ratio_seperator
 + ['Agent'] * (len(self.df_allhouses)
 - empty_ratio_seperator)))
 self.df_emptyhouses = \
 self.df_allhouses[self.df_allhouses['Agent/Empty']
 == 'Empty'][['X', 'Y'
 ]].reset_index(drop=True)
 self.df_agenthouses = \
 self.df_allhouses[self.df_allhouses['Agent/Empty']
 == 'Agent'][['X', 'Y'
 ]].reset_index(drop=True)
 demographic_ratio_seperator = round(self.demographic_ratio
 * len(self.df_agenthouses))
 self.df_agenthouses['Race'] = \
 np.random.permutation(np.asarray([1]
 * demographic_ratio_seperator
 + [2] * (len(self.df_agenthouses)
 - demographic_ratio_seperator)))
 return (self.df_emptyhouses, self.df_agenthouses)

geo_schelling_update.py

import numpy as np
import math
class geo_schelling_update:
 """Updates the position of the races and the empty houses on the basis of 
 similarity threshold. Agents change positions if they have enough agents
 from their own races in a certain neighbourhood. In this scenario, 
 neighbourhood is defined as the eight nearest coordinate around the 
 coordinate we are checking satisfaction for.
 Parameters
 ----------
 n_iterations : int
 Maximum number of times, update method should run.
 spacing : float
 It defines the distance between the points in coordinate units.
 np_agenthouses : numpy array
 df_agenthouses converted to the numpy array for faster 
 computations.
 np_emptyhouses : numpy array
 df_emptyhouses converted to the numpy array for faster 
 computations.
 similarity_threshold : float
 What percent of similarity people want from their neighbours.
 Attributes
 ----------
 spacing : float
 Same as in parameter section
 n_iterations : int
 Same as in parameter section
 np_agenthouses : numpy array
 Same as in parameter section
 np_emptyhouses : numpy array
 Same as in parameter section
 similarity_threshold : float
 Same as in parameter section
 Methods
 -------
 _is_unsatisfied()
 Private function! It checks if an agent is satisfied or not at
 a certain position.
 _move_to_empty()
 Private function! Moves an unsatisified agent to a random empty
 house.
 _update_helper()
 Private functions! A helper function to help update method with 
 numpy's apply_along_axis method.
 update()
 It updates array as long as it reaches maximum number of 
 iterations or reaches a point where no agent is unsatisfied.
 Getter
 ------
 get_agenthouses()
 Returns an array with all the agents at a certain coordinate. 
 """
 def __init__(
 self,
 n_iterations,
 spacing,
 np_agenthouses,
 np_emptyhouses,
 similarity_threshold,
 ):
 self.spacing = spacing
 self.n_iterations = n_iterations
 self.np_emptyhouses = np_emptyhouses
 self.np_agenthouses = np_agenthouses
 self.similarity_threshold = similarity_threshold
 def _is_unsatisfied(self, x, y):
 """ Checks if an agent is unsatisfied at a certain position.
 Parameters
 ----------
 x : float
 x coordinate of the agent being checked
 y : float
 y coordinate of the agent being checked
 Returns
 -------
 True or False based on if the agent is satisfied or not.
 """
 race = np.extract(np.logical_and(np.equal(self.np_agenthouses[:
 , 0], x), np.equal(self.np_agenthouses[:, 1],
 y)), self.np_agenthouses[:, 2])[0]
 euclid_distance1 = round(math.hypot(self.spacing,
 self.spacing), 4)
 euclid_distance2 = self.spacing
 total_agents = \
 np.extract(np.logical_or(np.equal(np.round(np.hypot(self.np_agenthouses[:
 , 0] - x, self.np_agenthouses[:, 1] - y), 4),
 euclid_distance1),
 np.equal(np.round(np.hypot(self.np_agenthouses[:
 , 0] - x, self.np_agenthouses[:, 1] - y), 4),
 euclid_distance2)), self.np_agenthouses[:, 2])
 if total_agents.size == 0:
 return False
 else:
 return total_agents[total_agents == race].size \
 / total_agents.size < self.similarity_threshold
 def _move_to_empty(self, x, y):
 """Moves the agent to a new position if the agent is unsatisfied.
 Parameters
 ----------
 x : float
 x coordinate of the agent being checked
 y : float
 y coordinate of the agent being checked
 Returns
 ------- 
 None
 """
 race = np.extract(np.logical_and(np.equal(self.np_agenthouses[:
 , 0], x), np.equal(self.np_agenthouses[:, 1],
 y)), self.np_agenthouses[:, 2])[0]
 (x_new, y_new) = \
 self.np_emptyhouses[np.random.choice(self.np_emptyhouses.shape[0],
 1), :][0]
 self.np_agenthouses = \
 self.np_agenthouses[~np.logical_and(self.np_agenthouses[:,
 0] == x, self.np_agenthouses[:, 1]
 == y)]
 self.np_agenthouses = np.vstack([self.np_agenthouses, [x_new,
 y_new, race]])
 self.np_emptyhouses = \
 self.np_emptyhouses[~np.logical_and(self.np_emptyhouses[:,
 0] == x_new, self.np_emptyhouses[:, 1]
 == y_new)]
 self.np_emptyhouses = np.vstack([self.np_emptyhouses, [x, y]])
 def _update_helper(self, agent):
 """Helps the update function with number of changes made in every 
 iterations. 
 Parameters
 ----------
 agent : tuple
 x and y coordinates for the agent's position.
 Returns
 -------
 1 if the position of the agent is changed, 0 if not.
 """
 if self._is_unsatisfied(agent[0], agent[1]):
 self._move_to_empty(agent[0], agent[1])
 return 1
 else:
 return 0
 def update(self):
 """Main player in updating the array with all the agents' position.
 It updates the array until it reaches the iteration limit or the 
 number of changes become 0.
 Parameters
 ----------
 None
 Returns
 -------
 None
 """
 for i in np.arange(self.n_iterations):
 np_oldagenthouses = self.np_agenthouses.copy()
 n_changes = np.sum(np.apply_along_axis(self._update_helper,
 1, np_oldagenthouses))
 print('n Changes---->' + str(n_changes))
 print(i)
 if n_changes == 0:
 break
 def get_agenthouses(self):
 return self.np_agenthouses

geo_schelling_data.py

from shapely.geometry import Point
import pandas as pd
class geo_schelling_data:
 """Get the important data from the simulation required for the further 
 analysis. Following data is obtained:
 1) County Name
 2) Majority Population
 3) Minority Population
 4) Total Population
 5) Percentage of Majority Population
 6) Percentage of Minority Population
 Parameters
 ----------
 np_agenthouses : numpy array
 df_agenthouses converted to the numpy array for faster 
 computations.
 pd_shapefile : Pandas DataFrame
 Pandas DataFrame with a geometry column generated from the .shp
 file.
 Attributes
 ----------
 np_agenthouses : numpy array
 Same as in parameter section
 pd_shapefile : Pandas DataFrame
 Same as in parameter section
 df_county_data : Pandas DataFrame
 Pandas DataFrame with important data mentioned above.
 Methods
 -------
 _get_number_by_county()
 Private Function! It returns the number of majority and minority
 in a polygon.
 get_county_data()
 Returns the pandas DataFrame with the important data required for 
 the further analysis as mentioned above.
 """
 def __init__(self, np_agenthouses, pd_shapefile):
 self.np_agenthouses = np_agenthouses
 self.pd_shapefile = pd_shapefile
 self.df_county_data = pd.DataFrame([])
 def _get_number_by_county(self, geometry):
 """Returns the number of minority agents and majority agents within a 
 polygon. 
 Parameters
 ----------
 geometry : shapely.geometry.Polygon
 A polygon object which contains the geometry of a county in 
 a state.
 Returns
 -------
 A tuple with of minority agents and majority agents
 """
 num_majority = len([x[2] for x in list(self.np_agenthouses)
 if Point(x[0], x[1]).within(geometry)
 and x[2] == 1.0])
 num_minority = len([x[2] for x in list(self.np_agenthouses)
 if Point(x[0], x[1]).within(geometry)
 and x[2] == 2.0])
 return (num_majority, num_minority)
 def get_county_data(self):
 """Does calculation on Minority and Majority agents' number and returns
 a pandas dataframe for further analysis.
 Parameters
 ----------
 None
 Returns
 -------
 Pandas DataFrame
 """
 self.df_county_data['CountyName'] = self.pd_shapefile.NAMELSAD
 self.df_county_data['MajorityPop'] = \
 self.pd_shapefile.geometry.apply(lambda x: \
 self._get_number_by_county(x)[0])
 self.df_county_data['MinorityPop'] = \
 self.pd_shapefile.geometry.apply(lambda x: \
 self._get_number_by_county(x)[1])
 self.df_county_data['TotalPop'] = \
 self.df_county_data['MajorityPop'] \
 + self.df_county_data['MinorityPop']
 self.df_county_data['MajorityPopPercent'] = \
 self.df_county_data[['TotalPop', 'MajorityPop'
 ]].apply(lambda x: (0 if x['TotalPop']
 == 0 else x['MajorityPop'] / x['TotalPop']), axis=1)
 self.df_county_data['MinorityPopPercent'] = \
 self.df_county_data[['TotalPop', 'MinorityPop'
 ]].apply(lambda x: (0 if x['TotalPop']
 == 0 else x['MinorityPop'] / x['TotalPop']), axis=1)
 return self.df_county_data

geo_schelling_plot.py

from matplotlib import pyplot as plt 
class geo_schelling_plot:
 """Visualize the simulation as it helps us interpret and analyze the 
 results. 
 Parameters
 ----------
 np_agenthouses : numpy array
 df_agenthouses converted to the numpy array for faster 
 computations.
 pd_shapefile : Pandas Series
 Pandas Series of the geometry column generated from the .shp
 file.
 Attributes
 ----------
 np_agenthouses : numpy array
 Same as in parameter section
 pd_shapefile : Pandas Series
 Same as in parameter section
 Methods
 -------
 plot()
 Plots the agents and shape on the same graph.
 """
 def __init__(self, np_agenthouses, pd_geometry):
 self.np_agenthouses = np_agenthouses
 self.pd_geometry = pd_geometry
 def plot(self):
 """Plots the agents and shape on the same graph so that one can see
 where the different agents lies. It is very helpful in the case when 
 the simulation is finished and you can see the changes occured in 
 the plot.
 Yellow is majority
 Violet is minority
 Parameters
 ----------
 None
 Returns
 -------
 None
 """
 fig, ax = plt.subplots(figsize=(10, 10))
 self.pd_geometry.plot(ax=ax, color='white', edgecolor='black',linewidth=4)
 ax.scatter(self.np_agenthouses[:, 0], self.np_agenthouses[:,
 1], c=self.np_agenthouses[:, 2])
 ax.set_xticks([])
 ax.set_yticks([])

run.py

from geo_optimized.geo_schelling_data import geo_schelling_data
from geo_optimized.geo_schelling_plot import geo_schelling_plot
from geo_optimized.geo_schelling_populate import geo_schelling_populate
from geo_optimized.geo_schelling_update import geo_schelling_update
shapefile = "Path to the shapefile"
spacing = 0.05
empty_ratio = 0.2
similarity_threshhold = 0.65
n_iterations = 100
ratio = 0.41
if __name__ == '__main__':
 schelling_populate = geo_schelling_populate(shapefile,
 spacing,
 empty_ratio,
 ratio)
 df_emptyhouses, df_agenthouses = schelling_populate.populate()
 np_agenthouses = df_agenthouses.to_numpy()
 np_emptyhouses = df_emptyhouses.to_numpy()
 pd_geometry = schelling_populate.shape_file.geometry
 schelling_update = geo_schelling_update(n_iterations,
 spacing,
 np_agenthouses,
 np_emptyhouses,similarity_threshhold)
 schelling_update.update()
 np_agenthouses = schelling_update.get_agenthouses()
 pd_shapefile = schelling_populate.shape_file
 schelling_plot = geo_schelling_plot(np_agenthouses,pd_geometry)
 schelling_plot.plot()
 schelling_get_data = geo_schelling_data(np_agenthouses,pd_shapefile)
 df_county_data = schelling_get_data.get_county_data()

I got rid of most of the for loops and replace the code with pandas and numpy to do some faster looping and also some vectorize operations. I believe this code is much more cleaner and faster than the previous one but still some improvements in speed and documentation can be made.

I believe I can still make some operations vectorize but don't know how to do it. If someone can suggest those that would be great. Also if someone can help me with the better documentation of the code, that would be great too. Currently I am using numpy style docstrings.

Also if someone can help me optimize my code with numba to achieve C level speed that would be great. I believe that geo_schelling_update.py can be speed up with numba but I am unable to do so.

asked Sep 17, 2019 at 12:11
\$\endgroup\$
4
  • \$\begingroup\$ I can't find the shapefile on the link you posted \$\endgroup\$ Commented Sep 18, 2019 at 10:14
  • \$\begingroup\$ @MaartenFabré Please try now! \$\endgroup\$ Commented Sep 18, 2019 at 12:36
  • \$\begingroup\$ that is only the .shp. There are some associated files which are needed gis.stackexchange.com/a/262509 \$\endgroup\$ Commented Sep 18, 2019 at 12:39
  • \$\begingroup\$ Try now please! \$\endgroup\$ Commented Sep 18, 2019 at 12:43

2 Answers 2

4
\$\begingroup\$

Code style

Try to use an IDE which integrates with linters (Pycodestyle, Pylama, Mypy,...). This alone found some 97 warning, ranging from no whitespaces after a comma, trailing whitespace, redundant backslashes, closing brackets not matching the indentation,...

All of these are no big issues, but they make the code look messy, and are easy to fix. I use a code formatter (black, but there is also yapf) with a maximum line length of 79 to take care of these smaller issues for me

Classes should be in CapWords

The np_ prefix in some variable names is not helpful

Python is not JAVA

Not everything needs to be in a class, and not every class needs to be in a separate file.

pd.set_option('mode.chained_assignment', None)

This is a sign that you are doing something dangerous with views or copies, and data might be lost when you change a subset. It is better to use .loc then to make sure you get a copy, and not a view

ratio

You have 2 ratio's, so better call the second demographic_ratio or something

keyword-only arguments

Methods with a lot of arguments can cause confusion, and are called with the wrong order from time to time. To prevent this, use keyword-only arguments if there are a lot, especially if they are of the same type, so the code does not trow an error immediately, but just gives a garbage answer

occupation

df_allhouses["Agent/Empty"] lists whether a property is occupied. Since this is simply a flag, you can use 0 and 1 of True and False instead of "Empty" or "Agent" This will simplify a lot of the further processing. There is also no need to make this a column in df_allhouses.

I would also extract the method to provide this random population to a separate method:

def random_population(size, ratio):
 samples = np.zeros(size, dtype=np.bool)
 samples[: round(size * ratio)] = 1
 return np.random.permutation(samples)

So the houses that are ocupied are defined by occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)

architecture

What geo_schelling_populate does is provide an empty simulation, that you later populate. You never do anything else with this object, apart from getting the shapefile. A more logic architecture would be a method to read the shapefile, and another method to deliver a populated simulation. No need for the class, and no need for the extra file

This is an interesting talk: Stop writing classes by Jack Diederich

I will convert the example of the populate here, but the same way of working can be done for the other parts. No need for a class, with just an __init__ that populates the object variables, and then 1 action method that uses those object variables. It is better to just pass those as argument to a function

There is still a lot of other stuff to improve, especially on vectorisation, but I don't have time for that at this moment


from pathlib import Path
import geopandas as gp
import numpy as np
import pandas as pd
from shapely.geometry import Point
def _generate_points(polygon, spacing):
 """It returns a DataFrame with all the coordiantes inside a certain
 shape passed in as an parameter.
 Parameters
 ----------
 polygon : shapely.geometry.Polygon
 A polygon object which contains the geometry of a county in
 a state.
 Returns
 -------
 A pandas DataFrame with all the coordiantes generated inside the
 polygon object.
 """
 (minx, miny, maxx, maxy) = polygon.bounds
 x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)), spacing)
 y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)), spacing)
 grid = np.column_stack(
 (
 np.meshgrid(x_coords, y_coords)[0].flatten(),
 np.meshgrid(x_coords, y_coords)[1].flatten(),
 )
 )
 df_points = pd.DataFrame.from_records(grid, columns=["X", "Y"])
 df_points = df_points[
 df_points[["X", "Y"]].apply(
 lambda x: Point(x[0], x[1]).within(polygon), axis=1
 )
 ]
 return df_points.round(2)
def random_population(size, ratio):
 samples = np.zeros(size, dtype=np.bool)
 samples[: round(size * ratio)] = 1
 return np.random.permutation(samples)
def populate_simulation(
 *,
 shape_file: gp.GeoDataFrame,
 spacing: float,
 empty_ratio: float,
 demographic_ratio: float,
 races=2
):
 """provides a random populated simulation
 ...
 """
 all_houses = pd.concat(
 iter(
 shape_file.geometry.apply(lambda x: _generate_points(x, spacing))
 ),
 ignore_index=True,
 )
 occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)
 agent_houses = all_houses.loc[occupied, ["X", "Y"]].reset_index(drop=True)
 empty_houses = all_houses.loc[~occupied, ["X", "Y"]].reset_index(drop=True)
 agent_houses["Race"] = random_population(
 len(agent_houses), demographic_ratio
 )
 # +1 if you need it to be 1 and 2
 return empty_houses, agent_houses
if __name__ == "__main__":
 shapefilename = Path(r"../../data/CA_Counties_TIGER.shp")
 shape_file = gp.read_file(shapefilename) # no need
 spacing = 0.05
 empty_ratio = 0.2
 similarity_threshhold = 0.65
 n_iterations = 100
 demographic_ratio = 0.41
 empty_houses, agent_houses = populate_simulation(
 shape_file=shape_file,
 spacing=spacing,
 empty_ratio=empty_ratio,
 demographic_ratio=demographic_ratio,
 races=2,
 )
 ...

Part 2:

random seed

When simulating, always give the possibility to give a random seed, so you can repeat the same simulation to verify certain results.

Geopandas

You do a lot of distance calculations, and checks whether coordinates are withing Polygons semi-manually. It is a lot cleaner if you can let GeoPandas do that for you. If you keep the coordinates as Points, instead of x and y columns:

def generate_points(polygon, spacing):
 (minx, miny, maxx, maxy) = polygon.bounds
 x_coords = np.arange(np.floor(minx), (np.ceil(maxx)), spacing)
 y_coords = np.arange(np.floor(miny), (np.ceil(maxy)), spacing)
 grid_x, grid_y = map(np.ndarray.flatten, np.meshgrid(x_coords, y_coords))
 grid = gp.GeoSeries([Point(x, y) for x, y in zip(grid_x, grid_y)])
 return grid[grid.intersects(polygon)].reset_index(drop=True)

To get the houses in Los Angeles County:

generate_points(shape_file.loc[5, "geometry"], .05).plot()

Los Angeles County

grid creation

Best would be to extract the grid creation. This way you could in the future reuse the raw grid, with different simulation characteristics.

def create_grid(counties: gp.GeoSeries, spacing: float, random_seed=None):
 return gp.GeoDataFrame(
 pd.concat(
 {
 county.NAME: generate_points(county.geometry, spacing)
 for county in counties.itertuples()
 },
 names=["county"],
 )
 .rename("geometry")
 .reset_index(level="county")
 .reset_index(drop=True)
 .astype({"county": "category"}),
 geometry="geometry",
 )
all_houses = create_grid(counties=shape_file, spacing=spacing, random_seed=0)
def populate_simulation(
 *,
 all_houses,
 empty_ratio: float,
 demographic_ratio: float,
 races=2,
 random_seed=None,
):
 """provides a random populated simulation
 ...
 """
 if random_seed is not None:
 np.random.seed(random_seed)
 occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)
 race = random_population(size=int(occupied.sum()), ratio=demographic_ratio)
 agent_houses = gp.GeoDataFrame(
 {
 "race": race.astype(int),
 "county": all_houses.loc[occupied, "county"],
 },
 geometry=all_houses.loc[occupied, "geometry"].reset_index(drop=True),
 )
 empty_houses = all_houses[~occupied].reset_index(drop=True)
 return empty_houses, agent_houses
empty_houses, agent_houses = populate_simulation(
 all_houses=all_houses,
 empty_ratio=.1,
 demographic_ratio=.3,
 races=2,
 random_seed=0,
)

The agent_houses then looks like this:

 Race geometry
0 0 POINT (-4 -9)
1 0 POINT (-3 -9)
2 0 POINT (-2 -9)
3 1 POINT (-1 -9)
4 0 POINT (0 -9)

To plot this:

 agent_houses.plot(column="race", categorical=True, figsize=(10, 10))

California population

To check the neighbours who live within a perimeter of an agent is simple:

def get_neighbours(agent, agent_houses: gp.GeoDataFrame, spacing):
 """
 returns all the agents that live within `perimeter` of the `agent`
 The `agent` excluding"""
 surroundings = agent.geometry.buffer(spacing * 1.5) 
 return agent_houses.loc[
 agent_houses.intersects(surroundings)
 & (agent_houses != agent).any(axis=1)
 ]

this can be tested:

agent = agent_houses.loc[4]
neighbours = get_neighbours(agent, agent_houses, radius=0.05 * 5)
neighbours
 race county geometry
5 0 Sierra POINT (-120.4500000000001 39.44999999999997)
17 1 Sierra POINT (-120.5500000000001 39.49999999999997)
18 0 Sierra POINT (-120.5000000000001 39.49999999999997)
19 0 Sierra POINT (-120.4500000000001 39.49999999999997)
12321 0 Nevada POINT (-120.5500000000001 39.39999999999998)
12322 0 Nevada POINT (-120.5000000000001 39.39999999999998)
12323 0 Nevada POINT (-120.4500000000001 39.39999999999998)
12334 0 Nevada POINT (-120.5500000000001 39.44999999999997)

This is rather slow (500ms for a search among 15510 occupied houses)

Of you add x and y columns to a copy of the original:

agent_houses_b = agent_houses.assign(x=agent_houses.geometry.x, y=agent_houses.geometry.y)

and then use these column:

def get_neighbours3(agent, agent_houses: gp.GeoDataFrame, spacing):
 """
 returns all the agents that live within `perimeter` of the `agent`
 The `agent` excluding"""
 close_x = (agent_houses.x - agent.geometry.x).abs() < spacing * 1.1
 close_y = (agent_houses.y - agent.geometry.y).abs() < spacing * 1.1
 return agent_houses.loc[
 (agent_houses.index != agent.name) # skip the original agent
 & (close_x)
 & (close_y)
 ]
get_neighbours3(agent, agent_houses_b, spacing)

returns in about 3ms

This will possibly go even faster of you do it per county, and use DataFrame.groupby.transform if the people on the border of one county don't count as neighbours for people of a neighbouring county

To find out who is satisfied:

satisfied_agents = pd.Series(
 {
 id_: is_satisfied(
 agent=agent,
 agent_houses=agent_houses_b,
 spacing=spacing,
 similarity_threshold=similarity_threshold,
 )
 for id_, agent in agent_houses.iterrows()
 },
 name="satisfied",
)

value_counts

Checking whether an agent is satisfied becomes simple, just using pd.Series.value_counts

def is_satisfied(*, agent, agent_houses, spacing, similarity_threshold):
 neighbours = get_neighbours3(agent, agent_houses, spacing=spacing)
 if neighbours.empty:
 return False
 group_counts = neighbours["race"].value_counts()
 return group_counts.get(agent["race"], 0) / len(neighbours) < similarity_threshold

To get the count per race:

agent_houses.groupby(["race"])["geometry"].count().rename("count")

To get the count in a county:

agent_houses.groupby(["county", "race"])["geometry"].count().rename("count")

update

One iteration in the update can then be described as:

def update(agent_houses, empty_houses, spacing, similarity_threshold):
 agent_houses_b = agent_houses.assign(
 x=agent_houses.geometry.x, y=agent_houses.geometry.y
 )
 satisfied_agents = pd.Series(
 {
 id_: is_satisfied(
 agent=agent,
 agent_houses=agent_houses_b,
 spacing=spacing,
 similarity_threshold=similarity_threshold,
 )
 for id_, agent in agent_houses.iterrows()
 },
 name="satisfied",
 )
 open_houses = pd.concat(
 (
 agent_houses.loc[~satisfied_agents, ["county", "geometry"]],
 empty_houses,
 ),
 ignore_index=True,
 )
 new_picks = np.random.choice(
 open_houses.index, size=(~satisfied_agents).sum(), replace=False
 )
 new_agent_houses = agent_houses.copy()
 new_agent_houses.loc[
 ~satisfied_agents, ["county", "geometry"]
 ] = open_houses.loc[new_picks]
 new_empty_houses = open_houses.drop(new_picks).reset_index(drop=True)
 return new_empty_houses, new_agent_houses

This redistributes all the empty houses and the houses of the people unsatisfied, simulating a instantaneous move of all the unsatisfied people


part 3

spatial datastructures.

There are some datastructures which are explicitly meant for spatial data, and finding nearest neighbours.

a scipy.spatial.KDTree (or implemented in cython cKDTree) is specifically meant for stuff like this, and will speed up searches a lot when going to large grid.

from scipy.spatial import cKDTree
grid_x = np.array([grid["geometry"].x, grid["geometry"].y, ]).T
tree = cKDTree(grid_xy)

To query:

tree.query_ball_point((-120.4, 35.7), spacing * 1.5)

This query only takes 70μs for those 17233 grid points, which is 30 times faster than get_neighbours3.

You can even look for all neighbour pairs with tree.query_pairs(spacing * 1.5). This takes about as much time as 1 neighbour lookup in neighbours3

This means you can prepopulate a dict with all neighbours:

all_neighbours = defaultdict(list)
for i, j in tree.query_pairs(spacing * 1.5):
 all_neighbours[i].append(j)
 all_neighbours[j].append(i)

If you now keep the information on occupation and race in 2 separate numpy arrays, you can quickly look for all satisfied people:

occupied = random_population(size=len(grid), ratio=1 - empty_ratio)
race = random_population(size=int(occupied.sum()), ratio=demographic_ratio)
def is_satisfied2(agent, *, all_neighbours, occupied, race, similarity_index):
 if not occupied[agent] or agent not in all_neighbours:
 return False
 neighbours = all_neighbours[agent]
 neighbours_occupied = occupied[neighbours].sum()
 neighbours_same_race = (
 occupied[neighbours] & (race[neighbours] == race[agent])
 ).sum()
 return (neighbours_same_race / neighbours_occupied) > similarity_index

and all the satisfied people:

satisfied_agents = np.array(
 [
 is_satisfied2(
 agent,
 all_neighbours=all_neighbours,
 occupied=occupied,
 race=race,
 similarity_index=similarity_index,
 )
 for agent in np.arange(len(grid))
 ]
)

The people who want to move:

on_the_move = ~satisfied_agents & occupied

And the houses that are free is either free_houses = ~satisfied_agents or free_houses = ~occupied depending on your definition.

So update becomes as simple as:

def update(*, occupied, race, all_neighbours, similarity_index):
 satisfied_agents = np.array(
 [
 is_satisfied2(
 agent,
 all_neighbours=all_neighbours,
 occupied=occupied,
 race=race,
 similarity_index=similarity_index,
 )
 for agent in np.arange(len(grid))
 ]
 )
 on_the_move = ~satisfied_agents & occupied
 free_houses = ~satisfied_agents
 # or
 # free_houses = ~occupied
 assert len(on_the_move) <= len(free_houses) # they need a place to go
 new_houses = np.random.choice(
 free_houses, size=len(on_the_move), replace=False
 )
 new_occupied = occupied[:]
 new_occupied[on_the_move] = False
 new_occupied[new_houses] = True
 new_race = race[:]
 new_race[new_houses] = race[on_the_move]
 return new_occupied, new_race
answered Sep 17, 2019 at 15:46
\$\endgroup\$
5
  • \$\begingroup\$ Thank you for this. I really appreciate your answer. But np.random.binomial doesn't give me the exact number of 1s and 2s as I really need that \$\endgroup\$ Commented Sep 17, 2019 at 18:04
  • \$\begingroup\$ You are correct. I corrected my code \$\endgroup\$ Commented Sep 18, 2019 at 10:14
  • \$\begingroup\$ Thanks a lot for giving me all the different ideas here. I will definitely implement them in my code and hopefully make it run faster and also make it cleaner but I think that one thing you didn't understand about the simulation is that the update method can't work the way you are trying it to work. When an agent is unsatisfied, it needed to be moved to an randomly chosen empty space right away because lets say if we get 5000 unsatisfied agents and only have 2000 empty houses, where would be the other 3000 agents would go. An empty house need to be available all the time. \$\endgroup\$ Commented Sep 27, 2019 at 20:51
  • \$\begingroup\$ Which is why I decided the free houses as the ones not occupied by satisfied people, so a discontent person can move to a horse of another discontent \$\endgroup\$ Commented Sep 28, 2019 at 18:27
  • \$\begingroup\$ Also in your function populate_simulation, its not really populating Santa Barbara, LA and Ventura mainland \$\endgroup\$ Commented Sep 30, 2019 at 5:45
5
\$\begingroup\$

Documentation

The amount of documentation you've written is ambitious, but its arrangement is slightly unhelpful for a few reasons.

When documenting the "parameters to a class", you're really documenting the parameters to __init__. As such, this block:

 """
 Parameters
 ----------
 shapefile : str
 It is a string pointing to a geospatial vector data file which 
 has information for an american state's geometry data.
 spacing : float
 It defines the distance between the points in coordinate units.
 empty_ratio : float
 What percent of houses need to be kept empty.
 ratio : float
 What is the real ratio between the majority and minority in 
 reality in a given state.
 ratio : int
 Number of races. Currently the model is tested on 2 races, but 
 maybe in future support fot more races will be added.
 """

should be moved to the docstring of __init__.

Attributes generally aren't documented at all because they should mostly be private (prefixed with an underscore) and discouraged from public access except through functions. Regardless, such documentation should probably go into comments against the actual initialization of members in __init__.

Move the documentation for "Methods" to docstrings on each method.

Typo

coordiantes = coordinates

Modern IDEs have spell checking support, such as PyCharm.

Indentation

Perhaps moreso than in any other language, clear indentation in Python is critical. This:

 self.df_county_data['MinorityPopPercent'] = \
 self.df_county_data[['TotalPop', 'MinorityPop'
 ]].apply(lambda x: (0 if x['TotalPop']
 == 0 else x['MinorityPop'] / x['TotalPop']), axis=1)

could be better represented like so:

self.df_county_data['MinorityPopPercent'] = (
 self.df_county_data[
 ['TotalPop', 'MinorityPop']
 ].apply(
 lambda x: (
 0 if x['TotalPop'] == 0 else x['MinorityPop'] / x['TotalPop']
 ),
 axis=1
 )
)

That being said, this is somewhat over-extending the usefulness of a lambda; you're probably better off just writing a function to pass to apply.

answered Sep 17, 2019 at 13:20
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.