This is the repost of the following question as suggested by @HoboProber .
Again, if you don't know what is Schelling's model of segregation, you can read it here.
The Schelling model of segregation is an agent-based model that illustrates how individual tendencies regarding neighbors can lead to segregation. In the Schelling model, agents occupy cells of rectangular space. A cell can be occupied by a single agent only. Agents belong to one of two groups and are able to relocate according to the fraction of friends (i.e., agents of their own group) within a neighborhood around their location. The model's basic assumption is as follows: an agent, located in the center of a neighborhood where the fraction of friends f is less than a predefined tolerance threshold F (i.e., f < F), will try to relocate to a neighborhood for which the fraction of friends is at least f (i.e., f ≥ F)
You can get the shapefile here if you want to try it.
I have implemented most of the changes suggested and also did split up my code in 4 different classes. Here is my complete code.
geo_schelling_populate.py
import numpy as np
from shapely.geometry import Point
import geopandas as gp
import pandas as pd
class geo_schelling_populate:
""" Generate the coordinates in a polygon (In this case a map of the state)
on the basis of the given spacing and then randomly assign coordiantes
to different races and as empty houses. It also takes ratio and
empty_ratio in consideration.
Parameters
----------
shapefile : str
It is a string pointing to a geospatial vector data file which
has information for an american state's geometry data.
spacing : float
It defines the distance between the points in coordinate units.
empty_ratio : float
What percent of houses need to be kept empty.
ratio : float
What is the real ratio between the majority and minority in
reality in a given state.
ratio : int
Number of races. Currently the model is tested on 2 races, but
maybe in future support fot more races will be added.
Attributes
----------
shapefile : str
Same as in parameter section
spacing : float
Same as in parameter section
empty_ratio : float
Same as in parameter section
demographic_ratio : float
Same as ratio in the parameter section
races : int
Same as in parameter section
shape_file : geopandas.GeoDataFrame
Pandas DataFrame with a geometry column generated from the .shp
file
df_allhouses : pandas.DataFrame
Pandas DataFrame which contains all the different coordinates
generated inside a specific state.
df_emptyhouses : pandas.DataFrame
Pandas DataFrame which contains all the coordiantes which are
emptyhouses.
df_agenthouses : pandas.DataFrame
Pandas DataFrame which contains all the coordinates associated
with a race.
Methods
-------
_generate_points()
Private function! It generates the coordinates inside a given
shape. Returns a dataframe with all the coordiantes generated.
populate()
It populates the coordinates with either a race or denotes it
as an empty house. Returns a tuple with pandas dataframe of
empty houses and agent houses. Again agents are races in this
case.
"""
def __init__(
self,
shapefile,
spacing,
empty_ratio,
ratio,
races=2,
):
self.shapefile = shapefile
self.spacing = spacing
self.empty_ratio = empty_ratio
self.demographic_ratio = ratio
self.races = races
self.shape_file = \
gp.read_file(r"%s"%shapefile).explode().reset_index().drop(columns=['level_0'
, 'level_1'])
self.df_allhouses = pd.DataFrame([])
self.df_emptyhouses = pd.DataFrame([])
self.df_agenthouses = pd.DataFrame([])
def _generate_points(self, polygon):
"""It returns a DataFrame with all the coordiantes inside a certain
shape passed in as an parameter.
Parameters
----------
polygon : shapely.geometry.Polygon
A polygon object which contains the geometry of a county in
a state.
Returns
-------
A pandas DataFrame with all the coordiantes generated inside the
polygon object.
"""
(minx, miny, maxx, maxy) = polygon.bounds
x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)),
self.spacing)
y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)),
self.spacing)
grid = np.column_stack((np.meshgrid(x_coords,
y_coords)[0].flatten(),
np.meshgrid(x_coords,
y_coords)[1].flatten()))
df_points = pd.DataFrame.from_records(grid, columns=['X', 'Y'])
df_points = df_points[df_points[['X', 'Y']].apply(lambda x: \
Point(x[0], x[1]).within(polygon),
axis=1)]
return df_points.round(2)
def populate(self):
""" Populates the coordinates by assigning them a certain race or by
assigning them as empty houses.
Parameters
----------
No parameters
Returns
-------
A tuple which consist two pandas DataFrames. One contains
the coordiantes who have empty houses and the other contains
the coordiante with a race assigned to them.
"""
pd.set_option('mode.chained_assignment', None)
self.df_allhouses = \
pd.concat(iter(self.shape_file.geometry.apply(lambda x: \
self._generate_points(x))), ignore_index=True)
empty_ratio_seperator = round(self.empty_ratio
* len(self.df_allhouses))
self.df_allhouses['Agent/Empty'] = \
np.random.permutation(np.asarray(['Empty']
* empty_ratio_seperator
+ ['Agent'] * (len(self.df_allhouses)
- empty_ratio_seperator)))
self.df_emptyhouses = \
self.df_allhouses[self.df_allhouses['Agent/Empty']
== 'Empty'][['X', 'Y'
]].reset_index(drop=True)
self.df_agenthouses = \
self.df_allhouses[self.df_allhouses['Agent/Empty']
== 'Agent'][['X', 'Y'
]].reset_index(drop=True)
demographic_ratio_seperator = round(self.demographic_ratio
* len(self.df_agenthouses))
self.df_agenthouses['Race'] = \
np.random.permutation(np.asarray([1]
* demographic_ratio_seperator
+ [2] * (len(self.df_agenthouses)
- demographic_ratio_seperator)))
return (self.df_emptyhouses, self.df_agenthouses)
geo_schelling_update.py
import numpy as np
import math
class geo_schelling_update:
"""Updates the position of the races and the empty houses on the basis of
similarity threshold. Agents change positions if they have enough agents
from their own races in a certain neighbourhood. In this scenario,
neighbourhood is defined as the eight nearest coordinate around the
coordinate we are checking satisfaction for.
Parameters
----------
n_iterations : int
Maximum number of times, update method should run.
spacing : float
It defines the distance between the points in coordinate units.
np_agenthouses : numpy array
df_agenthouses converted to the numpy array for faster
computations.
np_emptyhouses : numpy array
df_emptyhouses converted to the numpy array for faster
computations.
similarity_threshold : float
What percent of similarity people want from their neighbours.
Attributes
----------
spacing : float
Same as in parameter section
n_iterations : int
Same as in parameter section
np_agenthouses : numpy array
Same as in parameter section
np_emptyhouses : numpy array
Same as in parameter section
similarity_threshold : float
Same as in parameter section
Methods
-------
_is_unsatisfied()
Private function! It checks if an agent is satisfied or not at
a certain position.
_move_to_empty()
Private function! Moves an unsatisified agent to a random empty
house.
_update_helper()
Private functions! A helper function to help update method with
numpy's apply_along_axis method.
update()
It updates array as long as it reaches maximum number of
iterations or reaches a point where no agent is unsatisfied.
Getter
------
get_agenthouses()
Returns an array with all the agents at a certain coordinate.
"""
def __init__(
self,
n_iterations,
spacing,
np_agenthouses,
np_emptyhouses,
similarity_threshold,
):
self.spacing = spacing
self.n_iterations = n_iterations
self.np_emptyhouses = np_emptyhouses
self.np_agenthouses = np_agenthouses
self.similarity_threshold = similarity_threshold
def _is_unsatisfied(self, x, y):
""" Checks if an agent is unsatisfied at a certain position.
Parameters
----------
x : float
x coordinate of the agent being checked
y : float
y coordinate of the agent being checked
Returns
-------
True or False based on if the agent is satisfied or not.
"""
race = np.extract(np.logical_and(np.equal(self.np_agenthouses[:
, 0], x), np.equal(self.np_agenthouses[:, 1],
y)), self.np_agenthouses[:, 2])[0]
euclid_distance1 = round(math.hypot(self.spacing,
self.spacing), 4)
euclid_distance2 = self.spacing
total_agents = \
np.extract(np.logical_or(np.equal(np.round(np.hypot(self.np_agenthouses[:
, 0] - x, self.np_agenthouses[:, 1] - y), 4),
euclid_distance1),
np.equal(np.round(np.hypot(self.np_agenthouses[:
, 0] - x, self.np_agenthouses[:, 1] - y), 4),
euclid_distance2)), self.np_agenthouses[:, 2])
if total_agents.size == 0:
return False
else:
return total_agents[total_agents == race].size \
/ total_agents.size < self.similarity_threshold
def _move_to_empty(self, x, y):
"""Moves the agent to a new position if the agent is unsatisfied.
Parameters
----------
x : float
x coordinate of the agent being checked
y : float
y coordinate of the agent being checked
Returns
-------
None
"""
race = np.extract(np.logical_and(np.equal(self.np_agenthouses[:
, 0], x), np.equal(self.np_agenthouses[:, 1],
y)), self.np_agenthouses[:, 2])[0]
(x_new, y_new) = \
self.np_emptyhouses[np.random.choice(self.np_emptyhouses.shape[0],
1), :][0]
self.np_agenthouses = \
self.np_agenthouses[~np.logical_and(self.np_agenthouses[:,
0] == x, self.np_agenthouses[:, 1]
== y)]
self.np_agenthouses = np.vstack([self.np_agenthouses, [x_new,
y_new, race]])
self.np_emptyhouses = \
self.np_emptyhouses[~np.logical_and(self.np_emptyhouses[:,
0] == x_new, self.np_emptyhouses[:, 1]
== y_new)]
self.np_emptyhouses = np.vstack([self.np_emptyhouses, [x, y]])
def _update_helper(self, agent):
"""Helps the update function with number of changes made in every
iterations.
Parameters
----------
agent : tuple
x and y coordinates for the agent's position.
Returns
-------
1 if the position of the agent is changed, 0 if not.
"""
if self._is_unsatisfied(agent[0], agent[1]):
self._move_to_empty(agent[0], agent[1])
return 1
else:
return 0
def update(self):
"""Main player in updating the array with all the agents' position.
It updates the array until it reaches the iteration limit or the
number of changes become 0.
Parameters
----------
None
Returns
-------
None
"""
for i in np.arange(self.n_iterations):
np_oldagenthouses = self.np_agenthouses.copy()
n_changes = np.sum(np.apply_along_axis(self._update_helper,
1, np_oldagenthouses))
print('n Changes---->' + str(n_changes))
print(i)
if n_changes == 0:
break
def get_agenthouses(self):
return self.np_agenthouses
geo_schelling_data.py
from shapely.geometry import Point
import pandas as pd
class geo_schelling_data:
"""Get the important data from the simulation required for the further
analysis. Following data is obtained:
1) County Name
2) Majority Population
3) Minority Population
4) Total Population
5) Percentage of Majority Population
6) Percentage of Minority Population
Parameters
----------
np_agenthouses : numpy array
df_agenthouses converted to the numpy array for faster
computations.
pd_shapefile : Pandas DataFrame
Pandas DataFrame with a geometry column generated from the .shp
file.
Attributes
----------
np_agenthouses : numpy array
Same as in parameter section
pd_shapefile : Pandas DataFrame
Same as in parameter section
df_county_data : Pandas DataFrame
Pandas DataFrame with important data mentioned above.
Methods
-------
_get_number_by_county()
Private Function! It returns the number of majority and minority
in a polygon.
get_county_data()
Returns the pandas DataFrame with the important data required for
the further analysis as mentioned above.
"""
def __init__(self, np_agenthouses, pd_shapefile):
self.np_agenthouses = np_agenthouses
self.pd_shapefile = pd_shapefile
self.df_county_data = pd.DataFrame([])
def _get_number_by_county(self, geometry):
"""Returns the number of minority agents and majority agents within a
polygon.
Parameters
----------
geometry : shapely.geometry.Polygon
A polygon object which contains the geometry of a county in
a state.
Returns
-------
A tuple with of minority agents and majority agents
"""
num_majority = len([x[2] for x in list(self.np_agenthouses)
if Point(x[0], x[1]).within(geometry)
and x[2] == 1.0])
num_minority = len([x[2] for x in list(self.np_agenthouses)
if Point(x[0], x[1]).within(geometry)
and x[2] == 2.0])
return (num_majority, num_minority)
def get_county_data(self):
"""Does calculation on Minority and Majority agents' number and returns
a pandas dataframe for further analysis.
Parameters
----------
None
Returns
-------
Pandas DataFrame
"""
self.df_county_data['CountyName'] = self.pd_shapefile.NAMELSAD
self.df_county_data['MajorityPop'] = \
self.pd_shapefile.geometry.apply(lambda x: \
self._get_number_by_county(x)[0])
self.df_county_data['MinorityPop'] = \
self.pd_shapefile.geometry.apply(lambda x: \
self._get_number_by_county(x)[1])
self.df_county_data['TotalPop'] = \
self.df_county_data['MajorityPop'] \
+ self.df_county_data['MinorityPop']
self.df_county_data['MajorityPopPercent'] = \
self.df_county_data[['TotalPop', 'MajorityPop'
]].apply(lambda x: (0 if x['TotalPop']
== 0 else x['MajorityPop'] / x['TotalPop']), axis=1)
self.df_county_data['MinorityPopPercent'] = \
self.df_county_data[['TotalPop', 'MinorityPop'
]].apply(lambda x: (0 if x['TotalPop']
== 0 else x['MinorityPop'] / x['TotalPop']), axis=1)
return self.df_county_data
geo_schelling_plot.py
from matplotlib import pyplot as plt
class geo_schelling_plot:
"""Visualize the simulation as it helps us interpret and analyze the
results.
Parameters
----------
np_agenthouses : numpy array
df_agenthouses converted to the numpy array for faster
computations.
pd_shapefile : Pandas Series
Pandas Series of the geometry column generated from the .shp
file.
Attributes
----------
np_agenthouses : numpy array
Same as in parameter section
pd_shapefile : Pandas Series
Same as in parameter section
Methods
-------
plot()
Plots the agents and shape on the same graph.
"""
def __init__(self, np_agenthouses, pd_geometry):
self.np_agenthouses = np_agenthouses
self.pd_geometry = pd_geometry
def plot(self):
"""Plots the agents and shape on the same graph so that one can see
where the different agents lies. It is very helpful in the case when
the simulation is finished and you can see the changes occured in
the plot.
Yellow is majority
Violet is minority
Parameters
----------
None
Returns
-------
None
"""
fig, ax = plt.subplots(figsize=(10, 10))
self.pd_geometry.plot(ax=ax, color='white', edgecolor='black',linewidth=4)
ax.scatter(self.np_agenthouses[:, 0], self.np_agenthouses[:,
1], c=self.np_agenthouses[:, 2])
ax.set_xticks([])
ax.set_yticks([])
run.py
from geo_optimized.geo_schelling_data import geo_schelling_data
from geo_optimized.geo_schelling_plot import geo_schelling_plot
from geo_optimized.geo_schelling_populate import geo_schelling_populate
from geo_optimized.geo_schelling_update import geo_schelling_update
shapefile = "Path to the shapefile"
spacing = 0.05
empty_ratio = 0.2
similarity_threshhold = 0.65
n_iterations = 100
ratio = 0.41
if __name__ == '__main__':
schelling_populate = geo_schelling_populate(shapefile,
spacing,
empty_ratio,
ratio)
df_emptyhouses, df_agenthouses = schelling_populate.populate()
np_agenthouses = df_agenthouses.to_numpy()
np_emptyhouses = df_emptyhouses.to_numpy()
pd_geometry = schelling_populate.shape_file.geometry
schelling_update = geo_schelling_update(n_iterations,
spacing,
np_agenthouses,
np_emptyhouses,similarity_threshhold)
schelling_update.update()
np_agenthouses = schelling_update.get_agenthouses()
pd_shapefile = schelling_populate.shape_file
schelling_plot = geo_schelling_plot(np_agenthouses,pd_geometry)
schelling_plot.plot()
schelling_get_data = geo_schelling_data(np_agenthouses,pd_shapefile)
df_county_data = schelling_get_data.get_county_data()
I got rid of most of the for
loops and replace the code with pandas and numpy to do some faster looping and also some vectorize operations.
I believe this code is much more cleaner and faster than the previous one but still some improvements in speed and documentation can be made.
I believe I can still make some operations vectorize but don't know how to do it. If someone can suggest those that would be great. Also if someone can help me with the better documentation of the code, that would be great too. Currently I am using numpy style docstrings.
Also if someone can help me optimize my code with numba
to achieve C level speed that would be great. I believe that geo_schelling_update.py can be speed up with numba but I am unable to do so.
2 Answers 2
Code style
Try to use an IDE which integrates with linters (Pycodestyle, Pylama, Mypy,...). This alone found some 97 warning, ranging from no whitespaces after a comma, trailing whitespace, redundant backslashes, closing brackets not matching the indentation,...
All of these are no big issues, but they make the code look messy, and are easy to fix. I use a code formatter (black, but there is also yapf) with a maximum line length of 79 to take care of these smaller issues for me
Classes should be in CapWords
The np_
prefix in some variable names is not helpful
Python is not JAVA
Not everything needs to be in a class, and not every class needs to be in a separate file.
pd.set_option('mode.chained_assignment', None)
This is a sign that you are doing something dangerous with views or copies, and data might be lost when you change a subset. It is better to use .loc
then to make sure you get a copy, and not a view
ratio
You have 2 ratio's, so better call the second demographic_ratio
or something
keyword-only arguments
Methods with a lot of arguments can cause confusion, and are called with the wrong order from time to time. To prevent this, use keyword-only arguments if there are a lot, especially if they are of the same type, so the code does not trow an error immediately, but just gives a garbage answer
occupation
df_allhouses["Agent/Empty"]
lists whether a property is occupied. Since this is simply a flag, you can use 0 and 1 of True
and False
instead of "Empty"
or "Agent"
This will simplify a lot of the further processing. There is also no need to make this a column in df_allhouses
.
I would also extract the method to provide this random population to a separate method:
def random_population(size, ratio):
samples = np.zeros(size, dtype=np.bool)
samples[: round(size * ratio)] = 1
return np.random.permutation(samples)
So the houses that are ocupied are defined by occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)
architecture
What geo_schelling_populate
does is provide an empty simulation, that you later populate. You never do anything else with this object, apart from getting the shapefile. A more logic architecture would be a method to read the shapefile, and another method to deliver a populated simulation. No need for the class, and no need for the extra file
This is an interesting talk: Stop writing classes by Jack Diederich
I will convert the example of the populate here, but the same way of working can be done for the other parts. No need for a class, with just an __init__
that populates the object variables, and then 1 action method that uses those object variables. It is better to just pass those as argument to a function
There is still a lot of other stuff to improve, especially on vectorisation, but I don't have time for that at this moment
from pathlib import Path
import geopandas as gp
import numpy as np
import pandas as pd
from shapely.geometry import Point
def _generate_points(polygon, spacing):
"""It returns a DataFrame with all the coordiantes inside a certain
shape passed in as an parameter.
Parameters
----------
polygon : shapely.geometry.Polygon
A polygon object which contains the geometry of a county in
a state.
Returns
-------
A pandas DataFrame with all the coordiantes generated inside the
polygon object.
"""
(minx, miny, maxx, maxy) = polygon.bounds
x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)), spacing)
y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)), spacing)
grid = np.column_stack(
(
np.meshgrid(x_coords, y_coords)[0].flatten(),
np.meshgrid(x_coords, y_coords)[1].flatten(),
)
)
df_points = pd.DataFrame.from_records(grid, columns=["X", "Y"])
df_points = df_points[
df_points[["X", "Y"]].apply(
lambda x: Point(x[0], x[1]).within(polygon), axis=1
)
]
return df_points.round(2)
def random_population(size, ratio):
samples = np.zeros(size, dtype=np.bool)
samples[: round(size * ratio)] = 1
return np.random.permutation(samples)
def populate_simulation(
*,
shape_file: gp.GeoDataFrame,
spacing: float,
empty_ratio: float,
demographic_ratio: float,
races=2
):
"""provides a random populated simulation
...
"""
all_houses = pd.concat(
iter(
shape_file.geometry.apply(lambda x: _generate_points(x, spacing))
),
ignore_index=True,
)
occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)
agent_houses = all_houses.loc[occupied, ["X", "Y"]].reset_index(drop=True)
empty_houses = all_houses.loc[~occupied, ["X", "Y"]].reset_index(drop=True)
agent_houses["Race"] = random_population(
len(agent_houses), demographic_ratio
)
# +1 if you need it to be 1 and 2
return empty_houses, agent_houses
if __name__ == "__main__":
shapefilename = Path(r"../../data/CA_Counties_TIGER.shp")
shape_file = gp.read_file(shapefilename) # no need
spacing = 0.05
empty_ratio = 0.2
similarity_threshhold = 0.65
n_iterations = 100
demographic_ratio = 0.41
empty_houses, agent_houses = populate_simulation(
shape_file=shape_file,
spacing=spacing,
empty_ratio=empty_ratio,
demographic_ratio=demographic_ratio,
races=2,
)
...
Part 2:
random seed
When simulating, always give the possibility to give a random seed, so you can repeat the same simulation to verify certain results.
Geopandas
You do a lot of distance calculations, and checks whether coordinates are withing Polygons semi-manually. It is a lot cleaner if you can let GeoPandas
do that for you. If you keep the coordinates as Point
s, instead of x and y columns:
def generate_points(polygon, spacing):
(minx, miny, maxx, maxy) = polygon.bounds
x_coords = np.arange(np.floor(minx), (np.ceil(maxx)), spacing)
y_coords = np.arange(np.floor(miny), (np.ceil(maxy)), spacing)
grid_x, grid_y = map(np.ndarray.flatten, np.meshgrid(x_coords, y_coords))
grid = gp.GeoSeries([Point(x, y) for x, y in zip(grid_x, grid_y)])
return grid[grid.intersects(polygon)].reset_index(drop=True)
To get the houses in Los Angeles County:
generate_points(shape_file.loc[5, "geometry"], .05).plot()
grid creation
Best would be to extract the grid creation. This way you could in the future reuse the raw grid, with different simulation characteristics.
def create_grid(counties: gp.GeoSeries, spacing: float, random_seed=None):
return gp.GeoDataFrame(
pd.concat(
{
county.NAME: generate_points(county.geometry, spacing)
for county in counties.itertuples()
},
names=["county"],
)
.rename("geometry")
.reset_index(level="county")
.reset_index(drop=True)
.astype({"county": "category"}),
geometry="geometry",
)
all_houses = create_grid(counties=shape_file, spacing=spacing, random_seed=0)
def populate_simulation(
*,
all_houses,
empty_ratio: float,
demographic_ratio: float,
races=2,
random_seed=None,
):
"""provides a random populated simulation
...
"""
if random_seed is not None:
np.random.seed(random_seed)
occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)
race = random_population(size=int(occupied.sum()), ratio=demographic_ratio)
agent_houses = gp.GeoDataFrame(
{
"race": race.astype(int),
"county": all_houses.loc[occupied, "county"],
},
geometry=all_houses.loc[occupied, "geometry"].reset_index(drop=True),
)
empty_houses = all_houses[~occupied].reset_index(drop=True)
return empty_houses, agent_houses
empty_houses, agent_houses = populate_simulation( all_houses=all_houses, empty_ratio=.1, demographic_ratio=.3, races=2, random_seed=0, )
The agent_houses
then looks like this:
Race geometry 0 0 POINT (-4 -9) 1 0 POINT (-3 -9) 2 0 POINT (-2 -9) 3 1 POINT (-1 -9) 4 0 POINT (0 -9)
To plot this:
agent_houses.plot(column="race", categorical=True, figsize=(10, 10))
To check the neighbours who live within a perimeter of an agent is simple:
def get_neighbours(agent, agent_houses: gp.GeoDataFrame, spacing):
"""
returns all the agents that live within `perimeter` of the `agent`
The `agent` excluding"""
surroundings = agent.geometry.buffer(spacing * 1.5)
return agent_houses.loc[
agent_houses.intersects(surroundings)
& (agent_houses != agent).any(axis=1)
]
this can be tested:
agent = agent_houses.loc[4]
neighbours = get_neighbours(agent, agent_houses, radius=0.05 * 5)
neighbours
race county geometry 5 0 Sierra POINT (-120.4500000000001 39.44999999999997) 17 1 Sierra POINT (-120.5500000000001 39.49999999999997) 18 0 Sierra POINT (-120.5000000000001 39.49999999999997) 19 0 Sierra POINT (-120.4500000000001 39.49999999999997) 12321 0 Nevada POINT (-120.5500000000001 39.39999999999998) 12322 0 Nevada POINT (-120.5000000000001 39.39999999999998) 12323 0 Nevada POINT (-120.4500000000001 39.39999999999998) 12334 0 Nevada POINT (-120.5500000000001 39.44999999999997)
This is rather slow (500ms for a search among 15510 occupied houses)
Of you add x and y columns to a copy of the original:
agent_houses_b = agent_houses.assign(x=agent_houses.geometry.x, y=agent_houses.geometry.y)
and then use these column:
def get_neighbours3(agent, agent_houses: gp.GeoDataFrame, spacing):
"""
returns all the agents that live within `perimeter` of the `agent`
The `agent` excluding"""
close_x = (agent_houses.x - agent.geometry.x).abs() < spacing * 1.1
close_y = (agent_houses.y - agent.geometry.y).abs() < spacing * 1.1
return agent_houses.loc[
(agent_houses.index != agent.name) # skip the original agent
& (close_x)
& (close_y)
]
get_neighbours3(agent, agent_houses_b, spacing)
returns in about 3ms
This will possibly go even faster of you do it per county, and use DataFrame.groupby.transform
if the people on the border of one county don't count as neighbours for people of a neighbouring county
To find out who is satisfied:
satisfied_agents = pd.Series(
{
id_: is_satisfied(
agent=agent,
agent_houses=agent_houses_b,
spacing=spacing,
similarity_threshold=similarity_threshold,
)
for id_, agent in agent_houses.iterrows()
},
name="satisfied",
)
value_counts
Checking whether an agent is satisfied becomes simple, just using pd.Series.value_counts
def is_satisfied(*, agent, agent_houses, spacing, similarity_threshold):
neighbours = get_neighbours3(agent, agent_houses, spacing=spacing)
if neighbours.empty:
return False
group_counts = neighbours["race"].value_counts()
return group_counts.get(agent["race"], 0) / len(neighbours) < similarity_threshold
To get the count per race:
agent_houses.groupby(["race"])["geometry"].count().rename("count")
To get the count in a county:
agent_houses.groupby(["county", "race"])["geometry"].count().rename("count")
update
One iteration in the update can then be described as:
def update(agent_houses, empty_houses, spacing, similarity_threshold):
agent_houses_b = agent_houses.assign(
x=agent_houses.geometry.x, y=agent_houses.geometry.y
)
satisfied_agents = pd.Series(
{
id_: is_satisfied(
agent=agent,
agent_houses=agent_houses_b,
spacing=spacing,
similarity_threshold=similarity_threshold,
)
for id_, agent in agent_houses.iterrows()
},
name="satisfied",
)
open_houses = pd.concat(
(
agent_houses.loc[~satisfied_agents, ["county", "geometry"]],
empty_houses,
),
ignore_index=True,
)
new_picks = np.random.choice(
open_houses.index, size=(~satisfied_agents).sum(), replace=False
)
new_agent_houses = agent_houses.copy()
new_agent_houses.loc[
~satisfied_agents, ["county", "geometry"]
] = open_houses.loc[new_picks]
new_empty_houses = open_houses.drop(new_picks).reset_index(drop=True)
return new_empty_houses, new_agent_houses
This redistributes all the empty houses and the houses of the people unsatisfied, simulating a instantaneous move of all the unsatisfied people
part 3
spatial datastructures.
There are some datastructures which are explicitly meant for spatial data, and finding nearest neighbours.
a scipy.spatial.KDTree (or implemented in cython cKDTree) is specifically meant for stuff like this, and will speed up searches a lot when going to large grid.
from scipy.spatial import cKDTree
grid_x = np.array([grid["geometry"].x, grid["geometry"].y, ]).T
tree = cKDTree(grid_xy)
To query:
tree.query_ball_point((-120.4, 35.7), spacing * 1.5)
This query only takes 70μs for those 17233 grid points, which is 30 times faster than get_neighbours3
.
You can even look for all neighbour pairs with tree.query_pairs(spacing * 1.5)
. This takes about as much time as 1 neighbour lookup in neighbours3
This means you can prepopulate a dict with all neighbours:
all_neighbours = defaultdict(list)
for i, j in tree.query_pairs(spacing * 1.5):
all_neighbours[i].append(j)
all_neighbours[j].append(i)
If you now keep the information on occupation and race in 2 separate numpy arrays, you can quickly look for all satisfied people:
occupied = random_population(size=len(grid), ratio=1 - empty_ratio)
race = random_population(size=int(occupied.sum()), ratio=demographic_ratio)
def is_satisfied2(agent, *, all_neighbours, occupied, race, similarity_index):
if not occupied[agent] or agent not in all_neighbours:
return False
neighbours = all_neighbours[agent]
neighbours_occupied = occupied[neighbours].sum()
neighbours_same_race = (
occupied[neighbours] & (race[neighbours] == race[agent])
).sum()
return (neighbours_same_race / neighbours_occupied) > similarity_index
and all the satisfied people:
satisfied_agents = np.array(
[
is_satisfied2(
agent,
all_neighbours=all_neighbours,
occupied=occupied,
race=race,
similarity_index=similarity_index,
)
for agent in np.arange(len(grid))
]
)
The people who want to move:
on_the_move = ~satisfied_agents & occupied
And the houses that are free is either free_houses = ~satisfied_agents
or free_houses = ~occupied
depending on your definition.
So update becomes as simple as:
def update(*, occupied, race, all_neighbours, similarity_index):
satisfied_agents = np.array(
[
is_satisfied2(
agent,
all_neighbours=all_neighbours,
occupied=occupied,
race=race,
similarity_index=similarity_index,
)
for agent in np.arange(len(grid))
]
)
on_the_move = ~satisfied_agents & occupied
free_houses = ~satisfied_agents
# or
# free_houses = ~occupied
assert len(on_the_move) <= len(free_houses) # they need a place to go
new_houses = np.random.choice(
free_houses, size=len(on_the_move), replace=False
)
new_occupied = occupied[:]
new_occupied[on_the_move] = False
new_occupied[new_houses] = True
new_race = race[:]
new_race[new_houses] = race[on_the_move]
return new_occupied, new_race
-
\$\begingroup\$ Thank you for this. I really appreciate your answer. But
np.random.binomial
doesn't give me the exact number of 1s and 2s as I really need that \$\endgroup\$Kartikeya Sharma– Kartikeya Sharma2019年09月17日 18:04:00 +00:00Commented Sep 17, 2019 at 18:04 -
\$\begingroup\$ You are correct. I corrected my code \$\endgroup\$Maarten Fabré– Maarten Fabré2019年09月18日 10:14:45 +00:00Commented Sep 18, 2019 at 10:14
-
\$\begingroup\$ Thanks a lot for giving me all the different ideas here. I will definitely implement them in my code and hopefully make it run faster and also make it cleaner but I think that one thing you didn't understand about the simulation is that the update method can't work the way you are trying it to work. When an agent is unsatisfied, it needed to be moved to an randomly chosen empty space right away because lets say if we get 5000 unsatisfied agents and only have 2000 empty houses, where would be the other 3000 agents would go. An empty house need to be available all the time. \$\endgroup\$Kartikeya Sharma– Kartikeya Sharma2019年09月27日 20:51:30 +00:00Commented Sep 27, 2019 at 20:51
-
\$\begingroup\$ Which is why I decided the free houses as the ones not occupied by satisfied people, so a discontent person can move to a horse of another discontent \$\endgroup\$Maarten Fabré– Maarten Fabré2019年09月28日 18:27:35 +00:00Commented Sep 28, 2019 at 18:27
-
\$\begingroup\$ Also in your function populate_simulation, its not really populating Santa Barbara, LA and Ventura mainland \$\endgroup\$Kartikeya Sharma– Kartikeya Sharma2019年09月30日 05:45:05 +00:00Commented Sep 30, 2019 at 5:45
Documentation
The amount of documentation you've written is ambitious, but its arrangement is slightly unhelpful for a few reasons.
When documenting the "parameters to a class", you're really documenting the parameters to __init__
. As such, this block:
"""
Parameters
----------
shapefile : str
It is a string pointing to a geospatial vector data file which
has information for an american state's geometry data.
spacing : float
It defines the distance between the points in coordinate units.
empty_ratio : float
What percent of houses need to be kept empty.
ratio : float
What is the real ratio between the majority and minority in
reality in a given state.
ratio : int
Number of races. Currently the model is tested on 2 races, but
maybe in future support fot more races will be added.
"""
should be moved to the docstring of __init__
.
Attributes generally aren't documented at all because they should mostly be private (prefixed with an underscore) and discouraged from public access except through functions. Regardless, such documentation should probably go into comments against the actual initialization of members in __init__
.
Move the documentation for "Methods" to docstrings on each method.
Typo
coordiantes
= coordinates
Modern IDEs have spell checking support, such as PyCharm.
Indentation
Perhaps moreso than in any other language, clear indentation in Python is critical. This:
self.df_county_data['MinorityPopPercent'] = \
self.df_county_data[['TotalPop', 'MinorityPop'
]].apply(lambda x: (0 if x['TotalPop']
== 0 else x['MinorityPop'] / x['TotalPop']), axis=1)
could be better represented like so:
self.df_county_data['MinorityPopPercent'] = (
self.df_county_data[
['TotalPop', 'MinorityPop']
].apply(
lambda x: (
0 if x['TotalPop'] == 0 else x['MinorityPop'] / x['TotalPop']
),
axis=1
)
)
That being said, this is somewhat over-extending the usefulness of a lambda; you're probably better off just writing a function to pass to apply
.
Explore related questions
See similar questions with these tags.
.shp
. There are some associated files which are needed gis.stackexchange.com/a/262509 \$\endgroup\$