This simple script computes some basic descriptive statistics (mean, standard deviation, kurtosis, etc.) on a data column imported from a CSV file using pandas. In addition, the script accepts the argument --exclude-zeros and computes the same statistics with zeros excluded. The script delivers the desired results. However, as I come from an R background, I would be happy to receive feedback on a proper / Pythonic way of generating them.
Data
The data pertains to geographic area sizes of neighbourhood geographies for Scotland and is publicly available. This and other similar data sets can be sourced from the Scottish Government open data portal.
#!/Users/me/path/path/path/bin/python
"""DZ Area check
The script sources the previously used area size file and produces
some descriptive statistics. The script additionally computes statistics
excluding zeros.
"""
# Modules
# Refresh requirements creation:
# $ pipreqs --force ~/where/this/stuff/sits/
import os
import argparse
import pandas as pd
from tabulate import tabulate
import numpy as np
# Main function running the program
def main(csv_data, exclude):
"""Computer the desired area statisics"""
data = pd.read_csv(
filepath_or_buffer=csv_data,
skiprows=7,
encoding='utf-8',
header=None,
names=['datazone', 'usual_residenrs', 'area_hectares'])
print('\nSourced table:\r')
print(tabulate(data.head(), headers='keys', tablefmt='psql'))
# Replace zero if required
if exclude:
data = data.replace(0, np.nan)
# Compute statistics
area_mean = data.loc[:, "area_hectares"].mean()
area_max = data.loc[:, "area_hectares"].max()
area_min = data.loc[:, "area_hectares"].min()
area_total = data.loc[:, "area_hectares"].sum()
obs_count = data.loc[:, "area_hectares"].count()
obs_dist = data.loc[:, "area_hectares"].nunique(
) # Count distinct observations
area_variance = data.loc[:, "area_hectares"].var()
area_median = data.loc[:, "area_hectares"].median()
area_std = data.loc[:, "area_hectares"].std()
area_skw = data.loc[:, "area_hectares"].skew()
area_kurt = data.loc[:, "area_hectares"].kurtosis()
# Create results object
results = {
'Statistic': [
'Average', 'Max', 'Min', 'Total', 'Count', 'Count (distinct)',
'Variance', 'Median', 'SD', 'Skewness', 'Kurtosis'
],
'Value': [
area_mean, area_max, area_min, area_total, obs_count, obs_dist,
area_variance, area_median, area_std, area_skw, area_kurt
]
}
# Show results object
print('\nArea statistics:\r')
print(
tabulate(
results,
headers='keys',
tablefmt='psql',
numalign='left',
floatfmt='.2f'))
return results
# Import arguments. Solves running program as a module and as a standalone
# file.
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Calculate basic geography statistics.',
epilog='Data Zone Area Statistics\rKonrad')
parser.add_argument(
'-i',
'--infile',
type=argparse.FileType('r'),
help='Path to data file with geography statistics.',
default=os.path.join('/Users', 'me', 'folder', 'data', 'folder',
'import_folder', 'stuff.csv'))
parser.add_argument(
'--exclude-zeros',
dest='exclude_zeros',
action='store_true',
default=False)
args = parser.parse_args()
# Call the main function and compute the stats
main(csv_data=args.infile, exclude=args.exclude_zeros)
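A note on the zero-exclusion step: pandas reductions such as mean() and count() skip NaN by default, so replacing zeros with NaN is enough to drop them from every statistic reported here. A minimal, self-contained sketch with made-up numbers (not the real data):

import numpy as np
import pandas as pd

# Hypothetical hectare values; the zeros stand in for zero-area rows.
areas = pd.Series([438.88, 30.77, 0.0, 20.08, 0.0])

print(areas.mean(), areas.count())      # zeros included: 97.946 5
cleaned = areas.replace(0, np.nan)      # same trick as in main()
print(cleaned.mean(), cleaned.count())  # zeros excluded: ~163.24 3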
Results
Sourced table:
+----+------------+-------------------+-----------------+
| | datazone | usual_residenrs | area_hectares |
|----+------------+-------------------+-----------------|
| 0 | S01000001 | 872 | 438.88 |
| 1 | S01000002 | 678 | 30.77 |
| 2 | S01000003 | 788 | 13.36 |
| 3 | S01000004 | 612 | 20.08 |
| 4 | S01000005 | 643 | 27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic | Value |
|------------------+-------------|
| Average | 1198.11 |
| Max | 116251.04 |
| Min | 0.00 |
| Total | 7793711.31 |
| Count | 6505.00 |
| Count (distinct) | 4200.00 |
| Variance | 35231279.23 |
| Median | 22.00 |
| SD | 5935.59 |
| Skewness | 9.77 |
| Kurtosis | 121.59 |
+------------------+-------------+
Results (excluding zeros)
Sourced table:
+----+------------+-------------------+-----------------+
| | datazone | usual_residenrs | area_hectares |
|----+------------+-------------------+-----------------|
| 0 | S01000001 | 872 | 438.88 |
| 1 | S01000002 | 678 | 30.77 |
| 2 | S01000003 | 788 | 13.36 |
| 3 | S01000004 | 612 | 20.08 |
| 4 | S01000005 | 643 | 27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic | Value |
|------------------+-------------|
| Average | 1199.03 |
| Max | 116251.04 |
| Min | 1.24 |
| Total | 7793711.31 |
| Count | 6500.00 |
| Count (distinct) | 4199.00 |
| Variance | 35257279.16 |
| Median | 22.01 |
| SD | 5937.78 |
| Skewness | 9.77 |
| Kurtosis | 121.49 |
+------------------+-------------+
1 Answer
Suppose that we wanted to add a new statistic, what would we have to do? Well, we'd need to make three changes:
1. Compute the statistic and put its value in a new variable:

   new_statistic = data.loc[:, "area_hectares"].new_statistic()

2. Add the name of the new statistic to results['Statistic'].
3. Add the new variable to results['Value'].
But when we do 2 and 3, there's a risk that we might put the name and value in different positions in the lists, causing the tabulated output to be wrong.
To avoid this risk, we'd like to have a single place to put the information about the new statistic. There are two things to know about a statistic: its name, and which function to call to compute it. So I would make a global table of statistics, like this:
# List of statistics to compute, as pairs (statistic name, method name).
STATISTICS = [
('Average', 'mean'),
('Max', 'max'),
('Min', 'min'),
('Total', 'sum'),
('Count', 'count'),
('Count (distinct)', 'nunique'),
('Variance', 'var'),
('Median', 'median'),
('SD', 'std'),
('Skewness', 'skew'),
('Kurtosis', 'kurtosis'),
]
And then it's easy to build the results dictionary by iterating over the global table and using operator.methodcaller:
from operator import methodcaller
column = data.loc[:, "area_hectares"]
results = {
'Statistic': [name for name, _ in STATISTICS],
'Value': [methodcaller(method)(column) for _, method in STATISTICS],
}
Now if we need to add a new statistic, we only need to make one change (adding a line to the STATISTICS
list), and there's no risk of putting the name and value in different positions.
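For instance, pandas Series also provide a sem() method (standard error of the mean), so a hypothetical extension of the report is a single new pair in STATISTICS. A minimal sketch with made-up numbers:

from operator import methodcaller

import pandas as pd

STATISTICS = [
    ('Average', 'mean'),
    ('SD', 'std'),
    ('Std. error', 'sem'),  # the only change needed for the new statistic
]

column = pd.Series([438.88, 30.77, 13.36, 20.08, 27.02])  # made-up sample
results = {
    'Statistic': [name for name, _ in STATISTICS],
    'Value': [methodcaller(method)(column) for _, method in STATISTICS],
}
print(results)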
Comment (Konrad, Oct 5, 2018 at 6:53): Thanks very much for your answer. I like your approach to computing additional statistics. As a matter of fact, this is what I would be looking to do in R using match.fun or get, as in match.fun("mean"), to compute the mean.
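For readers coming from R: the closest built-in Python analogue of match.fun("mean") or get("mean") is getattr, which is essentially what methodcaller wraps. A minimal sketch with toy data:

import pandas as pd

column = pd.Series([1.0, 2.0, 3.0])  # toy data
stat = getattr(column, 'mean')()     # like match.fun("mean")(column) in R
print(stat)                          # 2.0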