This simple script computes some basic descriptive statistics (mean, standard deviation, kurtosis, etc.) on a data column imported from a CSV file using pandas. In addition, the script accepts the argument --exclude-zeros and computes the same statistics with zeros excluded. The script delivers the desired results. However, as I come from an R background, I would be happy to receive feedback on a proper / Pythonic way of generating them.
Data
The data pertains to geographic area sizes of neighbourhood geographies for Scotland and is publicly available. This and other similar data sets can be sourced from the Scottish Government open data portal.
#!/Users/me/path/path/path/bin/python
"""DZ Area check
The script sources the previously used area size file and produces
some descriptive statistics. The script additionally computes statistics
excluding zeros.
"""
# Modules
# Refresh requirements creation:
# $ pipreqs --force ~/where/this/stuff/sits/
import os
import argparse
import pandas as pd
from tabulate import tabulate
import numpy as np
# Main function running the program
def main(csv_data, exclude):
"""Computer the desired area statisics"""
data = pd.read_csv(
filepath_or_buffer=csv_data,
skiprows=7,
encoding='utf-8',
header=None,
names=['datazone', 'usual_residenrs', 'area_hectares'])
print('\nSourced table:\r')
print(tabulate(data.head(), headers='keys', tablefmt='psql'))
# Replace zero if required
if exclude:
data = data.replace(0, np.nan)
# Compute statistics
area_mean = data.loc[:, "area_hectares"].mean()
area_max = data.loc[:, "area_hectares"].max()
area_min = data.loc[:, "area_hectares"].min()
area_total = data.loc[:, "area_hectares"].sum()
obs_count = data.loc[:, "area_hectares"].count()
obs_dist = data.loc[:, "area_hectares"].nunique(
) # Count distinct observations
area_variance = data.loc[:, "area_hectares"].var()
area_median = data.loc[:, "area_hectares"].median()
area_std = data.loc[:, "area_hectares"].std()
area_skw = data.loc[:, "area_hectares"].skew()
area_kurt = data.loc[:, "area_hectares"].kurtosis()
# Create results object
results = {
'Statistic': [
'Average', 'Max', 'Min', 'Total', 'Count', 'Count (distinct)',
'Variance', 'Median', 'SD', 'Skewness', 'Kurtosis'
],
'Value': [
area_mean, area_max, area_min, area_total, obs_count, obs_dist,
area_variance, area_median, area_std, area_skw, area_kurt
]
}
# Show results object
print('\nArea statistics:\r')
print(
tabulate(
results,
headers='keys',
tablefmt='psql',
numalign='left',
floatfmt='.2f'))
return results
# Import arguments. Solves running program as a module and as a standalone
# file.
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Calculate basic geography statistics.',
epilog='Data Zone Area Statistics\rKonrad')
parser.add_argument(
'-i',
'--infile',
type=argparse.FileType('r'),
help='Path to data file with geography statistics.',
default=os.path.join('/Users', 'me', 'folder', 'data', 'folder',
'import_folder', 'stuff.csv'))
parser.add_argument(
'--exclude-zeros',
dest='exclude_zeros',
action='store_true',
default=False)
args = parser.parse_args()
# Call the main function and compute the stats
main(csv_data=args.infile, exclude=args.exclude_zeros)
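A note on the zero-exclusion step: pandas reductions such as mean() and count() skip NaN by default, so replacing zeros with NaN is enough to drop them from every statistic reported here. A minimal, self-contained sketch with made-up numbers (not the real data):

import numpy as np
import pandas as pd

# Hypothetical hectare values; the zeros stand in for zero-area rows.
areas = pd.Series([438.88, 30.77, 0.0, 20.08, 0.0])

print(areas.mean(), areas.count())      # zeros included: 97.946 5
cleaned = areas.replace(0, np.nan)      # same trick as in main()
print(cleaned.mean(), cleaned.count())  # zeros excluded: ~163.24 3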
Results
Sourced table:
+----+------------+-------------------+-----------------+
| | datazone | usual_residenrs | area_hectares |
|----+------------+-------------------+-----------------|
| 0 | S01000001 | 872 | 438.88 |
| 1 | S01000002 | 678 | 30.77 |
| 2 | S01000003 | 788 | 13.36 |
| 3 | S01000004 | 612 | 20.08 |
| 4 | S01000005 | 643 | 27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic | Value |
|------------------+-------------|
| Average | 1198.11 |
| Max | 116251.04 |
| Min | 0.00 |
| Total | 7793711.31 |
| Count | 6505.00 |
| Count (distinct) | 4200.00 |
| Variance | 35231279.23 |
| Median | 22.00 |
| SD | 5935.59 |
| Skewness | 9.77 |
| Kurtosis | 121.59 |
+------------------+-------------+
Results (excluding zeros)
Sourced table:
+----+------------+-------------------+-----------------+
| | datazone | usual_residenrs | area_hectares |
|----+------------+-------------------+-----------------|
| 0 | S01000001 | 872 | 438.88 |
| 1 | S01000002 | 678 | 30.77 |
| 2 | S01000003 | 788 | 13.36 |
| 3 | S01000004 | 612 | 20.08 |
| 4 | S01000005 | 643 | 27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic | Value |
|------------------+-------------|
| Average | 1199.03 |
| Max | 116251.04 |
| Min | 1.24 |
| Total | 7793711.31 |
| Count | 6500.00 |
| Count (distinct) | 4199.00 |
| Variance | 35257279.16 |
| Median | 22.01 |
| SD | 5937.78 |
| Skewness | 9.77 |
| Kurtosis | 121.49 |
+------------------+-------------+
1 Answer
Suppose that we wanted to add a new statistic, what would we have to do? Well, we'd need to make three changes:
1. Compute the statistic and put its value in a new variable:

   new_statistic = data.loc[:, "area_hectares"].new_statistic()

2. Add the name of the new statistic to results['Statistic'].
3. Add the new variable to results['Value'].
But when we do 2 and 3, there's a risk that we might put the name and value in different positions in the lists, causing the tabulated output to be wrong.
To avoid this risk, we'd like to have a single place to put the information about the new statistic. There are two things to know about a statistic: its name, and which function to call to compute it. So I would make a global table of statistics, like this:
# List of statistics to compute, as pairs (statistic name, method name).
STATISTICS = [
('Average', 'mean'),
('Max', 'max'),
('Min', 'min'),
('Total', 'sum'),
('Count', 'count'),
('Count (distinct)', 'nunique'),
('Variance', 'var'),
('Median', 'median'),
('SD', 'std'),
('Skewness', 'skew'),
('Kurtosis', 'kurtosis'),
]
And then it's easy to build the results dictionary by iterating over the global table and using operator.methodcaller:
from operator import methodcaller
column = data.loc[:, "area_hectares"]
results = {
'Statistic': [name for name, _ in STATISTICS],
'Value': [methodcaller(method)(column) for _, method in STATISTICS],
}
Now if we need to add a new statistic, we only need to make one change (adding a line to the STATISTICS
list), and there's no risk of putting the name and value in different positions.
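For instance, pandas Series also provide a sem() method (standard error of the mean), so a hypothetical extension of the report is a single new pair in STATISTICS. A minimal sketch with made-up numbers:

from operator import methodcaller

import pandas as pd

STATISTICS = [
    ('Average', 'mean'),
    ('SD', 'std'),
    ('Std. error', 'sem'),  # the only change needed for the new statistic
]

column = pd.Series([438.88, 30.77, 13.36, 20.08, 27.02])  # made-up sample
results = {
    'Statistic': [name for name, _ in STATISTICS],
    'Value': [methodcaller(method)(column) for _, method in STATISTICS],
}
print(results)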
Comment (Konrad, Oct 5, 2018 at 6:53): Thanks very much for your answer. I like your approach to computing additional statistics. As a matter of fact, this is what I would be looking to do in R using match.fun or get, as in match.fun("mean"), to compute the mean.
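For readers coming from R: the closest built-in Python analogue of match.fun("mean") or get("mean") is getattr, which is essentially what methodcaller wraps. A minimal sketch with toy data:

import pandas as pd

column = pd.Series([1.0, 2.0, 3.0])  # toy data
stat = getattr(column, 'mean')()     # like match.fun("mean")(column) in R
print(stat)                          # 2.0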