\$\begingroup\$

This simple script computes some basic descriptive statistics, such as the mean, standard deviation and kurtosis, on a data column imported from a CSV file using pandas. In addition, the script accepts the argument --exclude-zeros and computes the desired statistics excluding zeros. The script delivers the desired results. However, as I come from an R background, I would be happy to receive feedback on a proper / Pythonic way of generating the desired results.

Data

The data pertains to geographic area sizes of neighbourhood geographies for Scotland and is publicly available. This and other similar data sets can be sourced from the Scottish Government open data portal.

#!/Users/me/path/path/path/bin/python
"""DZ Area check
The script sources a previously used area size file and produces
some descriptive statistics. The script additionally computes statistics
excluding zeros.
"""
# Modules
# Refresh requirements creation:
# $ pipreqs --force ~/where/this/stuff/sits/
import os
import argparse
import pandas as pd
from tabulate import tabulate
import numpy as np
# Main function running the program
def main(csv_data, exclude):
    """Compute the desired area statistics"""
    data = pd.read_csv(
        filepath_or_buffer=csv_data,
        skiprows=7,
        encoding='utf-8',
        header=None,
        names=['datazone', 'usual_residenrs', 'area_hectares'])
    print('\nSourced table:\r')
    print(tabulate(data.head(), headers='keys', tablefmt='psql'))
    # Replace zero if required
    if exclude:
        data = data.replace(0, np.nan)
    # Compute statistics
    area_mean = data.loc[:, "area_hectares"].mean()
    area_max = data.loc[:, "area_hectares"].max()
    area_min = data.loc[:, "area_hectares"].min()
    area_total = data.loc[:, "area_hectares"].sum()
    obs_count = data.loc[:, "area_hectares"].count()
    obs_dist = data.loc[:, "area_hectares"].nunique(
    )  # Count distinct observations
    area_variance = data.loc[:, "area_hectares"].var()
    area_median = data.loc[:, "area_hectares"].median()
    area_std = data.loc[:, "area_hectares"].std()
    area_skw = data.loc[:, "area_hectares"].skew()
    area_kurt = data.loc[:, "area_hectares"].kurtosis()
    # Create results object
    results = {
        'Statistic': [
            'Average', 'Max', 'Min', 'Total', 'Count', 'Count (distinct)',
            'Variance', 'Median', 'SD', 'Skewness', 'Kurtosis'
        ],
        'Value': [
            area_mean, area_max, area_min, area_total, obs_count, obs_dist,
            area_variance, area_median, area_std, area_skw, area_kurt
        ]
    }
    # Show results object
    print('\nArea statistics:\r')
    print(
        tabulate(
            results,
            headers='keys',
            tablefmt='psql',
            numalign='left',
            floatfmt='.2f'))
    return (results)
# Import arguments. Solves running program as a module and as a standalone
# file.
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Calculate basic geography statistics.',
        epilog='Data Zone Area Statistics\rKonrad')
    parser.add_argument(
        '-i',
        '--infile',
        nargs=1,
        type=argparse.FileType('r'),
        help='Path to data file with geography statistics.',
        default=os.path.join('/Users', 'me', 'folder', 'data', 'folder',
                             'import_folder', 'stuff.csv'))
    parser.add_argument(
        '--exclude-zeros',
        dest='exclude_zeros',
        action='store_true',
        default=False)
    args = parser.parse_args()
    # Call main function and compute stats
    main(csv_data=args.infile, exclude=args.exclude_zeros)

Results

Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1198.11     |
| Max              | 116251.04   |
| Min              | 0.00        |
| Total            | 7793711.31  |
| Count            | 6505.00     |
| Count (distinct) | 4200.00     |
| Variance         | 35231279.23 |
| Median           | 22.00       |
| SD               | 5935.59     |
| Skewness         | 9.77        |
| Kurtosis         | 121.59      |
+------------------+-------------+

Results (excluding zeros)

Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+
Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1199.03     |
| Max              | 116251.04   |
| Min              | 1.24        |
| Total            | 7793711.31  |
| Count            | 6500.00     |
| Count (distinct) | 4199.00     |
| Variance         | 35257279.16 |
| Median           | 22.01       |
| SD               | 5937.78     |
| Skewness         | 9.77        |
| Kurtosis         | 121.49      |
+------------------+-------------+
asked Sep 30, 2018 at 16:14
\$\endgroup\$

1 Answer

\$\begingroup\$

Suppose that we wanted to add a new statistic, what would we have to do? Well, we'd need to make three changes:

  1. Compute the statistic and put its value in a new variable:

    new_statistic = data.loc[:, "area_hectares"].new_statistic()
    
  2. Add the name of the new statistic to results['Statistic'].

  3. Add the new variable to results['Value'].

But when we do steps 2 and 3, there's a risk that we might put the name and value in different positions in the lists, causing the tabulated output to be wrong.
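
To make that risk concrete, here is a minimal sketch with made-up numbers (the values and the 'Range' statistic are illustrative assumptions, not taken from the data above). The new name is appended to one list while its value is accidentally inserted into the middle of the other, and nothing complains:

```python
# Parallel lists, as in the original results dictionary (illustrative values).
names = ['Average', 'Max', 'Min']
values = [10.0, 25.0, 1.0]

# Add a hypothetical new statistic, 'Range', at mismatched positions.
names.append('Range')    # the name goes at the end
values.insert(1, 24.0)   # the value is accidentally inserted at index 1

# Both lists still have the same length, so a table builds without error,
# but every label after 'Average' is now paired with the wrong number.
mislabelled = dict(zip(names, values))
print(mislabelled)
```

Because the lengths still match, the mistake only shows up if someone reads the table closely.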

To avoid this risk, we'd like to have a single place to put the information about the new statistic. There are two things to know about a statistic: its name, and which function to call to compute it. So I would make a global table of statistics, like this:

# List of statistics to compute, as pairs (statistic name, method name).
STATISTICS = [
    ('Average', 'mean'),
    ('Max', 'max'),
    ('Min', 'min'),
    ('Total', 'sum'),
    ('Count', 'count'),
    ('Count (distinct)', 'nunique'),
    ('Variance', 'var'),
    ('Median', 'median'),
    ('SD', 'std'),
    ('Skewness', 'skew'),
    ('Kurtosis', 'kurtosis'),
]

And then it's easy to build the results dictionary by iterating over the global table and using operator.methodcaller:

from operator import methodcaller
column = data.loc[:, "area_hectares"]
results = {
    'Statistic': [name for name, _ in STATISTICS],
    'Value': [methodcaller(method)(column) for _, method in STATISTICS],
}

Now if we need to add a new statistic, we only need to make one change (adding a line to the STATISTICS list), and there's no risk of putting the name and value in different positions.
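
For reference, here is a self-contained sketch of the whole approach on a small made-up Series. The sample values, the shortened table and the helper name compute_statistics are assumptions for illustration only. It uses getattr, which does the same job as methodcaller when no arguments are passed, and adds one extra statistic (the standard error of the mean, via pandas' Series.sem) with a single new line:

```python
import pandas as pd

# Pairs of (statistic name, pandas Series method name); shortened for brevity.
STATISTICS = [
    ('Average', 'mean'),
    ('Max', 'max'),
    ('Min', 'min'),
    ('Total', 'sum'),
    ('Count', 'count'),
    ('Count (distinct)', 'nunique'),
    ('Median', 'median'),
    ('Standard error', 'sem'),  # the only line needed for the new statistic
]


def compute_statistics(column):
    """Return a dict of columns suitable for tabulate(..., headers='keys')."""
    return {
        'Statistic': [name for name, _ in STATISTICS],
        'Value': [getattr(column, method)() for _, method in STATISTICS],
    }


# Made-up sample data standing in for the area_hectares column.
column = pd.Series([1.0, 2.0, 2.0, 5.0])
results = compute_statistics(column)
for name, value in zip(results['Statistic'], results['Value']):
    print(f'{name}: {value}')
```

Since each name travels with its method in one tuple, the two output lists cannot drift out of step.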

Konrad
answered Oct 4, 2018 at 15:17
\$\endgroup\$
1
  • \$\begingroup\$ Thanks very much for your answer. I like your approach to computing additional statistics. As a matter of fact, this is what I would be looking to do in R using match.fun or get as in match.fun("mean") to compute mean. \$\endgroup\$ Commented Oct 5, 2018 at 6:53
