Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Fast, scalable package for routine attribute standardization tasks.

License

Notifications You must be signed in to change notification settings

beedrumms/attribute-boss

Repository files navigation

AttributeBoss

AttributeBoss Logo

Routine Data Standardization

AttributeBoss performs routine standardization and extraction tasks common to data transformation preprocessing requirements.

This library can:

  • standardize the treatment of missing data (missing_val_handler())
  • clean any str data (str_prep())
  • validate address, postal, and phone data (validate_attribute())
  • standardize address data by removing aligning use of prefix and suffixes and street directions (standardize_address())
  • standardize postal codes by removing common differences in representation (standardize_postal_code())
  • standardize phone numbers by removing common differences in reporesentation (standardize_phone_number())
  • standardize province and state names to abbreviated title (standardize_province_state())
  • extract an address (street number, street name, street prefix / suffix, street directions) from a complex str (extract_address())
  • extract a postal code from a str (extract_postal_code())
  • extract a phone number from a str (extract_phone_number())

see Suggested Steps for Implementation below for proper use

AttributeBoss Forever

Signing off but inevitably back on -B

How to Use

  • This library was built to quickly bring key attributes into alignment (Standardize) for further preprocessing tasks
  • It has Spark and Python functionality.

Spark

from AttributeBoss.SparkTools import SparkBoss ab = SparkBoss()

  • The Spark compatibilities of AttributeBoss use PySpark dataframe column instances to transform data.
  • To initialize:

Python

from AttributeBoss.PyTools import PyBoss ab = PyBoss()

  • The base python functions in AttributeBoss are built around the assumption that Pandas DataFrames are being used, and they are therefore required input types.
  • most of the functions leverage py dict structures to perform replacement tasks given that they are much more efficient than using for loops or step by step tasks
  • with this in mind, the input for each function should be a pd.Series unless otherwise stated
  • the output of each function is a pd.Series, which will make it easy to write it back to your DF
  • ex:
    • standardized_attribute = standardize_address(df['my_attribute'])
    • df['my_attribute] = standardized_attribute

Suggested Steps for Implementation

  • this library was designed to be used in steps and some functions rely on others to work properly
  • It is suggested that a user take the following approach:
    • for address data: missing_val_handler() --> standardize_address() --> extract_address() --> validate_attribute()
      • standardize_address() leverages cln_str() so it is not neccessary to use it
      • extract_address() is looking for a matches based off the standard prefix / suffix / street direction abbreviations implemented via standardize_address
      • validate_attribute() is looking only for NoneType data and returning a list of indices that match None -- in this way, extract_address() is actually validating your address data and returning None where it doesnt find a match.
    • any string: missing_val_handler() --> cln_strs()
      • cln_strs() performs basic cleaning in this order (the order is important): Changes str to uppercase, removes new lines, different types of apostrophies, periods, hyphens, parenthesis, adjusts commas. Replaces all types of french accents with standard letter. Removes all non-ascii characters, and finally remove any spaces more than one and trim any remaining whitespaces.
      • note that cln_strs() should be satisfactory for most categorical data -- however, it is not intended for text based data where puncuation matters!
    • postal, phone, or province data missing_val_handler() --> standardize_xxx() --> extract_xxx() --> validate_attribute()

About

Fast, scalable package for routine attribute standardization tasks.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

AltStyle によって変換されたページ (->オリジナル) /