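🧐Data Preprocessing🧐: missing-value handling, data imputation, data type conversion, data encoding, data normalization, log transformation... Hands-on case studies: Zillow housing price data cleaning, Airbnb data cleaning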


teamowu/Feature-Engineering


Data Types

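Data can be qualitative or quantitative.

  • Qualitative data: descriptive information (it describes something)
    • Categorical

Example: ['It is black', 'It has long fur', 'It is active']

  • Quantitative data: numerical information (Numeric)
    • Discrete data: can only take certain fixed values (e.g. integers)
      • Property: countable
    • Continuous data: can take any value within a range
      • Property: measurable

Example: {'discrete': ['It has 4 legs', 'It has two brothers'], 'continuous': ['It weighs 25.5 kg', 'It is 565 mm tall']}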


Common dtypes in DataSet

1.Numeric

  • Discrete: Count; Rating; Grade
  • Continuous: Revenue; Distance; Home Value

Watch OUT: data range!

2.Binary(Dummy)

Recorded using only 0 and 1.

  • Special case of numeric

Watch OUT: IsMale ; HasHair ; Pass/Fail

3.Categorical

  • Usually contains characters: Gender, Product, Geo, etc.
  • Can consist purely of digits: SSN, Zipcode, Phone Number

Watch OUT: Valid Values
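
Numeric-looking codes such as zip codes are labels rather than quantities. The following is a minimal sketch in plain Python of why they are usually kept as strings, and of what a valid-value check might look like. The sample values and the `VALID_GENDERS` set are hypothetical and not taken from the repository.

```python
# Zip codes look numeric but are labels: casting to int drops leading zeros.
raw_zip = "02134"
as_number = int(raw_zip)   # 2134, the leading zero is lost
as_label = raw_zip         # "02134", kept intact as a string

# Hypothetical set of valid values for a categorical column.
VALID_GENDERS = {"F", "M"}

def invalid_values(values, valid):
    """Return the distinct values that are not in the allowed set, sorted."""
    return sorted(set(values) - set(valid))

genders = ["F", "M", "X", "M", "f"]
bad = invalid_values(genders, VALID_GENDERS)
```

Whether a value like 'f' should be rejected or conformed to 'F' is a per-dataset decision; the check only surfaces the candidates.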

4.Dates and Time

  • Date, Time, Datetime, Timestamp

Watch OUT: Time Zone!
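
To illustrate the time zone warning, here is a small sketch using only Python's standard `datetime` module. The timestamps are made-up sample values. It shows that two different wall-clock times can denote the same instant, and that a timestamp without an offset is ambiguous.

```python
from datetime import datetime, timezone

# Two ISO 8601 timestamps with different UTC offsets (hypothetical samples).
a = datetime.fromisoformat("2024-03-01T09:00:00+08:00")
b = datetime.fromisoformat("2024-03-01T01:00:00+00:00")

# Aware datetimes compare by instant, so these are equal
# even though their wall-clock times differ.
same_instant = (a == b)

# Converting everything to UTC before comparing or grouping
# avoids off-by-hours errors.
a_utc = a.astimezone(timezone.utc)

# A naive timestamp carries no offset; its meaning is ambiguous
# until a time zone is attached.
naive = datetime.fromisoformat("2024-03-01T09:00:00")
is_naive = naive.tzinfo is None
```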

5.Missing

  • Null
    • Absence of everything; missing; empty
  • Blank
    • "" or " " or any invisible character
    • Can mean missing
    • Can mean "N/A"
  • N/A
    • Can mean "not available": e.g. Age
    • Can mean "not applicable": e.g. Middle Name
    • Can mean "no answer": e.g. Customer Satisfaction Rating on a questionnaire
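
One possible way to handle these variants is to map every missing-like marker to a single sentinel before analysis. The sketch below uses plain Python and a hypothetical `MISSING_TOKENS` set; the tokens a real dataset needs will differ.

```python
# Hypothetical list of tokens treated as missing; adjust per dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "unknown"}

def to_missing(value):
    """Return None if value looks like a missing marker, else the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return None
    return value

# Blank strings, whitespace, and "N/A" all collapse to None.
row = {"age": "N/A", "middle_name": " ", "rating": "", "city": "LA"}
cleaned = {key: to_missing(val) for key, val in row.items()}
```

Note that this collapses "not available", "not applicable", and "no answer" into one value. If the distinction matters for modeling, it would have to be recorded separately before this step.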

Data Quality Issues

A dataset may contain records that are invalid or unusable.

  • Incorrect / Invalid Entry
    • age = 203; gender = 'X'; price = -100; weekday = 8
  • Missing Data
    • N/A; Null; " "; Unknown
  • Unstructured Data
    • merged cells; double headers; HTML
  • Conflicting Data
    • revenue = 1000; unit = 0
  • Duplicates
    • double loading; double counting
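
Several of the issues listed above can be detected with simple rule checks. The following sketch uses plain Python with made-up records; the field names and the plausible age range of 0 to 120 are assumptions for illustration only.

```python
# Hypothetical records and validity rules, for illustration only.
records = [
    {"id": 1, "age": 34, "price": 20.0, "weekday": 3, "revenue": 100.0, "units": 5},
    {"id": 2, "age": 203, "price": -100.0, "weekday": 8, "revenue": 1000.0, "units": 0},
    {"id": 1, "age": 34, "price": 20.0, "weekday": 3, "revenue": 100.0, "units": 5},
]

def find_issues(record):
    """Return a list of human-readable problems found in one record."""
    issues = []
    if not 0 <= record["age"] <= 120:      # assumed plausible age range
        issues.append("invalid age")
    if record["price"] < 0:
        issues.append("negative price")
    if not 1 <= record["weekday"] <= 7:
        issues.append("invalid weekday")
    if record["revenue"] > 0 and record["units"] == 0:
        issues.append("revenue without units")   # conflicting data
    return issues

def find_duplicates(rows):
    """Return the indexes of rows that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(index)
        seen.add(key)
    return dupes

issues_by_row = [find_issues(r) for r in records]
duplicate_rows = find_duplicates(records)
```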

Data Preparation Steps

1.Data Access
2.Data Cleansing

  • Integrate: combine various data sources; combine multiple columns
    • Merge sales units, sales revenue, and price into one DataSet
    • Combine year, month, and day into a single date
  • Conform: make inconsistent values consistent
    • Na, n/a => missing
    • Los Angeles, L.A. => LA
  • Filter: remove the columns and rows not needed for modeling
  • Group: group many categorical values into a few buckets
  • Aggregate: aggregate/disaggregate data to the desired dimensions
  • Derive: extract or calculate new metrics from existing metrics
    • Price = Revenue/Units
    • Extract seasonality from sales
    • Regex
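
Three of the cleansing steps above (conform, derive, and combining date parts) can be sketched in a few lines of plain Python. The `CITY_ALIASES` mapping, the helper names, and the sample row are hypothetical.

```python
from datetime import date

# Hypothetical mapping of inconsistent spellings to one canonical value.
CITY_ALIASES = {"los angeles": "LA", "l.a.": "LA", "la": "LA"}

def conform_city(name):
    """Map known spellings of a city to a canonical label; leave others unchanged."""
    return CITY_ALIASES.get(name.strip().lower(), name.strip())

def derive_price(revenue, units):
    """Price = Revenue / Units; None when units is zero, to avoid dividing by zero."""
    return revenue / units if units else None

def combine_date(year, month, day):
    """Combine separate year, month, and day columns into one date value."""
    return date(year, month, day)

row = {"city": "L.A.", "revenue": 1000.0, "units": 40,
       "year": 2024, "month": 3, "day": 1}
clean = {
    "city": conform_city(row["city"]),
    "price": derive_price(row["revenue"], row["units"]),
    "date": combine_date(row["year"], row["month"], row["day"]),
}
```

Returning None for zero units keeps a conflicting record visible as a missing price instead of raising an error partway through a batch.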

3.Handling Missing Data

  • Drop missing values
  • Impute missing values
  • Ignore missing values
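
The three options can be compared on a toy column. This is a sketch in plain Python with made-up values; mean imputation is used here only as one common choice, not as the method the repository prescribes.

```python
from statistics import mean

ages = [25, None, 31, 40, None]   # hypothetical column with missing entries

# Option 1: drop missing values.
dropped = [a for a in ages if a is not None]

# Option 2: impute, here with the mean of the observed values.
fill = mean(dropped)
imputed = [fill if a is None else a for a in ages]

# Option 3: ignore missing values inside a computation
# without altering the stored data.
observed_mean = mean(a for a in ages if a is not None)
```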

4.Identify Outliers
5.Transform Data(data preprocessing)

  • Scaling (making features dimensionless)
    • Normalization
    • Standardization
    • Min-max scaling (MinMaxScaler)
  • Feature binarization
  • One-hot encoding
  • Log transformation
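
The transformations listed above can be written out in plain Python to show what each one computes. This is a sketch under stated assumptions: all function names are hypothetical, "normalization" is taken to mean scaling each sample to unit L2 length, standardization uses the population standard deviation, and the scaling functions assume the input values are not all identical.

```python
import math
from statistics import mean, pstdev

def min_max_scale(xs):
    """Rescale values linearly to [0, 1]; assumes max(xs) > min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def l2_normalize(row):
    """Scale one sample (row) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

def binarize(xs, threshold):
    """Map values above the threshold to 1 and the rest to 0."""
    return [1 if x > threshold else 0 for x in xs]

def one_hot(values):
    """Encode each category as a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def log_transform(xs):
    """Apply log(1 + x), which compresses large non-negative values."""
    return [math.log1p(x) for x in xs]

scaled = min_max_scale([0.0, 5.0, 10.0])
encoded = one_hot(["red", "blue", "red"])   # columns are ["blue", "red"]
```

In practice a library implementation would normally be preferred, since it also remembers the fitted parameters (minimum, maximum, mean, category order) so that the same transformation can be applied to new data.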
