Commit 8ddde7f

Simple ETL pipeline using pandas

1 parent 9a7d419 commit 8ddde7f

Lines changed: 31 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,31 @@`
	`1`	`+import numpy as np`
	`2`	`+import pandas as pd`
	`3`	`+df = pd.DataFrame({`
	`4`	`+    "id": [100, 100, 101, 102, 103, 104, 105, 106],`
	`5`	`+    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],`
	`6`	`+    "B": [45, 56, 48, 47, 62, 112, 54, 49],`
	`7`	`+    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]`
	`8`	`+})`
	`9`	`+df`
	`10`	`+def fill_missing_values(df):`
	`11`	`+    for col in df.select_dtypes(include="number").columns:`
	`12`	`+        val = df[col].mean()`
	`13`	`+        df[col] = df[col].fillna(val)  # assign back instead of chained inplace fillna`
	`14`	`+    return df`
	`15`	`+def drop_duplicates(df, column_name):`
	`16`	`+    df = df.drop_duplicates(subset=column_name)`
	`17`	`+    return df`
	`18`	`+def remove_outliers(df, column_list):`
	`19`	`+    for col in column_list:`
	`20`	`+        avg = df[col].mean()`
	`21`	`+        std = df[col].std()`
	`22`	`+        low = avg - 2 * std`
	`23`	`+        high = avg + 2 * std`
	`24`	`+        df = df[df[col].between(low, high, inclusive="both")]  # boolean inclusive was removed in pandas 2.0`
	`25`	`+    return df`
	`26`	`+`
	`27`	`+df_processed = (df.`
	`28`	`+    pipe(fill_missing_values).`
	`29`	`+    pipe(drop_duplicates, "id").`
	`30`	`+    pipe(remove_outliers, ["A","B"]))`
	`31`	`+print(df_processed)`
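
For reference, here is a minimal sketch of the same three steps, written so that each function returns a new frame rather than modifying its input. The function names and sample data follow the diff above. The `.copy()` call, the `raw` variable name, and the `n_std` parameter are assumptions added for illustration and are not part of the commit.

```python
import numpy as np
import pandas as pd


def fill_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs in numeric columns replaced by the column mean."""
    out = df.copy()  # assumption: leave the caller's frame untouched
    for col in out.select_dtypes(include="number").columns:
        out[col] = out[col].fillna(out[col].mean())
    return out


def drop_duplicates(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Keep the first row for each distinct value in column_name."""
    return df.drop_duplicates(subset=column_name)


def remove_outliers(df: pd.DataFrame, column_list: list, n_std: float = 2.0) -> pd.DataFrame:
    """Keep rows within n_std standard deviations of the mean, column by column.

    n_std is a hypothetical parameter; the commit hard-codes 2.
    """
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - n_std * std
        high = avg + n_std * std
        # The mean and std are recomputed on the already-filtered frame
        # for each subsequent column, as in the original function.
        df = df[df[col].between(low, high, inclusive="both")]
    return df


# Same sample data as in the commit.
raw = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5],
})

df_processed = (
    raw
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)
print(df_processed)
```

With this sample data, the second `id` 100 row and the row with `B` equal to 112 are dropped, leaving six rows, while `raw` keeps its original NaN values.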
