Commit 914f422

finalize the data pipeline

1 parent f07e2bb commit 914f422

Lines changed: 3 additions & 18 deletions

Original file line number	Diff line number	Diff line change
`@@ -23,30 +23,15 @@`
`23`	`23`	`catchup=False`
`24`	`24`	`)`
`25`	`25`
`26`		`-"""`
`27`		`-Approach 1: ETL pipeline`
`28`		`-- Extract raw data`
`29`		`-- Load into local storage/Amazon S3 (Optional)`
`30`		`-- Transform locally with pandas`
`31`		`-- Reload into local storage/Amazon S3`
`32`		`-- Load from S3 to Amazon Redshift`
`33`		`-`
`34`		`-Approach 2: ELT pipeline`
`35`		`-- Extract raw data`
`36`		`-- Load into Amazon S3`
`37`		`-- Transform with AWS Glue`
`38`		`-- Load the transformed data to Amazon Redshift`
`39`		`-"""`
`40`		`-`
`41`	`26`	`# Extract raw data directly from API and store in local/cloud storage`
`42`	`27`	`extract = PythonOperator(`
`43`	`28`	`task_id="extract_contest_ranking",`
`44`	`29`	`python_callable=extract_contest_ranking,`
`45`		`- op_args=[2],`
	`30`	`+ op_args=[4],`
`46`	`31`	`dag=dag`
`47`	`32`	`)`
`48`	`33`
`49`		`-# Approach 1: Transform the data and reload into local/cloud storage`
	`34`	`+# Approach 1: Transform the data locally using pandas and reload into local/cloud storage`
`50`	`35`	`transform = PythonOperator(`
`51`	`36`	`task_id="transform_contest_ranking",`
`52`	`37`	`python_callable=transform_contest_ranking,`
`@@ -55,4 +40,4 @@`
`55`	`40`
`56`	`41`	`extract >> transform`
`57`	`42`
`58`		`-# Approach 2 is done with AWS`
	`43`	`+# Approach 2: Transform the data using AWS Glue and reload into cloud storage`
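Note on the `op_args` change above: `PythonOperator` passes `op_args` positionally to `python_callable`, so moving from `[2]` to `[4]` means the extract task requests four pages instead of two. The following is a minimal plain-Python sketch of that effect, without Airflow. It assumes the extract loop iterates `range(num_pages)` (the diff shows only the `i + 1` indexing, not the loop header), and `fetch_page_stub` and `extract_pages` are hypothetical names standing in for the real API request and extract function.

```python
from typing import Callable, Dict, List


def fetch_page_stub(page: int) -> List[Dict[str, int]]:
    """Hypothetical stand-in for the API request; returns two fake rows per page."""
    return [{"page": page, "rank": (page - 1) * 2 + offset} for offset in (1, 2)]


def extract_pages(
    num_pages: int,
    fetch_page: Callable[[int], List[dict]] = fetch_page_stub,
) -> List[dict]:
    """Collect rows from pages 1..num_pages, mirroring the `i + 1` indexing in the diff."""
    rows: List[dict] = []
    for i in range(num_pages):  # assumed loop shape; not shown in the diff
        rows.extend(fetch_page(i + 1))
    return rows


# PythonOperator calls python_callable(*op_args), so op_args=[4] means num_pages=4.
op_args = [4]
rows = extract_pages(*op_args)
pages_requested = sorted({row["page"] for row in rows})
```

Injecting the fetch function keeps the sketch runnable offline; the real task would issue the `requests.post` call shown in the second file instead.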

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -14,7 +14,7 @@ def extract_contest_ranking(num_pages):`
`14`	`14`	`response = requests.post(URL, json=contest_ranking_query(i + 1)).json()["data"]["globalRanking"]["rankingNodes"]`
`15`	`15`	`responses.extend(response)`
`16`	`16`	`# output_path = f"{OUTPUT_PATH}/raw/sample_contest_ranking.csv" # Local file path for sample output data`
`17`		`- output_path = f"s3://{BUCKET_NAME}/raw_contest_ranking.csv" # Amazon S3 storage path`
	`17`	`+ output_path = f"s3://{BUCKET_NAME}/raw/contest_ranking.csv" # Amazon S3 storage path`
`18`	`18`	`pd.DataFrame(responses).to_csv(output_path, index=False)`
`19`	`19`	`return output_path`
`20`	`20`
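Note on the path change above: the new path moves the file from a `raw_` filename prefix at the bucket root to a `raw/` key prefix, which lets raw objects be selected by prefix. A small sketch of the difference follows; `stage_output_path` and the bucket name `example-bucket` are hypothetical, and only the `raw` stage appears in the commit.

```python
from urllib.parse import urlparse


def stage_output_path(bucket_name: str, stage: str, filename: str) -> str:
    """Build an S3 URI of the form s3://<bucket>/<stage>/<filename>."""
    return f"s3://{bucket_name}/{stage}/{filename}"


def object_key(s3_uri: str) -> str:
    """Return the object key (the path without the leading slash) of an S3 URI."""
    return urlparse(s3_uri).path.lstrip("/")


# "example-bucket" is a placeholder; the real BUCKET_NAME is not shown in the diff.
new_path = stage_output_path("example-bucket", "raw", "contest_ranking.csv")
old_path = "s3://example-bucket/raw_contest_ranking.csv"

# Only the new layout is matched by a "raw/" key prefix.
new_under_raw_prefix = object_key(new_path).startswith("raw/")
old_under_raw_prefix = object_key(old_path).startswith("raw/")
```

Writing to an `s3://` path with `DataFrame.to_csv` relies on an S3 filesystem backend such as `s3fs` being installed alongside pandas.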
