Data Algorithms with Spark by Mahmoud Parsian
"... both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..."
Dr. Matei Zaharia
Original Creator of Apache Spark
Foreword by Dr. Matei Zaharia (Original Creator of Apache Spark)
- Author: Mahmoud Parsian
- This new O'Reilly book is the successor edition of Data Algorithms (also published by O'Reilly)
- This book uses PySpark (simpler and more readable)
- @OReillyMedia: Data Algorithms with Spark, by @mahmoudparsian
- Author contact: Email | Mahmoud Parsian @LinkedIn | Mahmoud Parsian @GitHub
- This GitHub repository hosts all source code and scripts for Data Algorithms with Spark
- Chapter solutions are provided in PySpark and Scala:
  - PySpark solutions are provided by Mahmoud Parsian
  - Scala solutions are provided by Deepak Kumar and Biman Mandal

All programs are tested with the following software:

Spark | Python | Scala | Java |
---|---|---|---|
Apache Spark 3.4.0 | Python 3.10.5 | Scala 2.13 | Java 11 |

Chapter | Title |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Chapter 1 | Introduction to Data Algorithms |
Chapter 2 | Transformations in Action |
Chapter 3 | Mapper Transformations |
Chapter 4 | Reductions in Spark |
Chapter 5 | Partitioning Data |
Chapter 6 | Graph Algorithms |
Chapter 7 | Interacting with External Data Sources |
Chapter 8 | Ranking Algorithms |
Chapter 9 | Fundamental Data Design Patterns |
Chapter 10 | Common Data Design Patterns |
Chapter 11 | Join Design Patterns |
Chapter 12 | Feature Engineering in PySpark |

Bonus Chapter | Title / Description |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Word Count | Solutions for Word Count using RDDs and DataFrames |
Anagrams | Find words that are anagrams of each other |
Lambda Expressions | Using Lambda Expressions in PySpark programs |
TF-IDF | Term Frequency - Inverse Document Frequency |
K-mers | K-mers for DNA Sequences |
Correlation | All vs. All Correlation |
Mapping Partitions | mapPartitions() Complete Example |
UDF | User-Defined Function Examples |
DataFrames Transformations | Examples of creating and transforming DataFrames |
DataFrames Tutorials | DataFrames tutorials: building DataFrames from collections and CSV text files |
Join Operations | Examples of joining RDDs and DataFrames |
PySpark Tutorial 101 | Examples of using PySpark RDDs and DataFrames |
Physical Data Partitioning | Tutorial on Physical Data Partitioning |
Monoids and Combiners | Monoid as a Design Principle |