Home

Edwin Chan edited this page Sep 5, 2022 · 12 revisions

Development Guide

Getting Started on Spark Development (spark-branch!)

Installing system requirements (Spark, Java, Anaconda)

Mac

Use this guide to install Java, Spark, and Anaconda on an M1 Mac: https://mungingdata.com/pyspark/install-delta-lake-jupyter-conda-mac/

Linux or Windows with WSL

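# Versions and install location (adjust as needed)
export SPARK_VERSION=3.2.0
export SPARK_DIRECTORY=/opt/spark
export HADOOP_VERSION=2.7
# /opt is usually root-owned, so create the directory with sudo and hand it to the current user
sudo mkdir -p "${SPARK_DIRECTORY}"
sudo chown "$(id -un)" "${SPARK_DIRECTORY}"
# Spark needs a JDK; install OpenJDK 8
sudo apt-get update
sudo apt-get -y install openjdk-8-jdk
# Download the Spark release from the Apache archive (-f fails on HTTP errors, -L follows redirects)
curl -fL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
--output "${SPARK_DIRECTORY}/spark.tgz"
# Unpack and rename so Spark lives at ${SPARK_DIRECTORY}/spark
cd "${SPARK_DIRECTORY}" && tar -xvzf spark.tgz && mv "spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" spark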
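Once the archive is unpacked, it can help to point your shell at the install and confirm that the launcher runs. The following is a minimal sketch, not a step from this guide: it assumes Spark ended up in ${SPARK_DIRECTORY}/spark as in the commands above, and that your tooling reads the standard SPARK_HOME variable. Whether the example script needs SPARK_HOME at all is not stated here.

```shell
# Assumption: Spark was unpacked to ${SPARK_DIRECTORY}/spark (default /opt/spark/spark).
export SPARK_HOME="${SPARK_DIRECTORY:-/opt/spark}/spark"
export PATH="${SPARK_HOME}/bin:${PATH}"

# Print the Spark version if the launcher is present; otherwise report where we looked.
if [ -x "${SPARK_HOME}/bin/spark-submit" ]; then
  "${SPARK_HOME}/bin/spark-submit" --version
else
  echo "spark-submit not found under ${SPARK_HOME}/bin" >&2
fi
```

To make the setting persistent, the two export lines could be added to your shell profile (for example ~/.bashrc).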

Install the Python library requirements in a conda env. Pull the [spark-branch](https://github.com/ydataai/pandas-profiling/tree/spark-branch) branch and run:

conda env create -f venv/spark.yml

This creates a conda env for Spark called spark-env with all requirements installed.

Then activate the environment using:

conda activate spark-env
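Before running anything, you may want to confirm that the activation took effect. The sketch below relies on the CONDA_DEFAULT_ENV variable, which conda sets to the active environment's name. The helper check_conda_env is hypothetical and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): succeeds only when the
# currently active conda environment has the expected name.
check_conda_env() {
  expected="$1"
  # conda exports CONDA_DEFAULT_ENV with the active environment's name
  if [ "${CONDA_DEFAULT_ENV:-}" = "${expected}" ]; then
    echo "Active conda env: ${expected}"
  else
    echo "Expected conda env '${expected}', found '${CONDA_DEFAULT_ENV:-none}'" >&2
    return 1
  fi
}

check_conda_env spark-env || echo "Activate spark-env before running the example."
```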

Finally, run the example script, which should execute and produce a profiling report for some sample Spark data:

python tests/backends/spark_backend/example.py

Don’t worry about any errors you see for now, as long as the report builds properly.
