I want to install external Python packages on EMR running on EC2, but so far nothing other than bootstrap actions has worked. The problem with that setup is that adding any new package means creating a brand-new cluster with an updated set of libraries in the bootstrap script, and because the cluster ID is hardcoded, a lot of other things have to change along with it.
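For context, the bootstrap action is essentially just a pip install script that runs on every node at launch (the package names here are placeholders):

#!/bin/bash
# bootstrap.sh -- runs on every node when the cluster is created;
# changing this package list means launching a brand-new cluster
sudo python3 -m pip install <package-1> <package-2>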
To work around this, I was doing something like the following:
aws emr add-steps \
--cluster-id <cluster-id> \
--steps Type=Spark,Name="Run PySpark Framework",ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,--master,yarn,\
--archives,s3://msk-nimbuspost-connectors/pyspark_env.tar.gz#pyspark_env,\
--py-files,s3://msk-nimbuspost-connectors/src.zip,\
s3://msk-connectors/main.py,\
--env,prod]
But this doesn't seem to work. Here is how I built pyspark_env.tar.gz.
Approach 1:
pip3 install venv-pack   # run inside the activated virtual environment that has all the job's packages
venv-pack -f -o pyspark_env.tar.gz   # packs everything installed in the active venv into the .tar.gz
Approach 2:
I then tried creating a new folder, extracting the contents of the above tar into it, and creating a new tar from that folder, so that the archive paths look as follows:
(venv) @root-Pro % tar -tzf pyspark_env.tar.gz | head
pyspark_env/
pyspark_env/bin/
pyspark_env/include/
pyspark_env/pyspark_env.tar.gz
pyspark_env/pyvenv.cfg
pyspark_env/lib/
pyspark_env/share/
pyspark_env/share/py4j/
pyspark_env/share/py4j/py4j0.10.9.5.jar
pyspark_env/lib/python3.11/
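(For reference, the repacking was done roughly like this, which is also why the original archive ended up nested inside at pyspark_env/pyspark_env.tar.gz:)

mkdir pyspark_env
mv pyspark_env.tar.gz pyspark_env/
cd pyspark_env && tar -xzf pyspark_env.tar.gz && cd ..   # unpack the Approach 1 archive inside the new folder
tar -czf pyspark_env.tar.gz pyspark_env/                 # re-tar so everything sits under the pyspark_env/ prefix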
Even this doesn't work. So, how can I achieve this?
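One thing I suspect, based on the Spark docs on Python package management, is that the driver and executors also need to be pointed at the interpreter inside the unpacked archive. For YARN cluster mode I believe that means extra entries in Args like the following, though I haven't confirmed this is the missing piece:

--conf,spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_env/bin/python,\
--conf,spark.executorEnv.PYSPARK_PYTHON=./pyspark_env/bin/python,\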
Would creating a requirements.txt and then using that file to install packages dynamically work? But how do I refresh the bootstrap actions without creating a new cluster every time?
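What I have in mind is a bootstrap script that pulls the package list from S3 instead of hardcoding it, something like the sketch below (the bucket path is a placeholder). But as far as I know, bootstrap actions only run when the cluster is created, so this still wouldn't let me add packages to an already-running cluster:

#!/bin/bash
# bootstrap.sh -- fetch the current requirements.txt from S3 at cluster launch
aws s3 cp s3://<my-bucket>/requirements.txt /tmp/requirements.txt
sudo python3 -m pip install -r /tmp/requirements.txt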