@@ -22,6 +22,159 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python
Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

+ ## Demo: spark-pipelines CLI
+
+ This demo uses the `uv` package manager to set up a Python project with the PySpark Connect client (built from the Spark sources) and then runs a sample pipeline with the `spark-pipelines` CLI.
+
+ Create a new Python project:
+
+ ```bash
+ uv init hello-spark-pipelines
+ ```
+
+ ```bash
+ cd hello-spark-pipelines
+ ```
+
+ ```console
+ ❯ uv pip list
+ Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
+ Package Version
+ ------- -------
+ pip     24.3.1
+ ```
+
+ Point `SPARK_HOME` at your local Spark source checkout and add the PySpark Connect client as an editable package:
+
+ ```bash
+ export SPARK_HOME=/Users/jacek/oss/spark
+ ```
+
+ ```bash
+ uv add --editable $SPARK_HOME/python/packaging/client
+ ```
+
+ ```console
+ ❯ uv pip list
+ Package                  Version     Editable project location
+ ------------------------ ----------- ----------------------------------------------
+ googleapis-common-protos 1.70.0
+ grpcio                   1.74.0
+ grpcio-status            1.74.0
+ numpy                    2.3.2
+ pandas                   2.3.1
+ protobuf                 6.31.1
+ pyarrow                  21.0.0
+ pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
+ python-dateutil          2.9.0.post0
+ pytz                     2025.2
+ pyyaml                   6.0.2
+ six                      1.17.0
+ tzdata                   2025.2
+ ```
+
+ Activate the virtual environment:
+
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines --help
+ usage: cli.py [-h] {run,dry-run,init} ...
+
+ Pipelines CLI
+
+ positional arguments:
+   {run,dry-run,init}
+     run               Run a pipeline. If no refresh options specified, a
+                       default incremental update is performed.
+     dry-run           Launch a run that just validates the graph and checks
+                       for errors.
+     init              Generate a sample pipeline project, including a spec
+                       file and example definitions.
+
+ options:
+   -h, --help          show this help message and exit
+ ```
+
+ Without a pipeline spec file, `dry-run` fails:
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines dry-run
+ Traceback (most recent call last):
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
+     main()
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
+     spec_path = find_pipeline_spec(Path.cwd())
+                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
+     raise PySparkException(
+ pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+ ```
+
+ Generate a sample pipeline project and move its files into the current directory:
+
+ ```bash
+ $SPARK_HOME/bin/spark-pipelines init --name delete_me
+ ```
+
+ ```bash
+ mv delete_me/* . ; rm -rf delete_me
+ ```
+
+ ```console
+ ❯ cat pipeline.yml
+
+ name: delete_me
+ definitions:
+   - glob:
+       include: transformations/**/*.py
+   - glob:
+       include: transformations/**/*.sql
+ ```
+
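+ The generated `transformations/` directory holds one Python and one SQL example definition, both picked up by the globs above. As a rough sketch of what the Python transformation looks like — this assumes the `pyspark.pipelines` decorator API from the programming guide, and the exact file generated by `init` may differ:
+
+ ```python
+ # transformations/example_python_materialized_view.py (illustrative sketch)
+ from pyspark import pipelines as dp
+ from pyspark.sql import DataFrame, SparkSession
+
+ # The spark-pipelines CLI imports this module with an active Spark Connect session.
+ spark = SparkSession.active()
+
+ # Registers a materialized view named after the decorated function.
+ @dp.materialized_view
+ def example_python_materialized_view() -> DataFrame:
+     return spark.range(10)
+ ```
+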
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines dry-run
+ 2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
+ 2025-08-03 15:17:08: Creating Spark session...
+ ...
+ 2025-08-03 15:17:10: Creating dataflow graph...
+ 2025-08-03 15:17:10: Registering graph elements...
+ 2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
+ 2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
+ 2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+ 2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
+ 2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+ 2025-08-03 15:17:11: Starting run...
+ 2025-08-03 13:17:11: Run is COMPLETED.
+ ```
+
+ `dry-run` only validates the dataflow graph. Use `run` to execute the flows and publish the datasets:
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines run
+ ...
+ 2025-08-03 15:17:58: Creating dataflow graph...
+ 2025-08-03 15:17:58: Registering graph elements...
+ 2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
+ 2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
+ 2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+ 2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
+ 2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+ 2025-08-03 15:17:58: Starting run...
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
+ 2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
+ 2025-08-03 13:18:03: Run is COMPLETED.
+ ```
+
+ The two materialized views are written out under the `spark-warehouse` directory:
+
+ ```console
+ ❯ tree spark-warehouse
+ spark-warehouse
+ ├── example_python_materialized_view
+ │   ├── _SUCCESS
+ │   └── part-00000-75bc5b01-aea2-4d05-a71c-5c04937981bc-c000.snappy.parquet
+ └── example_sql_materialized_view
+     ├── _SUCCESS
+     └── part-00000-e1d0d33c-5d9e-43c3-a87d-f5f772d32942-c000.snappy.parquet
+
+ 3 directories, 4 files
+ ```
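+
+ As a quick sanity check (not part of the demo output), the published data can be read back straight from the warehouse directory with `pyarrow`, which was installed above as a dependency of `pyspark-client`. The part-file names will differ on every run:
+
+ ```python
+ import pyarrow.parquet as pq
+
+ # Directory reads use pyarrow's dataset discovery, which skips files
+ # with a leading underscore such as the _SUCCESS marker.
+ table = pq.read_table("spark-warehouse/example_python_materialized_view")
+ print(table.to_pandas())
+ ```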
+
## Spark Connect Only { #spark-connect }
Declarative Pipelines currently only supports Spark Connect.
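
For reference, a Spark Connect session is created with a `sc://` remote URL. A minimal PySpark sketch, assuming a Spark Connect server on the default local port:

```python
from pyspark.sql import SparkSession

# sc://localhost:15002 is an assumption for a locally running
# Spark Connect server; adjust host and port as needed.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```
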
@@ -44,7 +197,15 @@ Exception in thread "main" org.apache.spark.SparkUserAppException: User applicat
The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).
- ## Demo
+ ## Dataset Types
+
+ Declarative Pipelines supports the following dataset types (see the Python sketch after this list):
+
+ * **Materialized Views** (datasets) that are published to a catalog
+ * **Tables** that are published to a catalog
+ * **Views** that are not published to a catalog
+
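+ A minimal Python sketch of the three dataset types, using the `pyspark.pipelines` decorators (the dataset names and queries below are made up for illustration):
+
+ ```python
+ from pyspark import pipelines as dp
+ from pyspark.sql import DataFrame, SparkSession
+
+ spark = SparkSession.active()
+
+ # Materialized view: published to the catalog.
+ @dp.materialized_view
+ def ids() -> DataFrame:
+     return spark.range(5)
+
+ # Table: published to the catalog.
+ @dp.table
+ def doubled_ids() -> DataFrame:
+     return spark.read.table("ids").selectExpr("id * 2 AS id")
+
+ # Temporary view: visible to other flows in the pipeline, but not published.
+ @dp.temporary_view
+ def ids_view() -> DataFrame:
+     return spark.read.table("ids")
+ ```
+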
+ ## Demo: Scala API
### Step 1. Register Dataflow Graph
@@ -102,14 +263,6 @@ val updateCtx: PipelineUpdateContext =
updateCtx.pipelineExecution.runPipeline()
```
- ## Dataset Types
-
- Declarative Pipelines supports the following dataset types:
-
- * ** Materialized Views** (datasets) that are published to a catalog
- * ** Table** that are published to a catalog
- * ** Views** that are not published to a catalog
-
## Learning Resources
1. [Spark Declarative Pipelines Programming Guide](https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md)