Commit 1c2b494

[DP] Spark Pipelines CLI and Spark Connect commands

1 parent 3d5627b commit 1c2b494

2 files changed: +102 −3 lines

docs/declarative-pipelines/SparkPipelines.md

61 additions & 2 deletions
@@ -4,12 +4,20 @@ title: SparkPipelines

# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.

`SparkPipelines` is a Scala "launchpad" that executes the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).

## PySpark Pipelines CLI

`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.

The Pipelines CLI supports the following commands (see the sample invocations below):

* [dry-run](#dry-run)
* [init](#init)
* [run](#run)

=== "uv run"

    ```console
    ...
    ```
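For illustration, here is how the three commands can be invoked through the shell script. This is a sketch: the `--name` option of `init` and the project name are assumptions not shown on this page.

```console
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines
$SPARK_HOME/bin/spark-pipelines dry-run
$SPARK_HOME/bin/spark-pipelines run
```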
@@ -61,3 +69,54 @@

Option | Description | Default
-|-|-
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)
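For example (the dataset names are illustrative, and whether these options can be combined is not covered on this page):

```shell
$SPARK_HOME/bin/spark-pipelines run --full-refresh-all
$SPARK_HOME/bin/spark-pipelines run --refresh orders,customers
```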
When executed, `run` prints out the following log message:

```text
Loading pipeline spec from [spec_path]...
```

`run` loads the pipeline spec.

`run` prints out the following log message:

```text
Creating Spark session...
```

`run` creates a Spark session with the configurations from the pipeline spec.

`run` prints out the following log message:

```text
Creating dataflow graph...
```

`run` sends a `CreateDataflowGraph` command for execution on the Spark Connect server.

!!! note "Spark Connect Server and Command Execution"
    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.

`run` prints out the following log message:

```text
Dataflow graph created (ID: [dataflow_graph_id]).
```

`run` prints out the following log message:

```text
Registering graph elements...
```

`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and calls `register_definitions`.

`run` prints out the following log message:

```text
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
```

`run` sends a `StartRun` command for execution on the Spark Connect server.

In the end, `run` keeps printing out pipeline events from the Spark Connect server.
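The flow above can be summarized in a short, self-contained Python sketch. Only the step order and the log messages come from this page; every helper below is a hypothetical stand-in, not PySpark's actual implementation.

```python
# Hypothetical sketch of the `run` flow; the helpers are stand-ins.
import uuid


def load_pipeline_spec(spec_path: str) -> dict:
    # Stand-in: the real CLI parses the YAML pipeline spec file.
    return {"name": "hello-spark-pipelines", "configuration": {}}


def create_dataflow_graph(spec: dict) -> str:
    # Stand-in for sending CreateDataflowGraph to the Spark Connect
    # server (handled there by PipelinesHandler).
    return str(uuid.uuid4())


def run(spec_path: str, dry: bool = False, full_refresh: str = "",
        full_refresh_all: bool = False, refresh: str = "") -> None:
    print(f"Loading pipeline spec from {spec_path}...")
    spec = load_pipeline_spec(spec_path)

    print("Creating Spark session...")
    # The real CLI applies the `configuration` from the spec here.

    print("Creating dataflow graph...")
    graph_id = create_dataflow_graph(spec)
    print(f"Dataflow graph created (ID: {graph_id}).")

    print("Registering graph elements...")
    # The real CLI creates a SparkConnectGraphElementRegistry and
    # calls register_definitions.

    print(f"Starting run (dry={dry}, full_refresh={full_refresh}, "
          f"full_refresh_all={full_refresh_all}, refresh={refresh})...")
    # The real CLI sends StartRun, then keeps printing pipeline events.


run("pipeline.yml")  # illustrative spec path
```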

docs/declarative-pipelines/index.md

41 additions & 1 deletion
@@ -44,6 +44,30 @@

```python
from pyspark import pipelines as dp
```

## Pipeline Specification File

The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).

The following fields are supported:

Field Name | Description
-|-
`name` (required) |
`catalog` |
`database` |
`schema` | Alias of `database`. Used unless `database` is defined
`configuration` |
`definitions` | List of `glob`s with `include` path patterns (see the example below)

```yaml
name: hello-spark-pipelines
definitions:
  - glob:
      include: transformations/**/*.py
  - glob:
      include: transformations/**/*.sql
```
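A fuller spec that uses the remaining fields might look like this. All values are illustrative assumptions; only the field names come from the table above.

```yaml
# Illustrative values only; the field names are from the table above.
name: hello-spark-pipelines
catalog: spark_catalog
database: default   # `schema` is an alias, used unless `database` is defined
configuration:
  spark.sql.shuffle.partitions: "1"
definitions:
  - glob:
      include: transformations/**/*.py
  - glob:
      include: transformations/**/*.sql
```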
## Python Decorators for Tables and Flows { #python-decorators }

Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:
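The decorator list itself lies beyond this hunk. For a flavor of the programming model, here is a minimal transformation module; it assumes `@dp.materialized_view` is among the decorators and that the CLI has already created the active Spark session (see the `run` walkthrough above):

```python
# transformations/hello.py -- illustrative; assumes @dp.materialized_view
# is among the decorators and that an active Spark session exists.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()


@dp.materialized_view
def hello_materialized_view():
    # The decorated function returns the DataFrame that defines the dataset.
    return spark.range(5)
```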
@@ -198,7 +222,7 @@

Run `spark-pipelines --help` to learn the options.

=== "Command Line"

    ```shell
    $SPARK_HOME/bin/spark-pipelines --help
    ```

!!! note ""
@@ -272,6 +296,22 @@ transformations

1 directory, 2 files

!!! warning "Spark Connect Server should be down"
    `spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless started with the `--remote` option).

    Shut down the Spark Connect Server if you have already started it:

    ```shell
    $SPARK_HOME/sbin/stop-connect-server.sh
    ```

!!! info "`--remote` option"
    Use the `--remote` option to connect to a standalone Spark Connect Server.

    ```shell
    $SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
    ```

=== "Command Line"

    ```shell
    ...
    ```
