Commit 4741467

[SDP] Python Import Alias Convention (dp) + Demo improvements
1 parent f6f646a commit 4741467

File tree

2 files changed: +168 -132 lines

docs/declarative-pipelines/index.md

Lines changed: 167 additions & 132 deletions
@@ -36,12 +36,20 @@ Declarative Pipelines uses [Python decorators](#python-decorators) to describe t

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

+## Python Import Alias Convention

+As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1daab83bee), the convention for aliasing the Declarative Pipelines import in Python is `dp` (changed from `sdp`).

+```python
+from pyspark import pipelines as dp
+```

## Python Decorators for Datasets and Flows { #python-decorators }

Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:

-* [@sdp.materialized_view](#materialized_view) for materialized views
-* [@sdp.table](#table) for streaming and batch tables
+* [@dp.materialized_view](#materialized_view) for materialized views
+* [@dp.table](#table) for streaming and batch tables
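
To make the new convention concrete, here is a minimal sketch of a definitions file that uses both decorators under the `dp` alias (the dataset and source table names are hypothetical, and the active `SparkSession` is assumed to be the one Declarative Pipelines provides when it loads definition files):

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# Assumption: a SparkSession is already active when this definitions file is loaded.
spark = SparkSession.active()


@dp.materialized_view
def orders_summary() -> DataFrame:
    # A materialized view is defined by a batch query (source table name is hypothetical).
    return spark.read.table("raw_orders").groupBy("status").count()


@dp.table
def orders() -> DataFrame:
    # A streaming source makes this a streaming table; a batch query would make it a batch table.
    return spark.readStream.table("raw_orders")
```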

### pyspark.pipelines Python Module { #pyspark_pipelines }

@@ -56,35 +64,87 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python
Use the following import in your Python code:

```py
-from pyspark import pipelines as sdp
+from pyspark import pipelines as dp
+```

+### @dp.append_flow { #append_flow }

+### @dp.create_streaming_table { #create_streaming_table }

+### @dp.materialized_view { #materialized_view }

+### @dp.table { #table }

+### @dp.temporary_view { #temporary_view }

+## Demo: Create Virtual Environment for Python Client

+```shell
+uv init hello-spark-pipelines && cd hello-spark-pipelines
+```

+```shell
+export SPARK_HOME=/Users/jacek/oss/spark
```

-### @sdp.append_flow { #append_flow }
+```shell
+uv add --editable $SPARK_HOME/python/packaging/client
+```

-### @sdp.create_streaming_table { #create_streaming_table }
+```shell
+uv pip list
+```

-### @sdp.materialized_view { #materialized_view }
+??? note "Output"

-### @sdp.table { #table }
+    ```text
+    Package                  Version     Editable project location
+    ------------------------ ----------- ----------------------------------------------
+    googleapis-common-protos 1.70.0
+    grpcio                   1.74.0
+    grpcio-status            1.74.0
+    numpy                    2.3.2
+    pandas                   2.3.1
+    protobuf                 6.31.1
+    pyarrow                  21.0.0
+    pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
+    python-dateutil          2.9.0.post0
+    pytz                     2025.2
+    pyyaml                   6.0.2
+    six                      1.17.0
+    tzdata                   2025.2
+    ```

-### @sdp.temporary_view { #temporary_view }
+Activate (_source_) the virtual environment (that `uv` helped us create).

+```shell
+source .venv/bin/activate
+```

+This activation brings in all the necessary PySpark modules that have not been released yet and are only available in source form (incl. Spark Declarative Pipelines).
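
As a quick sanity check (a suggestion, not part of the original demo), the source-only module should now resolve from the editable install:

```python
# Run in the activated .venv (e.g. in a `python` REPL started from the project directory).
from pyspark import pipelines as dp

# Should print a path under $SPARK_HOME/python/pyspark/pipelines/.
print(dp.__file__)
```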

## Demo: Python API

+??? warning "Activate Virtual Environment"
+    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

In a terminal, start a Spark Connect Server.

-```bash
+```shell
./sbin/start-connect-server.sh
```

It will listen on port 15002.

-??? note "Tip"
-    Review the logs with `tail -f`.
+??? note "Monitor Logs"

+    ```shell
+    tail -f logs/*org.apache.spark.sql.connect.service.SparkConnectServer*.out
+    ```

Start a Spark Connect-enabled PySpark shell.

-```bash
+```shell
$SPARK_HOME/bin/pyspark --remote sc://localhost:15002
```
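
The shell above talks to the server over Spark Connect; equivalently (a sketch, not the demo's elided setup steps), a standalone Python script can open a session against the same address:

```python
from pyspark.sql import SparkSession

# Connect to the Spark Connect Server started above (it listens on port 15002).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

print(spark.version)
```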

@@ -107,13 +167,13 @@ registry = SparkConnectGraphElementRegistry(spark, dataflow_graph_id)
```

```py
-from pyspark import pipelines as sdp
+from pyspark import pipelines as dp
```

```py
from pyspark.pipelines.graph_element_registry import graph_element_registration_context
with graph_element_registration_context(registry):
-    sdp.create_streaming_table("demo_streaming_table")
+    dp.create_streaming_table("demo_streaming_table")
```

You should see the following INFO message in the logs of the Spark Connect Server:
@@ -128,95 +188,63 @@ INFO PipelinesHandler: Define pipelines dataset cmd received: define_dataset {

## Demo: spark-pipelines CLI

-```bash
-uv init hello-spark-pipelines
-```
+??? warning "Activate Virtual Environment"
+    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

-```bash
-cd hello-spark-pipelines
-```
+Run `spark-pipelines --help` to learn about the available options.

-```console
-uv pip list
-Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
-Package Version
-------- -------
-pip     24.3.1
-```
+=== "Command Line"

-```bash
-export SPARK_HOME=/Users/jacek/oss/spark
-```
+    ```shell
+    $ $SPARK_HOME/bin/spark-pipelines --help
+    ```

-```bash
-uv add --editable $SPARK_HOME/python/packaging/client
-```
+!!! note ""

-```console
-uv pip list
-Package                  Version     Editable project location
-------------------------- ----------- ----------------------------------------------
-googleapis-common-protos 1.70.0
-grpcio                   1.74.0
-grpcio-status            1.74.0
-numpy                    2.3.2
-pandas                   2.3.1
-protobuf                 6.31.1
-pyarrow                  21.0.0
-pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
-python-dateutil          2.9.0.post0
-pytz                     2025.2
-pyyaml                   6.0.2
-six                      1.17.0
-tzdata                   2025.2
-```
+    ```text
+    usage: cli.py [-h] {run,dry-run,init} ...

-Activate (_source_) the virtual environment (that `uv` helped us create).
-It will bring all the necessary PySpark modules that have not been released yet and are only available in the source format only.
+    Pipelines CLI

-```bash
-source .venv/bin/activate
-```
+    positional arguments:
+      {run,dry-run,init}
+        run       Run a pipeline. If no refresh options specified, a
+                  default incremental update is performed.
+        dry-run   Launch a run that just validates the graph and checks
+                  for errors.
+        init      Generate a sample pipeline project, including a spec
+                  file and example definitions.

-```console
-$SPARK_HOME/bin/spark-pipelines --help
-usage: cli.py [-h] {run,dry-run,init} ...

-Pipelines CLI

-positional arguments:
-  {run,dry-run,init}
-    run       Run a pipeline. If no refresh options specified, a
-              default incremental update is performed.
-    dry-run   Launch a run that just validates the graph and checks
-              for errors.
-    init      Generate a sample pipeline project, including a spec
-              file and example definitions.

-options:
-  -h, --help  show this help message and exit
-```
+    options:
+      -h, --help  show this help message and exit
+    ```

-```bash
-$SPARK_HOME/bin/spark-pipelines dry-run
-```
+Execute `spark-pipelines dry-run` to validate the graph and check for errors.

-??? note "Output"
-    ```console
-    Traceback (most recent call last):
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
-        main()
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
-        spec_path = find_pipeline_spec(Path.cwd())
-                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
-        raise PySparkException(
-    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+You haven't created a pipeline graph yet (so the exception below is expected).

+=== "Command Line"

+    ```shell
+    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

+!!! note ""
+    ```console
+    Traceback (most recent call last):
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
+        main()
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
+        spec_path = find_pipeline_spec(Path.cwd())
+                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
+        raise PySparkException(
+    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+    ```

Create a demo `hello-spark-pipelines` pipeline project with a sample `pipeline.yml` and sample transformations (in Python and in SQL).

-```bash
+```shell
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines && \
mv hello-spark-pipelines/* . && \
rm -rf hello-spark-pipelines
@@ -242,53 +270,60 @@ transformations
1 directory, 2 files
```
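
The generated `transformations/` files are tiny. The following is a hedged sketch of what the Python one typically looks like (illustrative only, not the verbatim `init` template):

```python
# transformations/example_python_materialized_view.py (illustrative sketch)
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# Assumption: an active SparkSession is available when the CLI loads definitions.
spark = SparkSession.active()


@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    # Any DataFrame-returning query works here; spark.range is just a placeholder.
    return spark.range(5)
```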

-```bash
-$SPARK_HOME/bin/spark-pipelines dry-run
-```
+=== "Command Line"

-??? note "Output"
-    ```text
-    2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
-    2025-08-03 15:17:08: Creating Spark session...
-    ...
-    2025-08-03 15:17:10: Creating dataflow graph...
-    2025-08-03 15:17:10: Registering graph elements...
-    2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
-    2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-03 15:17:11: Starting run...
-    2025-08-03 13:17:11: Run is COMPLETED.
+    ```shell
+    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

-```bash
-$SPARK_HOME/bin/spark-pipelines run
-```
+!!! note ""

-??? note "Output"
-    ```console
-    2025-08-03 15:17:58: Creating dataflow graph...
-    2025-08-03 15:17:58: Registering graph elements...
-    2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
-    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-03 15:17:58: Starting run...
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
-    2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
-    2025-08-03 13:18:03: Run is COMPLETED.
+    ```text
+    2025-08-31 12:26:59: Creating dataflow graph...
+    2025-08-31 12:27:00: Dataflow graph created (ID: c11526a6-bffe-4708-8efe-7c146696d43c).
+    2025-08-31 12:27:00: Registering graph elements...
+    2025-08-31 12:27:00: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.py'
+    2025-08-31 12:27:00: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.sql'
+    2025-08-31 12:27:00: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-08-31 12:27:00: Starting run (dry=True, full_refresh=[], full_refresh_all=False, refresh=[])...
+    2025-08-31 10:27:00: Run is COMPLETED.
+    ```

+Run the pipeline.

+=== "Command Line"

+    ```shell
+    $SPARK_HOME/bin/spark-pipelines run
    ```

+!!! note ""

+    ```console
+    2025-08-31 12:29:04: Creating dataflow graph...
+    2025-08-31 12:29:04: Dataflow graph created (ID: 3851261d-9d74-416a-8ec6-22a28bee381c).
+    2025-08-31 12:29:04: Registering graph elements...
+    2025-08-31 12:29:04: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.py'
+    2025-08-31 12:29:04: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.sql'
+    2025-08-31 12:29:04: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-08-31 12:29:04: Starting run (dry=False, full_refresh=[], full_refresh_all=False, refresh=[])...
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is STARTING.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
+    2025-08-31 10:29:06: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
+    2025-08-31 10:29:09: Run is COMPLETED.
+    ```

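Once the run completes, the two materialized views are written under the project's `spark-warehouse` directory (shown below). The following is a sketch of how you might inspect them afterwards, assuming a session that shares the pipeline's warehouse and metastore (e.g. a PySpark shell started in the project directory; with a purely in-memory catalog the table names may not be visible to a fresh session):

```python
from pyspark.sql import SparkSession

# Assumes this session shares the pipeline's warehouse/metastore (start it in the project dir).
spark = SparkSession.builder.getOrCreate()

spark.sql("SHOW TABLES IN spark_catalog.default").show(truncate=False)
spark.read.table("spark_catalog.default.example_python_materialized_view").show()
```
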
```console
tree spark-warehouse
spark-warehouse
@@ -365,7 +400,7 @@ val graphCtx: GraphRegistrationContext =
```scala
import org.apache.spark.sql.pipelines.graph.DataflowGraph

-val sdp: DataflowGraph = graphCtx.toDataflowGraph
+val dp: DataflowGraph = graphCtx.toDataflowGraph
```

### Step 4. Create Update Context
@@ -379,7 +414,7 @@ import org.apache.spark.sql.pipelines.logging.PipelineEvent
val swallowEventsCallback: PipelineEvent => Unit = _ => ()

val updateCtx: PipelineUpdateContext =
-  new PipelineUpdateContextImpl(unresolvedGraph=sdp, eventCallback=swallowEventsCallback)
+  new PipelineUpdateContextImpl(unresolvedGraph=dp, eventCallback=swallowEventsCallback)
```

### Step 5. Start Pipeline
