Commit 4741467

[SDP] Python Import Alias Convention (dp) + Demo improvements
1 parent f6f646a commit 4741467

File tree

2 files changed: +168 -132 lines

docs/declarative-pipelines/index.md

Lines changed: 167 additions & 132 deletions
@@ -36,12 +36,20 @@ Declarative Pipelines uses [Python decorators](#python-decorators) to describe t

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

+## Python Import Alias Convention

+As of this [Commit 6ab0df9]({{ spark.commit }}/6ab0df9287c5a9ce49769612c2bb0a1daab83bee), the convention for aliasing the Declarative Pipelines import in Python is `dp` (changed from `sdp`).

+```python
+from pyspark import pipelines as dp
+```

## Python Decorators for Datasets and Flows { #python-decorators }

Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:

-* [@sdp.materialized_view](#materialized_view) for materialized views
-* [@sdp.table](#table) for streaming and batch tables
+* [@dp.materialized_view](#materialized_view) for materialized views
+* [@dp.table](#table) for streaming and batch tables
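
To make the new convention concrete, here is a minimal sketch of a definitions file that uses both decorators under the `dp` alias (the dataset and source table names are hypothetical, and the active `SparkSession` is assumed to be the one Declarative Pipelines provides when it loads definition files):

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# Assumption: a SparkSession is already active when this definitions file is loaded.
spark = SparkSession.active()


@dp.materialized_view
def orders_summary() -> DataFrame:
    # A materialized view is defined by a batch query (source table name is hypothetical).
    return spark.read.table("raw_orders").groupBy("status").count()


@dp.table
def orders() -> DataFrame:
    # A streaming source makes this a streaming table; a batch query would make it a batch table.
    return spark.readStream.table("raw_orders")
```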

### pyspark.pipelines Python Module { #pyspark_pipelines }

@@ -56,35 +64,87 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python
Use the following import in your Python code:

```py
-from pyspark import pipelines as sdp
+from pyspark import pipelines as dp
+```

+### @dp.append_flow { #append_flow }

+### @dp.create_streaming_table { #create_streaming_table }

+### @dp.materialized_view { #materialized_view }

+### @dp.table { #table }

+### @dp.temporary_view { #temporary_view }

+## Demo: Create Virtual Environment for Python Client

+```shell
+uv init hello-spark-pipelines && cd hello-spark-pipelines
+```

+```shell
+export SPARK_HOME=/Users/jacek/oss/spark
```

-### @sdp.append_flow { #append_flow }
+```shell
+uv add --editable $SPARK_HOME/python/packaging/client
+```

-### @sdp.create_streaming_table { #create_streaming_table }
+```shell
+uv pip list
+```

-### @sdp.materialized_view { #materialized_view }
+??? note "Output"

-### @sdp.table { #table }
+    ```text
+    Package                  Version     Editable project location
+    ------------------------ ----------- ----------------------------------------------
+    googleapis-common-protos 1.70.0
+    grpcio                   1.74.0
+    grpcio-status            1.74.0
+    numpy                    2.3.2
+    pandas                   2.3.1
+    protobuf                 6.31.1
+    pyarrow                  21.0.0
+    pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
+    python-dateutil          2.9.0.post0
+    pytz                     2025.2
+    pyyaml                   6.0.2
+    six                      1.17.0
+    tzdata                   2025.2
+    ```

-### @sdp.temporary_view { #temporary_view }
+Activate (_source_) the virtual environment (that `uv` helped us create).

+```shell
+source .venv/bin/activate
+```

+This activation brings in all the necessary PySpark modules that have not been released yet and are only available in source form (incl. Spark Declarative Pipelines).
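
As a quick sanity check (a suggestion, not part of the original demo), the source-only module should now resolve from the editable install:

```python
# Run in the activated .venv (e.g. in a `python` REPL started from the project directory).
from pyspark import pipelines as dp

# Should print a path under $SPARK_HOME/python/pyspark/pipelines/.
print(dp.__file__)
```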

## Demo: Python API

+??? warning "Activate Virtual Environment"
+    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

In a terminal, start a Spark Connect Server.

-```bash
+```shell
./sbin/start-connect-server.sh
```

It will listen on port 15002.

-??? note "Tip"
-    Review the logs with `tail -f`.
+??? note "Monitor Logs"

+    ```shell
+    tail -f logs/*org.apache.spark.sql.connect.service.SparkConnectServer*.out
+    ```

Start a Spark Connect-enabled PySpark shell.

-```bash
+```shell
$SPARK_HOME/bin/pyspark --remote sc://localhost:15002
```
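
The shell above talks to the server over Spark Connect; equivalently (a sketch, not the demo's elided setup steps), a standalone Python script can open a session against the same address:

```python
from pyspark.sql import SparkSession

# Connect to the Spark Connect Server started above (it listens on port 15002).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

print(spark.version)
```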

@@ -107,13 +167,13 @@ registry = SparkConnectGraphElementRegistry(spark, dataflow_graph_id)
```

```py
-from pyspark import pipelines as sdp
+from pyspark import pipelines as dp
```

```py
from pyspark.pipelines.graph_element_registry import graph_element_registration_context
with graph_element_registration_context(registry):
-    sdp.create_streaming_table("demo_streaming_table")
+    dp.create_streaming_table("demo_streaming_table")
```

You should see the following INFO message in the logs of the Spark Connect Server:
@@ -128,95 +188,63 @@ INFO PipelinesHandler: Define pipelines dataset cmd received: define_dataset {

## Demo: spark-pipelines CLI

-```bash
-uv init hello-spark-pipelines
-```
+??? warning "Activate Virtual Environment"
+    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.

-```bash
-cd hello-spark-pipelines
-```
+Run `spark-pipelines --help` to learn about the available options.

-```console
-uv pip list
-Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
-Package Version
-------- -------
-pip     24.3.1
-```
+=== "Command Line"

-```bash
-export SPARK_HOME=/Users/jacek/oss/spark
-```
+    ```shell
+    $ $SPARK_HOME/bin/spark-pipelines --help
+    ```

-```bash
-uv add --editable $SPARK_HOME/python/packaging/client
-```
+!!! note ""

-```console
-uv pip list
-Package                  Version     Editable project location
-------------------------- ----------- ----------------------------------------------
-googleapis-common-protos 1.70.0
-grpcio                   1.74.0
-grpcio-status            1.74.0
-numpy                    2.3.2
-pandas                   2.3.1
-protobuf                 6.31.1
-pyarrow                  21.0.0
-pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
-python-dateutil          2.9.0.post0
-pytz                     2025.2
-pyyaml                   6.0.2
-six                      1.17.0
-tzdata                   2025.2
-```
+    ```text
+    usage: cli.py [-h] {run,dry-run,init} ...

-Activate (_source_) the virtual environment (that `uv` helped us create).
-It will bring all the necessary PySpark modules that have not been released yet and are only available in the source format only.
+    Pipelines CLI

-```bash
-source .venv/bin/activate
-```
+    positional arguments:
+      {run,dry-run,init}
+        run       Run a pipeline. If no refresh options specified, a
+                  default incremental update is performed.
+        dry-run   Launch a run that just validates the graph and checks
+                  for errors.
+        init      Generate a sample pipeline project, including a spec
+                  file and example definitions.

-```console
-$SPARK_HOME/bin/spark-pipelines --help
-usage: cli.py [-h] {run,dry-run,init} ...

-Pipelines CLI

-positional arguments:
-  {run,dry-run,init}
-    run       Run a pipeline. If no refresh options specified, a
-              default incremental update is performed.
-    dry-run   Launch a run that just validates the graph and checks
-              for errors.
-    init      Generate a sample pipeline project, including a spec
-              file and example definitions.

-options:
-  -h, --help  show this help message and exit
-```
+    options:
+      -h, --help  show this help message and exit
+    ```

-```bash
-$SPARK_HOME/bin/spark-pipelines dry-run
-```
+Execute `spark-pipelines dry-run` to validate the graph and check for errors.

-??? note "Output"
-    ```console
-    Traceback (most recent call last):
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
-        main()
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
-        spec_path = find_pipeline_spec(Path.cwd())
-                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
-        raise PySparkException(
-    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+You haven't created a pipeline graph yet (so the exception below is expected).

+=== "Command Line"

+    ```shell
+    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

+!!! note ""
+    ```console
+    Traceback (most recent call last):
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
+        main()
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
+        spec_path = find_pipeline_spec(Path.cwd())
+                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
+        raise PySparkException(
+    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+    ```

Create a demo `hello-spark-pipelines` pipeline project with a sample `pipeline.yml` and sample transformations (in Python and in SQL).

-```bash
+```shell
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines && \
mv hello-spark-pipelines/* . && \
rm -rf hello-spark-pipelines
@@ -242,53 +270,60 @@ transformations
1 directory, 2 files
```
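
The generated `transformations/` files are tiny. The following is a hedged sketch of what the Python one typically looks like (illustrative only, not the verbatim `init` template):

```python
# transformations/example_python_materialized_view.py (illustrative sketch)
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# Assumption: an active SparkSession is available when the CLI loads definitions.
spark = SparkSession.active()


@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    # Any DataFrame-returning query works here; spark.range is just a placeholder.
    return spark.range(5)
```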

-```bash
-$SPARK_HOME/bin/spark-pipelines dry-run
-```
+=== "Command Line"

-??? note "Output"
-    ```text
-    2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
-    2025-08-03 15:17:08: Creating Spark session...
-    ...
-    2025-08-03 15:17:10: Creating dataflow graph...
-    2025-08-03 15:17:10: Registering graph elements...
-    2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
-    2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-03 15:17:11: Starting run...
-    2025-08-03 13:17:11: Run is COMPLETED.
+    ```shell
+    $SPARK_HOME/bin/spark-pipelines dry-run
    ```

-```bash
-$SPARK_HOME/bin/spark-pipelines run
-```
+!!! note ""

-??? note "Output"
-    ```console
-    2025-08-03 15:17:58: Creating dataflow graph...
-    2025-08-03 15:17:58: Registering graph elements...
-    2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
-    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-03 15:17:58: Starting run...
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
-    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
-    2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
-    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
-    2025-08-03 13:18:03: Run is COMPLETED.
+    ```text
+    2025-08-31 12:26:59: Creating dataflow graph...
+    2025-08-31 12:27:00: Dataflow graph created (ID: c11526a6-bffe-4708-8efe-7c146696d43c).
+    2025-08-31 12:27:00: Registering graph elements...
+    2025-08-31 12:27:00: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.py'
+    2025-08-31 12:27:00: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.sql'
+    2025-08-31 12:27:00: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-08-31 12:27:00: Starting run (dry=True, full_refresh=[], full_refresh_all=False, refresh=[])...
+    2025-08-31 10:27:00: Run is COMPLETED.
+    ```

+Run the pipeline.

+=== "Command Line"

+    ```shell
+    $SPARK_HOME/bin/spark-pipelines run
    ```

+!!! note ""

+    ```console
+    2025-08-31 12:29:04: Creating dataflow graph...
+    2025-08-31 12:29:04: Dataflow graph created (ID: 3851261d-9d74-416a-8ec6-22a28bee381c).
+    2025-08-31 12:29:04: Registering graph elements...
+    2025-08-31 12:29:04: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.py'
+    2025-08-31 12:29:04: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.sql'
+    2025-08-31 12:29:04: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-08-31 12:29:04: Starting run (dry=False, full_refresh=[], full_refresh_all=False, refresh=[])...
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is STARTING.
+    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
+    2025-08-31 10:29:06: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
+    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
+    2025-08-31 10:29:09: Run is COMPLETED.
+    ```

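Once the run completes, the two materialized views are written under the project's `spark-warehouse` directory (shown below). The following is a sketch of how you might inspect them afterwards, assuming a session that shares the pipeline's warehouse and metastore (e.g. a PySpark shell started in the project directory; with a purely in-memory catalog the table names may not be visible to a fresh session):

```python
from pyspark.sql import SparkSession

# Assumes this session shares the pipeline's warehouse/metastore (start it in the project dir).
spark = SparkSession.builder.getOrCreate()

spark.sql("SHOW TABLES IN spark_catalog.default").show(truncate=False)
spark.read.table("spark_catalog.default.example_python_materialized_view").show()
```
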
```console
tree spark-warehouse
spark-warehouse
@@ -365,7 +400,7 @@ val graphCtx: GraphRegistrationContext =
```scala
import org.apache.spark.sql.pipelines.graph.DataflowGraph

-val sdp: DataflowGraph = graphCtx.toDataflowGraph
+val dp: DataflowGraph = graphCtx.toDataflowGraph
```

### Step 4. Create Update Context
@@ -379,7 +414,7 @@ import org.apache.spark.sql.pipelines.logging.PipelineEvent
val swallowEventsCallback: PipelineEvent => Unit = _ => ()

val updateCtx: PipelineUpdateContext =
-  new PipelineUpdateContextImpl(unresolvedGraph=sdp, eventCallback=swallowEventsCallback)
+  new PipelineUpdateContextImpl(unresolvedGraph=dp, eventCallback=swallowEventsCallback)
```

### Step 5. Start Pipeline
