Commit 3ab9b88

[SDP] Demo: spark-pipelines CLI

1 parent e53d60f commit 3ab9b88

1 file changed: docs/declarative-pipelines/index.md (+185 -9 lines)
@@ -22,6 +22,182 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

## Demo: spark-pipelines CLI

```bash
uv init hello-spark-pipelines
```

```bash
cd hello-spark-pipelines
```

```console
uv pip list
Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
Package Version
------- -------
pip     24.3.1
```

```bash
export SPARK_HOME=/Users/jacek/oss/spark
```

```bash
uv add --editable $SPARK_HOME/python/packaging/client
```

```console
uv pip list
Package                  Version     Editable project location
------------------------ ----------- ----------------------------------------------
googleapis-common-protos 1.70.0
grpcio                   1.74.0
grpcio-status            1.74.0
numpy                    2.3.2
pandas                   2.3.1
protobuf                 6.31.1
pyarrow                  21.0.0
pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
python-dateutil          2.9.0.post0
pytz                     2025.2
pyyaml                   6.0.2
six                      1.17.0
tzdata                   2025.2
```

Activate (_source_) the virtual environment (that `uv` helped us create).
It brings in all the necessary PySpark modules that have not been released yet and are available in source format only.

```bash
source .venv/bin/activate
```

```console
$SPARK_HOME/bin/spark-pipelines --help
usage: cli.py [-h] {run,dry-run,init} ...

Pipelines CLI

positional arguments:
  {run,dry-run,init}
    run          Run a pipeline. If no refresh options specified, a
                 default incremental update is performed.
    dry-run      Launch a run that just validates the graph and checks
                 for errors.
    init         Generate a sample pipeline project, including a spec
                 file and example definitions.

options:
  -h, --help     show this help message and exit
```

```bash
$SPARK_HOME/bin/spark-pipelines dry-run
```

??? note "Output"
    ```console
    Traceback (most recent call last):
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
        main()
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
        spec_path = find_pipeline_spec(Path.cwd())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
        raise PySparkException(
    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
    ```

Create a demo `hello-spark-pipelines` pipelines project with a sample `pipeline.yml` and sample transformations (in Python and in SQL).
115+
```bash
116+
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines && \
117+
mv hello-spark-pipelines/* . && \
118+
rm -rf hello-spark-pipelines
119+
```
120+
121+
```console
122+
cat pipeline.yml
123+
124+
name: hello-spark-pipelines
125+
definitions:
126+
- glob:
127+
include: transformations/**/*.py
128+
- glob:
129+
include: transformations/**/*.sql
130+
```
131+
132+
```console
133+
tree transformations
134+
transformations
135+
├── example_python_materialized_view.py
136+
└── example_sql_materialized_view.sql
137+
138+
1 directory, 2 files
139+
```
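
The Python transformation defines a single materialized view with the Declarative Pipelines Python decorators, while the SQL file does the same in SQL. A minimal sketch of what such a Python definitions file looks like (the exact body generated by `init` may differ):

```python
# transformations/example_python_materialized_view.py
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

# The spark-pipelines CLI imports this file with an active Spark Connect session.
spark = SparkSession.active()


@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    # Illustrative body; any query can back the materialized view.
    return spark.range(5)
```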

```bash
$SPARK_HOME/bin/spark-pipelines dry-run
```

??? note "Output"
    ```text
    2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
    2025-08-03 15:17:08: Creating Spark session...
    ...
    2025-08-03 15:17:10: Creating dataflow graph...
    2025-08-03 15:17:10: Registering graph elements...
    2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
    2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
    2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
    2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
    2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
    2025-08-03 15:17:11: Starting run...
    2025-08-03 13:17:11: Run is COMPLETED.
    ```

```bash
$SPARK_HOME/bin/spark-pipelines run
```

??? note "Output"
    ```console
    2025-08-03 15:17:58: Creating dataflow graph...
    2025-08-03 15:17:58: Registering graph elements...
    2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
    2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
    2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
    2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
    2025-08-03 15:17:58: Starting run...
    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
    2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
    2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
    2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
    2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
    2025-08-03 13:18:03: Run is COMPLETED.
    ```

```console
tree spark-warehouse
spark-warehouse
├── example_python_materialized_view
│   ├── _SUCCESS
│   └── part-00000-75bc5b01-aea2-4d05-a71c-5c04937981bc-c000.snappy.parquet
└── example_sql_materialized_view
    ├── _SUCCESS
    └── part-00000-e1d0d33c-5d9e-43c3-a87d-f5f772d32942-c000.snappy.parquet

3 directories, 4 files
```
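
The materialized views are regular tables in `spark_catalog.default`, so they can be queried back with any Spark Connect client. A sketch (assuming a Spark Connect server started from this project directory, e.g. with `$SPARK_HOME/sbin/start-connect-server.sh`, so it sees the local `spark-warehouse`):

```python
from pyspark.sql import SparkSession

# sc://localhost:15002 is the default Spark Connect endpoint.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.table("spark_catalog.default.example_python_materialized_view").show()
```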

## Spark Connect Only { #spark-connect }

Declarative Pipelines currently only supports Spark Connect.

@@ -44,7 +220,15 @@ Exception in thread "main" org.apache.spark.SparkUserAppException: User applicat

The `spark-pipelines` shell script launches [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Dataset Types

Declarative Pipelines supports the following dataset types (see the Python sketch below):

* **Materialized Views** (datasets) that are published to a catalog
* **Tables** that are published to a catalog
* **Views** that are not published to a catalog
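
A sketch of how the dataset types map onto the Python decorators (the dataset names and query bodies below are made up for illustration):

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.active()


@dp.table
def raw_numbers() -> DataFrame:
    # Table: published to a catalog.
    return spark.range(10)


@dp.temporary_view
def even_numbers() -> DataFrame:
    # View: private to the pipeline, never published to a catalog.
    return spark.table("raw_numbers").where("id % 2 = 0")


@dp.materialized_view
def even_count() -> DataFrame:
    # Materialized view: query results published to a catalog.
    return spark.table("even_numbers").groupBy().count()
```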

## Demo: Scala API

### Step 1. Register Dataflow Graph

@@ -102,14 +286,6 @@ val updateCtx: PipelineUpdateContext =

```scala
updateCtx.pipelineExecution.runPipeline()
```

## Learning Resources

1. [Spark Declarative Pipelines Programming Guide](https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md)
