Commit b8aee66

[SDP] Demo: spark-pipelines CLI

1 parent e53d60f commit b8aee66

1 file changed: docs/declarative-pipelines/index.md (+162 −9)

@@ -22,6 +22,159 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python

Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

## Demo: spark-pipelines CLI

This demo uses [uv](https://docs.astral.sh/uv/) to create a Python project with the PySpark Connect client (installed from a local Spark build) and runs a sample pipeline with the `spark-pipelines` CLI.

```bash
uv init hello-spark-pipelines
```

```bash
cd hello-spark-pipelines
```

```console
uv pip list
Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
Package Version
------- -------
pip     24.3.1
```

```bash
export SPARK_HOME=/Users/jacek/oss/spark
```

Install the PySpark Connect client (`pyspark-client`) from the local Spark build:

```bash
uv add $SPARK_HOME/python/packaging/client
```

```console
uv pip list
Package                  Version
------------------------ -----------
googleapis-common-protos 1.70.0
grpcio                   1.74.0
grpcio-status            1.74.0
numpy                    2.3.2
pandas                   2.3.1
protobuf                 6.31.1
pyarrow                  21.0.0
pyspark-client           4.1.0.dev0
python-dateutil          2.9.0.post0
pytz                     2025.2
pyyaml                   6.0.2
six                      1.17.0
tzdata                   2025.2
```

```bash
source .venv/bin/activate
```

```console
$ $SPARK_HOME/bin/spark-pipelines --help
usage: cli.py [-h] {run,dry-run,init} ...

Pipelines CLI

positional arguments:
  {run,dry-run,init}
    run               Run a pipeline. If no refresh options specified, a
                      default incremental update is performed.
    dry-run           Launch a run that just validates the graph and checks
                      for errors.
    init              Generate a sample pipeline project, including a spec
                      file and example definitions.

options:
  -h, --help          show this help message and exit
```

```console
$SPARK_HOME/bin/spark-pipelines dry-run
Traceback (most recent call last):
  File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
    main()
  File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
    spec_path = find_pipeline_spec(Path.cwd())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
    raise PySparkException(
pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
```

```bash
$SPARK_HOME/bin/spark-pipelines init --name delete_me
```

`init` generates the sample project in a `delete_me` subdirectory, so move its contents to the current directory (where the CLI looks for the pipeline spec):

```bash
mv delete_me/* .; rm -rf delete_me
```

```console
cat pipeline.yml

name: delete_me
definitions:
  - glob:
      include: transformations/**/*.py
  - glob:
      include: transformations/**/*.sql
```
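The generated transformation files are not printed above. As a sketch of what `spark-pipelines init` produces (the exact template may differ), `transformations/example_python_materialized_view.py` defines a materialized view with the `pyspark.pipelines` decorators:

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.active()


@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    # A materialized view over a simple in-memory range
    return spark.range(10)
```

`transformations/example_sql_materialized_view.sql` similarly defines a materialized view with a `CREATE MATERIALIZED VIEW ... AS SELECT` statement.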

```console
$SPARK_HOME/bin/spark-pipelines dry-run
2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
2025-08-03 15:17:08: Creating Spark session...
...
2025-08-03 15:17:10: Creating dataflow graph...
2025-08-03 15:17:10: Registering graph elements...
2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
2025-08-03 15:17:11: Starting run...
2025-08-03 13:17:11: Run is COMPLETED.
```

```console
$SPARK_HOME/bin/spark-pipelines run
...
2025-08-03 15:17:58: Creating dataflow graph...
2025-08-03 15:17:58: Registering graph elements...
2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
2025-08-03 15:17:58: Starting run...
2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
2025-08-03 13:18:03: Run is COMPLETED.
```

```console
tree spark-warehouse
spark-warehouse
├── example_python_materialized_view
│   ├── _SUCCESS
│   └── part-00000-75bc5b01-aea2-4d05-a71c-5c04937981bc-c000.snappy.parquet
└── example_sql_materialized_view
    ├── _SUCCESS
    └── part-00000-e1d0d33c-5d9e-43c3-a87d-f5f772d32942-c000.snappy.parquet

3 directories, 4 files
```
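
To double-check the published datasets, one could query them from a PySpark session. A minimal sketch, assuming a local Spark Connect session started from the same directory (so it sees the same `spark-warehouse`):

```python
from pyspark.sql import SparkSession

# Assumption: a local Spark Connect server backed by the same spark-warehouse
spark = SparkSession.builder.remote("local[*]").getOrCreate()

spark.table("spark_catalog.default.example_python_materialized_view").show()
spark.table("spark_catalog.default.example_sql_materialized_view").show()
```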

## Spark Connect Only { #spark-connect }

Declarative Pipelines currently only supports Spark Connect.
@@ -44,7 +197,15 @@ Exception in thread "main" org.apache.spark.SparkUserAppException: User applicat

The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).

## Dataset Types

Declarative Pipelines supports the following dataset types (see the sketch after this list for the corresponding Python decorators):

* **Materialized Views** (datasets) that are published to a catalog
* **Tables** that are published to a catalog
* **Views** that are not published to a catalog
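
A minimal sketch of how each dataset type is declared with the `pyspark.pipelines` Python decorators (illustrative dataset names; decorator names follow the `pyspark.pipelines` module, but treat the details as assumptions):

```python
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.active()


@dp.materialized_view
def example_materialized_view() -> DataFrame:
    # Materialized view: published to a catalog
    return spark.range(10)


@dp.table
def example_table() -> DataFrame:
    # Table: published to a catalog
    return spark.read.table("example_materialized_view")


@dp.temporary_view
def example_view() -> DataFrame:
    # View: resolvable by other datasets in the pipeline,
    # but not published to a catalog
    return spark.range(5)
```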

## Demo: Scala API

### Step 1. Register Dataflow Graph

@@ -102,14 +263,6 @@ val updateCtx: PipelineUpdateContext =
updateCtx.pipelineExecution.runPipeline()
```

## Learning Resources

1. [Spark Declarative Pipelines Programming Guide](https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md)
