@@ -22,6 +22,159 @@ Declarative Pipelines uses the following [Python decorators](https://peps.python
Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (on a [PipelineExecution](PipelineExecution.md)).

+ ## Demo: spark-pipelines CLI
+
+ This demo uses the `uv` package manager to set up a Python project with the PySpark Connect client (built from the Spark sources) and then runs a sample pipeline with the `spark-pipelines` CLI.
+
+ Create a new Python project:
+
+ ```bash
+ uv init hello-spark-pipelines
+ ```
+
+ ```bash
+ cd hello-spark-pipelines
+ ```
+
+ ```console
+ ❯ uv pip list
+ Using Python 3.12.11 environment at: /Users/jacek/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none
+ Package Version
+ ------- -------
+ pip     24.3.1
+ ```
+
+ Point `SPARK_HOME` at your local Spark source checkout and add the PySpark Connect client as an editable package:
+
+ ```bash
+ export SPARK_HOME=/Users/jacek/oss/spark
+ ```
+
+ ```bash
+ uv add --editable $SPARK_HOME/python/packaging/client
+ ```
+
+ ```console
+ ❯ uv pip list
+ Package                  Version     Editable project location
+ ------------------------ ----------- ----------------------------------------------
+ googleapis-common-protos 1.70.0
+ grpcio                   1.74.0
+ grpcio-status            1.74.0
+ numpy                    2.3.2
+ pandas                   2.3.1
+ protobuf                 6.31.1
+ pyarrow                  21.0.0
+ pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
+ python-dateutil          2.9.0.post0
+ pytz                     2025.2
+ pyyaml                   6.0.2
+ six                      1.17.0
+ tzdata                   2025.2
+ ```
+
+ Activate the virtual environment:
+
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines --help
+ usage: cli.py [-h] {run,dry-run,init} ...
+
+ Pipelines CLI
+
+ positional arguments:
+   {run,dry-run,init}
+     run               Run a pipeline. If no refresh options specified, a
+                       default incremental update is performed.
+     dry-run           Launch a run that just validates the graph and checks
+                       for errors.
+     init              Generate a sample pipeline project, including a spec
+                       file and example definitions.
+
+ options:
+   -h, --help          show this help message and exit
+ ```
+
+ Without a pipeline spec file, `dry-run` fails:
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines dry-run
+ Traceback (most recent call last):
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
+     main()
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
+     spec_path = find_pipeline_spec(Path.cwd())
+                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+   File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
+     raise PySparkException(
+ pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
+ ```
+
+ Generate a sample pipeline project and move its files into the current directory:
+
+ ```bash
+ $SPARK_HOME/bin/spark-pipelines init --name delete_me
+ ```
+
+ ```bash
+ mv delete_me/* . ; rm -rf delete_me
+ ```
+
+ ```console
+ ❯ cat pipeline.yml
+
+ name: delete_me
+ definitions:
+   - glob:
+       include: transformations/**/*.py
+   - glob:
+       include: transformations/**/*.sql
+ ```
+
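+ The generated `transformations/` directory holds one Python and one SQL example definition, both picked up by the globs above. As a rough sketch of what the Python transformation looks like — this assumes the `pyspark.pipelines` decorator API from the programming guide, and the exact file generated by `init` may differ:
+
+ ```python
+ # transformations/example_python_materialized_view.py (illustrative sketch)
+ from pyspark import pipelines as dp
+ from pyspark.sql import DataFrame, SparkSession
+
+ # The spark-pipelines CLI imports this module with an active Spark Connect session.
+ spark = SparkSession.active()
+
+ # Registers a materialized view named after the decorated function.
+ @dp.materialized_view
+ def example_python_materialized_view() -> DataFrame:
+     return spark.range(10)
+ ```
+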
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines dry-run
+ 2025-08-03 15:17:08: Loading pipeline spec from /private/tmp/hello-spark-pipelines/pipeline.yml...
+ 2025-08-03 15:17:08: Creating Spark session...
+ ...
+ 2025-08-03 15:17:10: Creating dataflow graph...
+ 2025-08-03 15:17:10: Registering graph elements...
+ 2025-08-03 15:17:10: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
+ 2025-08-03 15:17:10: Found 1 files matching glob 'transformations/**/*.py'
+ 2025-08-03 15:17:10: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+ 2025-08-03 15:17:11: Found 1 files matching glob 'transformations/**/*.sql'
+ 2025-08-03 15:17:11: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+ 2025-08-03 15:17:11: Starting run...
+ 2025-08-03 13:17:11: Run is COMPLETED.
+ ```
+
+ `dry-run` only validates the dataflow graph. Use `run` to execute the flows and publish the datasets:
+
+ ```console
+ ❯ $SPARK_HOME/bin/spark-pipelines run
+ ...
+ 2025-08-03 15:17:58: Creating dataflow graph...
+ 2025-08-03 15:17:58: Registering graph elements...
+ 2025-08-03 15:17:58: Loading definitions. Root directory: '/private/tmp/hello-spark-pipelines'.
+ 2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.py'
+ 2025-08-03 15:17:58: Importing /private/tmp/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+ 2025-08-03 15:17:58: Found 1 files matching glob 'transformations/**/*.sql'
+ 2025-08-03 15:17:58: Registering SQL file /private/tmp/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+ 2025-08-03 15:17:58: Starting run...
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is STARTING.
+ 2025-08-03 13:17:59: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
+ 2025-08-03 13:18:00: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
+ 2025-08-03 13:18:01: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
+ 2025-08-03 13:18:03: Run is COMPLETED.
+ ```
+
+ The two materialized views are written out under the `spark-warehouse` directory:
+
+ ```console
+ ❯ tree spark-warehouse
+ spark-warehouse
+ ├── example_python_materialized_view
+ │   ├── _SUCCESS
+ │   └── part-00000-75bc5b01-aea2-4d05-a71c-5c04937981bc-c000.snappy.parquet
+ └── example_sql_materialized_view
+     ├── _SUCCESS
+     └── part-00000-e1d0d33c-5d9e-43c3-a87d-f5f772d32942-c000.snappy.parquet
+
+ 3 directories, 4 files
+ ```
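+
+ As a quick sanity check (not part of the demo output), the published data can be read back straight from the warehouse directory with `pyarrow`, which was installed above as a dependency of `pyspark-client`. The part-file names will differ on every run:
+
+ ```python
+ import pyarrow.parquet as pq
+
+ # Directory reads use pyarrow's dataset discovery, which skips files
+ # with a leading underscore such as the _SUCCESS marker.
+ table = pq.read_table("spark-warehouse/example_python_materialized_view")
+ print(table.to_pandas())
+ ```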
+
## Spark Connect Only { #spark-connect }
Declarative Pipelines currently only supports Spark Connect.
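
For reference, a Spark Connect session is created with a `sc://` remote URL. A minimal PySpark sketch, assuming a Spark Connect server on the default local port:

```python
from pyspark.sql import SparkSession

# sc://localhost:15002 is an assumption for a locally running
# Spark Connect server; adjust host and port as needed.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```
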
@@ -44,7 +197,15 @@ Exception in thread "main" org.apache.spark.SparkUserAppException: User applicat
The `spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).
- ## Demo
+ ## Dataset Types
+
+ Declarative Pipelines supports the following dataset types (see the Python sketch after this list):
+
+ * **Materialized Views** (datasets) that are published to a catalog
+ * **Tables** that are published to a catalog
+ * **Views** that are not published to a catalog
+
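+ A minimal Python sketch of the three dataset types, using the `pyspark.pipelines` decorators (the dataset names and queries below are made up for illustration):
+
+ ```python
+ from pyspark import pipelines as dp
+ from pyspark.sql import DataFrame, SparkSession
+
+ spark = SparkSession.active()
+
+ # Materialized view: published to the catalog.
+ @dp.materialized_view
+ def ids() -> DataFrame:
+     return spark.range(5)
+
+ # Table: published to the catalog.
+ @dp.table
+ def doubled_ids() -> DataFrame:
+     return spark.read.table("ids").selectExpr("id * 2 AS id")
+
+ # Temporary view: visible to other flows in the pipeline, but not published.
+ @dp.temporary_view
+ def ids_view() -> DataFrame:
+     return spark.read.table("ids")
+ ```
+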
+ ## Demo: Scala API
### Step 1. Register Dataflow Graph
@@ -102,14 +263,6 @@ val updateCtx: PipelineUpdateContext =
updateCtx.pipelineExecution.runPipeline()
```
- ## Dataset Types
-
- Declarative Pipelines supports the following dataset types:
-
- * ** Materialized Views** (datasets) that are published to a catalog
- * ** Table** that are published to a catalog
- * ** Views** that are not published to a catalog
-
## Learning Resources
1. [Spark Declarative Pipelines Programming Guide](https://github.com/apache/spark/blob/master/docs/declarative-pipelines-programming-guide.md)