Commit 1c2b494

[DP] Spark Pipelines CLI and Spark Connect commands

1 parent 3d5627b commit 1c2b494

2 files changed: +102 −3 lines

docs/declarative-pipelines/SparkPipelines.md

61 additions & 2 deletions
@@ -4,12 +4,20 @@ title: SparkPipelines

# SparkPipelines — Spark Pipelines CLI

`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.

`SparkPipelines` is a Scala "launchpad" that executes the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).

## PySpark Pipelines CLI

`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.

The Pipelines CLI supports the following commands (see the sample invocations below):

* [dry-run](#dry-run)
* [init](#init)
* [run](#run)

=== "uv run"

    ```console
    ...
    ```
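For illustration, here is how the three commands can be invoked through the shell script. This is a sketch: the `--name` option of `init` and the project name are assumptions not shown on this page.

```console
$SPARK_HOME/bin/spark-pipelines init --name hello-spark-pipelines
$SPARK_HOME/bin/spark-pipelines dry-run
$SPARK_HOME/bin/spark-pipelines run
```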
@@ -61,3 +69,54 @@

Option | Description | Default
-|-|-
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)
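For example (the dataset names are illustrative, and whether these options can be combined is not covered on this page):

```shell
$SPARK_HOME/bin/spark-pipelines run --full-refresh-all
$SPARK_HOME/bin/spark-pipelines run --refresh orders,customers
```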
When executed, `run` prints out the following log message:

```text
Loading pipeline spec from [spec_path]...
```

`run` loads the pipeline spec.

`run` prints out the following log message:

```text
Creating Spark session...
```

`run` creates a Spark session with the configurations from the pipeline spec.

`run` prints out the following log message:

```text
Creating dataflow graph...
```

`run` sends a `CreateDataflowGraph` command for execution on the Spark Connect server.

!!! note "Spark Connect Server and Command Execution"
    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.

`run` prints out the following log message:

```text
Dataflow graph created (ID: [dataflow_graph_id]).
```

`run` prints out the following log message:

```text
Registering graph elements...
```

`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and calls `register_definitions`.

`run` prints out the following log message:

```text
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
```

`run` sends a `StartRun` command for execution on the Spark Connect server.

In the end, `run` keeps printing out pipeline events from the Spark Connect server.
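The flow above can be summarized in a short, self-contained Python sketch. Only the step order and the log messages come from this page; every helper below is a hypothetical stand-in, not PySpark's actual implementation.

```python
# Hypothetical sketch of the `run` flow; the helpers are stand-ins.
import uuid


def load_pipeline_spec(spec_path: str) -> dict:
    # Stand-in: the real CLI parses the YAML pipeline spec file.
    return {"name": "hello-spark-pipelines", "configuration": {}}


def create_dataflow_graph(spec: dict) -> str:
    # Stand-in for sending CreateDataflowGraph to the Spark Connect
    # server (handled there by PipelinesHandler).
    return str(uuid.uuid4())


def run(spec_path: str, dry: bool = False, full_refresh: str = "",
        full_refresh_all: bool = False, refresh: str = "") -> None:
    print(f"Loading pipeline spec from {spec_path}...")
    spec = load_pipeline_spec(spec_path)

    print("Creating Spark session...")
    # The real CLI applies the `configuration` from the spec here.

    print("Creating dataflow graph...")
    graph_id = create_dataflow_graph(spec)
    print(f"Dataflow graph created (ID: {graph_id}).")

    print("Registering graph elements...")
    # The real CLI creates a SparkConnectGraphElementRegistry and
    # calls register_definitions.

    print(f"Starting run (dry={dry}, full_refresh={full_refresh}, "
          f"full_refresh_all={full_refresh_all}, refresh={refresh})...")
    # The real CLI sends StartRun, then keeps printing pipeline events.


run("pipeline.yml")  # illustrative spec path
```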

docs/declarative-pipelines/index.md

41 additions & 1 deletion
@@ -44,6 +44,30 @@

```python
from pyspark import pipelines as dp
```

## Pipeline Specification File

The heart of a Declarative Pipelines project is a pipeline specification file (in YAML format).

The following fields are supported:

Field Name | Description
-|-
`name` (required) |
`catalog` |
`database` |
`schema` | Alias of `database`. Used unless `database` is defined
`configuration` |
`definitions` | List of `glob`s with `include` path patterns (see the example below)

```yaml
name: hello-spark-pipelines
definitions:
  - glob:
      include: transformations/**/*.py
  - glob:
      include: transformations/**/*.sql
```
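A fuller spec that uses the remaining fields might look like this. All values are illustrative assumptions; only the field names come from the table above.

```yaml
# Illustrative values only; the field names are from the table above.
name: hello-spark-pipelines
catalog: spark_catalog
database: default   # `schema` is an alias, used unless `database` is defined
configuration:
  spark.sql.shuffle.partitions: "1"
definitions:
  - glob:
      include: transformations/**/*.py
  - glob:
      include: transformations/**/*.sql
```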
## Python Decorators for Tables and Flows { #python-decorators }

Declarative Pipelines uses the following [Python decorators](https://peps.python.org/pep-0318/) to describe tables and views:
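The decorator list itself lies beyond this hunk. For a flavor of the programming model, here is a minimal transformation module; it assumes `@dp.materialized_view` is among the decorators and that the CLI has already created the active Spark session (see the `run` walkthrough above):

```python
# transformations/hello.py -- illustrative; assumes @dp.materialized_view
# is among the decorators and that an active Spark session exists.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()


@dp.materialized_view
def hello_materialized_view():
    # The decorated function returns the DataFrame that defines the dataset.
    return spark.range(5)
```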
@@ -198,7 +222,7 @@

Run `spark-pipelines --help` to learn the options.

=== "Command Line"

    ```shell
    $SPARK_HOME/bin/spark-pipelines --help
    ```

!!! note ""
@@ -272,6 +296,22 @@ transformations

1 directory, 2 files

!!! warning "Spark Connect Server should be down"
    `spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless started with the `--remote` option).

    Shut down the Spark Connect Server if you have already started it:

    ```shell
    $SPARK_HOME/sbin/stop-connect-server.sh
    ```

!!! info "`--remote` option"
    Use the `--remote` option to connect to a standalone Spark Connect Server.

    ```shell
    $SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
    ```

=== "Command Line"

    ```shell
    ...
    ```
