# SparkPipelines — Spark Pipelines CLI
`SparkPipelines` is a standalone application that is executed using the [spark-pipelines](./index.md#spark-pipelines) shell script.
`SparkPipelines` is a Scala "launchpad" that executes the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
## PySpark Pipelines CLI
`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using the [spark-pipelines](./index.md#spark-pipelines) shell script.
The Pipelines CLI supports the following commands:
* [dry-run](#dry-run)
* [init](#init)
* [run](#run)
=== "uv run"
    ```console
    ...
    ```
Option | Description | Default
-------|-------------|--------
`--full-refresh` | List of datasets to reset and recompute (comma-separated) | (empty)
`--full-refresh-all` | Perform a full graph reset and recompute | (undefined)
`--refresh` | List of datasets to update (comma-separated) | (empty)
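
The commands and `run` options above can be sketched with `argparse`. This is only an illustrative model of the CLI's shape, not the actual `pyspark/pipelines/cli.py` implementation (which uses its own argument handling):

```python
import argparse

# Illustrative model of the Pipelines CLI surface (not the real cli.py).
parser = argparse.ArgumentParser(prog="spark-pipelines")
subparsers = parser.add_subparsers(dest="command", required=True)

subparsers.add_parser("dry-run", help="Validate the pipeline without running it")
subparsers.add_parser("init", help="Generate a sample pipeline project")

run = subparsers.add_parser("run", help="Run the pipeline")
run.add_argument("--full-refresh", default="",
                 help="List of datasets to reset and recompute (comma-separated)")
run.add_argument("--full-refresh-all", action="store_true",
                 help="Perform a full graph reset and recompute")
run.add_argument("--refresh", default="",
                 help="List of datasets to update (comma-separated)")

args = parser.parse_args(["run", "--refresh", "raw_events,clean_events"])
print(args.command, args.refresh)
```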
When executed, `run` prints out the following log message:

```text
Loading pipeline spec from [spec_path]...
```

`run` loads a pipeline spec.
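
Loading the spec can be pictured as parsing a small pipeline definition file into a typed object. The field names below (`name`, `configuration`, `libraries`) and the use of JSON are assumptions for illustration; the real CLI reads a YAML spec file and its exact schema may differ:

```python
import json
from dataclasses import dataclass, field

# Hypothetical, simplified view of a pipeline spec; the real CLI parses a
# YAML file and the exact schema may differ.
@dataclass
class PipelineSpec:
    name: str
    configuration: dict = field(default_factory=dict)
    libraries: list = field(default_factory=list)

def load_spec(text: str) -> PipelineSpec:
    raw = json.loads(text)  # stand-in for YAML parsing
    return PipelineSpec(
        name=raw["name"],
        configuration=raw.get("configuration", {}),
        libraries=raw.get("libraries", []),
    )

spec = load_spec('{"name": "demo", "configuration": {"spark.sql.shuffle.partitions": "1"}}')
print(f"Loading pipeline spec: {spec.name}")
```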
80
+
81
+
`run` prints out the following log message:
82
+
83
+
```text
84
+
Creating Spark session...
85
+
```
86
+
87
+
`run` creates a Spark session with the configurations from the pipeline spec.
`run` prints out the following log message:

```text
Creating dataflow graph...
```

`run` sends a `CreateDataflowGraph` command for execution on the Spark Connect server.

!!! note "Spark Connect Server and Command Execution"
    `CreateDataflowGraph` and the other pipeline commands are handled by [PipelinesHandler](PipelinesHandler.md) on the Spark Connect server.
`run` prints out the following log message:

```text
Dataflow graph created (ID: [dataflow_graph_id]).
```
`run` prints out the following log message:

```text
Registering graph elements...
```

`run` creates a [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md) and registers the pipeline definitions with `register_definitions`.
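
The register-then-replay pattern behind `register_definitions` can be pictured as decorated dataset functions being recorded in a registry that is later walked element by element. The decorator, class, and function names below are illustrative toys, not the actual pipelines API:

```python
# Toy registry illustrating the register-then-replay pattern; names are
# illustrative, not the actual SparkConnectGraphElementRegistry API.
class GraphElementRegistry:
    def __init__(self):
        self.elements = []

    def register(self, kind, func):
        self.elements.append((kind, func.__name__))

registry = GraphElementRegistry()

def materialized_view(func):
    """Hypothetical decorator that records a dataset definition."""
    registry.register("materialized_view", func)
    return func

@materialized_view
def clean_events():
    return "SELECT * FROM raw_events WHERE valid"

def register_definitions(reg):
    # Walk every recorded element (the real registry sends each one to the
    # Spark Connect server at this point).
    return [f"registered {kind} {name}" for kind, name in reg.elements]

print(register_definitions(registry))
```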
113
+
114
+
`run` prints out the following log message:
115
+
116
+
```text
117
+
Starting run (dry=[dry], full_refresh=[full_refresh], full_refresh_all=[full_refresh_all], refresh=[refresh])...
118
+
```
119
+
120
+
`run` sends a `StartRun` command for execution in the Spark Connect server.
In the end, `run` keeps printing out pipeline events from the Spark Connect server.
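
This final stage can be sketched as a simple loop over a stream of server events, printing each one as it arrives. The event shape and messages below are stand-ins for the actual Spark Connect pipeline event messages:

```python
from typing import Iterator

# Stand-in event stream; the real events arrive from the Spark Connect server.
def pipeline_events() -> Iterator[dict]:
    yield {"timestamp": "2024-01-01T00:00:00", "message": "Flow clean_events is QUEUED."}
    yield {"timestamp": "2024-01-01T00:00:01", "message": "Flow clean_events is RUNNING."}
    yield {"timestamp": "2024-01-01T00:00:02", "message": "Run is COMPLETED."}

lines = []
for event in pipeline_events():
    line = f"{event['timestamp']} {event['message']}"
    lines.append(line)
    print(line)
```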