Explode array values using PySpark

Question 1

I am new to pyspark and I need to explode my array of values in such a way that each value gets assigned to a new column. I tried using explode but I couldn't get the desired output.Below is my output

+---------------+----------+------------------+----------+---------+------------+--------------------+
|account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number| transactions|
+---------------+----------+------------------+----------+---------+------------+--------------------+
| 100000| 12345| 12345| abc| xyz| 1234567890|[1000, 01/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[1100, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[6146, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[253, 03/06/2020,...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[4521, 04/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[955, 05/06/2020,...|
+---------------+----------+------------------+----------+---------+------------+--------------------+

Below is the schema of the program

root
 |-- account_balance: long (nullable = true)
 |-- account_id: long (nullable = true)
 |-- credit_Card_Number: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_number: long (nullable = true)
 |-- transactions: array (nullable = true)
 | |-- element: struct (containsNull = true)
 | | |-- amount: long (nullable = true)
 | | |-- date: string (nullable = true)
 | | |-- shop: string (nullable = true)
 | | |-- transaction_code: string (nullable = true)

I want an output in which I have additional columns of amount,date,shop,transaction_code with their respective values

amount date shop transaction_code
1000 01/06/2020 amazon buy
1100 02/06/2020 amazon sell
6146 02/06/2020 ebay buy
253 03/06/2020 ebay buy 
4521 04/06/2020 amazon buy
955 05/06/2020 amazon buy

Question 2

Use explode and then split the struct fileds, finally drop the newly exploded and transactions array columns.

Example:

from pyspark.sql.functions import *
#got only some columns from json
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*","explode(transactions)").select("*","col.*").drop(*['col','transactions']).show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+

Question 3

Great solution ! Thank you so much !

notNull 31.8k4 gold badges41 silver badges58 bronze badges · Accepted Answer · 2020-06-29 17:59:07Z

Use explode and then split the struct fileds, finally drop the newly exploded and transactions array columns.

Example:

from pyspark.sql.functions import *
#got only some columns from json
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*","explode(transactions)").select("*","col.*").drop(*['col','transactions']).show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+

CollectivesTM on Stack Overflow

Explode array values using PySpark

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related