Desired outcome:
+---------+--------------------------------+
| ID PR   | Related Repeating Deviation(s) |
+---------+--------------------------------+
| 1658503 | 1615764;1639329                |
+---------+--------------------------------+
Is there a way to write such a query in SQL on Databricks without using a user-defined aggregate function (UDAF)? I've tried concat(), GROUP_CONCAT(), and LISTAGG, but none of these works, or they are not supported in Databricks ("This function is neither a registered temporary function nor a permanent function registered in the database 'default'.").
I found this description of user-defined aggregate functions (UDAFs) in the Databricks documentation, but I don't know how to implement it: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-aggregate.html#user-defined-aggregate-functions-udafs&language-sql
Would anybody have a hint or a link for me?
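(Editor's aside: if a UDAF-style route were ever needed, a grouped-aggregate pandas UDF is one way to build one in a Databricks notebook. The following is only a hedged sketch: it assumes Spark 3.x type-hinted pandas UDFs, and join_values is a made-up name, not an existing API.)
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped-aggregate pandas UDF: receives all field_value entries of a
# group as one pd.Series and returns a single ';'-joined string.
@pandas_udf("string")
def join_values(values: pd.Series) -> str:
    return ";".join(values.astype(str))

# Registering it should also make it callable from a %sql cell,
# e.g. SELECT pr_id, join_values(field_value) FROM ... GROUP BY pr_id
spark.udf.register("join_values", join_values)
As the accepted answer below shows, though, no UDAF is actually needed for this task.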
What I have is this basic query:
%sql
SELECT
pr_id,
data_field_nm,
field_value
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
pr_id = 1658503
AND data_field_nm = 'Related Repeating Deviation(s)'
which gives this output:
+---------+--------------------------------+-------------+
| pr_id | data_field_nm | field_value |
+---------+--------------------------------+-------------+
| 1658503 | Related Repeating Deviation(s) | 1615764 |
| 1658503 | Related Repeating Deviation(s) | 1639329 |
+---------+--------------------------------+-------------+
The correct answer is (thanks to @Alex Ott):
%sql
SELECT
pr_id AS IDPR,
concat_ws(';', collect_list(field_value)) AS RelatedRepeatingDeviations
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
data_field_nm = 'Related Repeating Deviation(s)'
AND pr_id = 1658503
GROUP BY
pr_id,
data_field_nm;
which gives the desired outcome:
+---------+-----------------------------+
| IDPR | RelatedRepeatingDeviations |
+---------+-----------------------------+
| 1658503 | 1615764;1639329 |
+---------+-----------------------------+
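For anyone working in a Python cell rather than %sql, here is a minimal DataFrame-API equivalent of that query (a sketch, assuming the same table is reachable via spark.table in the notebook):
from pyspark.sql import functions as F

result = (
    spark.table("gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl")
    # same filters as the SQL version
    .where(F.col("data_field_nm") == "Related Repeating Deviation(s)")
    .where(F.col("pr_id") == 1658503)
    .groupBy("pr_id", "data_field_nm")
    # collect all field_value entries per group and join them with ';'
    .agg(F.concat_ws(";", F.collect_list("field_value")).alias("RelatedRepeatingDeviations"))
    .withColumnRenamed("pr_id", "IDPR")
)
result.show(truncate=False)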
Comments:
- not a udaf aspect – Ged, Feb 9, 2021 at 16:08
- can u show code around it all pls – Ged, Feb 9, 2021 at 16:09
- added basic query for more context – Wondarar, Feb 9, 2021 at 16:17
- so you have n rows and want to make one row from them. pls show input and expected output – Ged, Feb 9, 2021 at 16:20
- I've updated the question with all info I have, see query and its output. – Wondarar, Feb 9, 2021 at 16:42
1 Answer
Just use GROUP BY with collect_list and concat_ws, like this:
- get the data:
from pyspark.sql import Row

# recreate the two input rows as a small test DataFrame
df = spark.createDataFrame([Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1615764}),
                            Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1639329})])
# expose it to SQL as a temporary view named "abc"
df.createOrReplaceTempView("abc")
- and do the query:
%sql
select pr_id,
data_field_nm,
concat_ws(';', collect_list(field_value)) as combined
from abc
group by pr_id, data_field_nm
Note that this gives the combined column the fixed name combined; alias it as needed, as in the accepted query above.
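One caveat worth adding: collect_list makes no ordering guarantee, so the values may be concatenated in any order. If a deterministic order matters, sorting the collected array first should help; a minimal sketch against the abc view from above (the cast to string keeps concat_ws happy and makes the sort order explicit):
from pyspark.sql import functions as F

combined = (
    spark.table("abc")
    .groupBy("pr_id", "data_field_nm")
    # sort the collected values before joining, so the output
    # string is deterministic, e.g. "1615764;1639329"
    .agg(F.concat_ws(";", F.sort_array(
        F.collect_list(F.col("field_value").cast("string")))).alias("combined"))
)
combined.show(truncate=False)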