Desired outcome:
+---------+--------------------------------+
| ID PR   | Related Repeating Deviation(s) |
+---------+--------------------------------+
| 1658503 | 1615764;1639329                |
+---------+--------------------------------+
Is there a way to write such a query in SQL on Databricks without using a user-defined aggregate function (UDAF)? I've tried concat(), GROUP_CONCAT(), and LISTAGG, but none of these works, or they are not supported in Databricks ("This function is neither a registered temporary function nor a permanent function registered in the database 'default'.").
I found this description of user-defined aggregate functions (UDAFs) in the Databricks documentation, but I don't know how to implement it: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-aggregate.html#user-defined-aggregate-functions-udafs&language-sql
Would anybody have a hint or a link for me?
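(Editor's aside: if a UDAF-style route were ever needed, a grouped-aggregate pandas UDF is one way to build one in a Databricks notebook. The following is only a hedged sketch: it assumes Spark 3.x type-hinted pandas UDFs, and join_values is a made-up name, not an existing API.)
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped-aggregate pandas UDF: receives all field_value entries of a
# group as one pd.Series and returns a single ';'-joined string.
@pandas_udf("string")
def join_values(values: pd.Series) -> str:
    return ";".join(values.astype(str))

# Registering it should also make it callable from a %sql cell,
# e.g. SELECT pr_id, join_values(field_value) FROM ... GROUP BY pr_id
spark.udf.register("join_values", join_values)
As the accepted answer below shows, though, no UDAF is actually needed for this task.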
What I have is this basic query:
%sql
SELECT
pr_id,
data_field_nm,
field_value
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
pr_id = 1658503
AND data_field_nm = 'Related Repeating Deviation(s)'
which gives this output:
+---------+--------------------------------+-------------+
| pr_id | data_field_nm | field_value |
+---------+--------------------------------+-------------+
| 1658503 | Related Repeating Deviation(s) | 1615764 |
| 1658503 | Related Repeating Deviation(s) | 1639329 |
+---------+--------------------------------+-------------+
The correct answer is (thanks to @Alex Ott):
%sql
SELECT
pr_id AS IDPR,
concat_ws(';', collect_list(field_value)) AS RelatedRepeatingDeviations
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
data_field_nm = 'Related Repeating Deviation(s)'
AND pr_id = 1658503
GROUP BY
pr_id,
data_field_nm;
which gives the desired outcome:
+---------+-----------------------------+
| IDPR | RelatedRepeatingDeviations |
+---------+-----------------------------+
| 1658503 | 1615764;1639329 |
+---------+-----------------------------+
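For anyone working in a Python cell rather than %sql, here is a minimal DataFrame-API equivalent of that query (a sketch, assuming the same table is reachable via spark.table in the notebook):
from pyspark.sql import functions as F

result = (
    spark.table("gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl")
    # same filters as the SQL version
    .where(F.col("data_field_nm") == "Related Repeating Deviation(s)")
    .where(F.col("pr_id") == 1658503)
    .groupBy("pr_id", "data_field_nm")
    # collect all field_value entries per group and join them with ';'
    .agg(F.concat_ws(";", F.collect_list("field_value")).alias("RelatedRepeatingDeviations"))
    .withColumnRenamed("pr_id", "IDPR")
)
result.show(truncate=False)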
Comments:
- not a udaf aspect – Ged, Feb 9, 2021 at 16:08
- can u show code around it all pls – Ged, Feb 9, 2021 at 16:09
- added basic query for more context – Wondarar, Feb 9, 2021 at 16:17
- so you have n rows and want to make one row from them. pls show input and expected output – Ged, Feb 9, 2021 at 16:20
- I've updated the question with all info I have, see query and its output. – Wondarar, Feb 9, 2021 at 16:42
1 Answer
Just use GROUP BY with collect_list and concat_ws, like this:
- get the data:
from pyspark.sql import Row

# recreate the two input rows as a small test DataFrame
df = spark.createDataFrame([Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1615764}),
                            Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1639329})])
# expose it to SQL as a temporary view named "abc"
df.createOrReplaceTempView("abc")
- and do the query:
%sql
select pr_id,
data_field_nm,
concat_ws(';', collect_list(field_value)) as combined
from abc
group by pr_id, data_field_nm
Note that this gives the combined column the fixed name combined; alias it as needed, as in the accepted query above.
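One caveat worth adding: collect_list makes no ordering guarantee, so the values may be concatenated in any order. If a deterministic order matters, sorting the collected array first should help; a minimal sketch against the abc view from above (the cast to string keeps concat_ws happy and makes the sort order explicit):
from pyspark.sql import functions as F

combined = (
    spark.table("abc")
    .groupBy("pr_id", "data_field_nm")
    # sort the collected values before joining, so the output
    # string is deterministic, e.g. "1615764;1639329"
    .agg(F.concat_ws(";", F.sort_array(
        F.collect_list(F.col("field_value").cast("string")))).alias("combined"))
)
combined.show(truncate=False)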