How to remove duplicate rows based on array column subset relationship?

Question 1

I have a DolphinDB table with an array vector column. I need to remove duplicate rows based on subset relationships within that column.

Sample Input:

sym	prices
a	`[3,4,5,6]`
a	`[3,4,5]`
a	`[2,4,5,6]`
a	`[5,6]`
a	`[7,9]`
a	`[7,9]`

Expected Output:

sym	prices
a	`[3,4,5,6]`
a	`[2,4,5,6]`
a	`[7,9]`

Deduplication Logic:

Subset Removal: If a row's prices array is a subset (i.e., fully contained) of another row's prices array, remove the subset row. In the example, [3,4,5] is a subset of [3,4,5,6], so it is removed; similarly, [5,6] is also a subset of [3,4,5,6] and is removed.
Full Duplicate Removal: If multiple rows have identical prices arrays, keep only one.

What I've Tried:

I considered using group by to remove exact duplicates, but this approach cannot handle subset relationships.

Core Question:
How can I perform this subset-based deduplication?

Answer

Disclaimer: I don't know DolphinDB.

You want to remove proper subsets from the table. According to the docs (https://docs.dolphindb.com/en/Programming/Operators/OperatorReferences/lt.html), you can use the less-than operator (<) for this:

delete from mytable subset
where exists
(
 select *
 from mytable superset
 where subset.prices < superset.prices
);

(If you only want to compare price vectors for the same sym, you must of course add `and subset.sym = superset.sym` to the subquery.)
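To make the intended semantics concrete, here is a minimal sketch in Python (not DolphinDB) of what the statement above is meant to do on the sample data, with the same-sym condition included. It assumes each prices array can be treated as a set, so element order and repeated values are ignored; the helper name `has_proper_superset` is made up for this illustration.

```python
# Illustration in Python, not DolphinDB: mirrors the "where exists" predicate.
# Assumption: a prices array is compared as a set (order and repeats ignored).
rows = [
    ("a", [3, 4, 5, 6]),
    ("a", [3, 4, 5]),
    ("a", [2, 4, 5, 6]),
    ("a", [5, 6]),
    ("a", [7, 9]),
    ("a", [7, 9]),
]


def has_proper_superset(row, table):
    """True if some row with the same sym strictly contains this row's prices."""
    sym, prices = row
    return any(
        other_sym == sym and set(prices) < set(other_prices)  # "<" is proper subset
        for other_sym, other_prices in table
    )


# Keep every row that is not a proper subset of another row.
kept = [row for row in rows if not has_proper_superset(row, rows)]
for sym, prices in kept:
    print(sym, prices)
```

In this sketch both `[7,9]` rows survive, because equal sets are not proper subsets of each other. That is why exact duplicates need the separate treatment described next.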

You also want to remove duplicate sets and keep only one. For this you'll need an additional condition for equal sets (=), but then you'll also need some ID to tell one row from another. Some DBMS have a built-in unique row ID. I don't know whether DolphinDB does, so you may need a custom ID column in your table. Then you can extend the statement above as follows:

delete from mytable subset
where exists
(
 select *
 from mytable superset
 where subset.prices < superset.prices
 or (subset.prices = superset.prices and subset.id < superset.id)
);
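The effect of the ID tie-breaker can be checked with a small Python sketch (again not DolphinDB). The `id` values are hypothetical row numbers added for illustration, the function name `should_delete` is made up, the same-sym condition is included, and prices arrays are assumed to compare as sets.

```python
# Illustration in Python, not DolphinDB: mirrors the extended predicate.
# Assumption: ids are hypothetical row numbers; prices are compared as sets.
rows = [
    (1, "a", [3, 4, 5, 6]),
    (2, "a", [3, 4, 5]),
    (3, "a", [2, 4, 5, 6]),
    (4, "a", [5, 6]),
    (5, "a", [7, 9]),
    (6, "a", [7, 9]),
]


def should_delete(row, table):
    """True if another same-sym row is a proper superset, or an equal set with a larger id."""
    row_id, sym, prices = row
    mine = set(prices)
    for other_id, other_sym, other_prices in table:
        if other_sym != sym:
            continue
        theirs = set(other_prices)
        if mine < theirs or (mine == theirs and row_id < other_id):
            return True
    return False


survivors = [row for row in rows if not should_delete(row, rows)]
for row_id, sym, prices in survivors:
    print(row_id, sym, prices)
```

With `subset.id < superset.id`, the row with the largest id in each group of identical sets is the one that remains (id 6 here). Flipping the comparison to `>` would keep the smallest id instead.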

Thorsten Kettner · Accepted Answer · 2025-11-20 22:01:00Z
