I have a DolphinDB table with an array vector column. I need to remove duplicate rows based on subset relationships within that column.
Sample Input:
| sym | prices |
|---|---|
| a | [3,4,5,6] |
| a | [3,4,5] |
| a | [2,4,5,6] |
| a | [5,6] |
| a | [7,9] |
| a | [7,9] |
Expected Output:
| sym | prices |
|---|---|
| a | [3,4,5,6] |
| a | [2,4,5,6] |
| a | [7,9] |
Deduplication Logic:
Subset Removal: If a row's prices array is a subset (i.e., fully contained) of another row's prices array, remove the subset row. In the example, [3,4,5] is a subset of [3,4,5,6], so it is removed; similarly, [5,6] is also a subset of [3,4,5,6] and is removed.
Full Duplicate Removal: If multiple rows have identical prices arrays, keep only one.
What I've Tried:
I considered using group by to remove exact duplicates, but this approach cannot handle subset relationships.
Core Question:
How can I perform this subset-based deduplication?
Answer:
Disclaimer: I don't know DolphinDB.
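You want to remove proper subsets from the table. According to the docs (https://docs.dolphindb.com/en/Programming/Operators/OperatorReferences/lt.html), the less-than operator checks whether one set is a proper subset of another, so, assuming it can be applied to your prices values in this way, you can use it for this: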
    delete from mytable subset
    where exists
    (
      select *
      from mytable superset
      where subset.prices < superset.prices
    );
(If you only want to compare price vectors for the same sym, you must of course add `and subset.sym = superset.sym` to the subquery.)
You also want to remove duplicate sets and keep only one. For this you'll need an additional condition for equal sets (=), but then you'll also need some ID to tell one row from the other. Some DBMSs have a built-in unique row ID. I don't know whether DolphinDB does, so you may need a custom ID column in your table. Then you can extend the statement above as follows:
    delete from mytable subset
    where exists
    (
      select *
      from mytable superset
      where subset.prices < superset.prices
         or (subset.prices = superset.prices and subset.id < superset.id)
    );
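Since I can't test this in DolphinDB, here is a minimal sketch in plain Python (not DolphinDB) that applies the same rule to the sample data, just to check the logic. The ids 1 to 6 are hypothetical, standing in for the custom ID column assumed above, and the same-sym condition is included.

```python
# Illustration only: plain Python, not DolphinDB.
# Each row is (id, sym, prices); the ids are hypothetical.
rows = [
    (1, "a", [3, 4, 5, 6]),
    (2, "a", [3, 4, 5]),
    (3, "a", [2, 4, 5, 6]),
    (4, "a", [5, 6]),
    (5, "a", [7, 9]),
    (6, "a", [7, 9]),
]

def is_redundant(row, others):
    """Return True if some other row with the same sym makes this row removable."""
    row_id, sym, prices = row
    for other_id, other_sym, other_prices in others:
        if other_sym != sym:
            continue
        # On Python sets, < means "proper subset".
        if set(prices) < set(other_prices):
            return True
        # Equal sets: only the row with the highest id survives.
        if set(prices) == set(other_prices) and row_id < other_id:
            return True
    return False

kept = [row for row in rows if not is_redundant(row, rows)]
for row in kept:
    print(row)
```

With the sample data this keeps the rows with ids 1, 3 and 6, which matches the expected output. Note that converting to sets ignores element order and repeated values within a row; if those matter to you, the comparison would need to be adapted.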