I'm trying to write a query to detect possibly-invalid data in a PostgreSQL table. We have a table of city names like this:
# `city_names`
id | name | language | dialect | city_id
------------------------------------------------
01 | London | A | A1 | 1
02 | London | A | A2 | 1
03 | London | B | B1 | 2
04 | London | B | B2 | 3
In our domain:
- It's fine that rows 01 and 02 both map "London" to city 1; the dialects don't happen to differ
- It's fine that row 03 maps "London" to city 2; in that language, the name may refer to a different city
- It's suspicious that row 04 maps "London" to city 3, because we already have a mapping to city 2 in the same language
I want to write a query that selects only rows 03 and 04 so that a human can decide whether one of them points to the wrong city.
I can solve this problem procedurally, but I'm having trouble doing it in SQL. For example, if I GROUP BY
language and name, I lose the city_id
values from the individual rows.
Basically my goal is: "If there's more than one city_id for the same name and language, list those city_ids."
How can I do this?
-
A procedural solution is: make a hash. For each row, ensure the hash has a key with its name and language, like "London-A", and a value for the set of all city_ids matching that. Insert the row's city id. At the end, list the hash keys where the set has more than one item. (Of course, this is horribly slow.)Nathan Long– Nathan Long2014年09月29日 21:23:09 +00:00Commented Sep 29, 2014 at 21:23
1 Answer 1
This query will do. The trick is to use COUNT(DISTINCT city_id)
:
PostgreSQL 8.3.20 Schema Setup:
CREATE TABLE Table1
("id" int, "name" varchar(6), "language" varchar(1), "dialect" varchar(2), "city_id" int)
;
INSERT INTO Table1
("id", "name", "language", "dialect", "city_id")
VALUES
(01, 'London', 'A', 'A1', 1),
(02, 'London', 'A', 'A2', 1),
(03, 'London', 'B', 'B1', 2),
(04, 'London', 'B', 'B2', 3)
;
Query 1:
select t.*, d.dups
FROM table1 t INNER JOIN
(
select name, language, count(distinct city_id) as dups
from table1
group by name, language
having count(distinct city_id) > 1
) d
ON t.name = d.name and t.language = d.language
Results :
| ID | NAME | LANGUAGE | DIALECT | CITY_ID | DUPS |
|----|--------|----------|---------|---------|------|
| 3 | London | B | B1 | 2 | 2 |
| 4 | London | B | B2 | 3 | 2 |
-
Awesome! I ended up changing this slightly to create a temporary table
suspicious_city_names
for the inner query, then join to it. I think this makes the query a bit easier to follow. Thanks for your help!Nathan Long– Nathan Long2014年09月30日 13:36:26 +00:00Commented Sep 30, 2014 at 13:36