I'm trying to write a query to detect possibly-invalid data in a PostgreSQL table. We have a table of city names like this:

Table `city_names`:

id | name   | language | dialect | city_id
---|--------|----------|---------|--------
01 | London | A        | A1      | 1
02 | London | A        | A2      | 1
03 | London | B        | B1      | 2
04 | London | B        | B2      | 3

In our domain:

  • It's fine that rows 01 and 02 both map "London" to city 1; the dialects don't happen to differ
  • It's fine that row 03 maps "London" to city 2; in that language, the name may refer to a different city
  • It's suspicious that row 04 maps "London" to city 3, because we already have a mapping to city 2 in the same language

I want to write a query that selects only rows 03 and 04 so that a human can decide whether one of them points to the wrong city.

I can solve this problem procedurally, but I'm having trouble doing it in SQL. For example, if I GROUP BY language and name, I lose the city_id values from the individual rows.

Basically my goal is: "If there's more than one city_id for the same name and language, list those city_ids."

How can I do this?

asked Sep 29, 2014 at 21:20
  • A procedural solution is: make a hash. For each row, ensure the hash has a key with its name and language, like "London-A", and a value for the set of all city_ids matching that. Insert the row's city id. At the end, list the hash keys where the set has more than one item. (Of course, this is horribly slow.) Commented Sep 29, 2014 at 21:23

1 Answer

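This query will do it. The trick is to use COUNT(DISTINCT city_id) per name and language, keep only the groups where that count is greater than 1, and join back to the table to get the individual rows: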

SQL Fiddle

PostgreSQL 8.3.20 Schema Setup:

CREATE TABLE Table1
 ("id" int, "name" varchar(6), "language" varchar(1), "dialect" varchar(2), "city_id" int)
;
INSERT INTO Table1
 ("id", "name", "language", "dialect", "city_id")
VALUES
 (01, 'London', 'A', 'A1', 1),
 (02, 'London', 'A', 'A2', 1),
 (03, 'London', 'B', 'B1', 2),
 (04, 'London', 'B', 'B2', 3)
;

Query 1:

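-- Inner query: (name, language) pairs that map to more than one city_id.
-- Outer query: join back to return the individual rows for review.
SELECT t.*, d.dups
FROM table1 t
INNER JOIN (
    SELECT name, language, count(DISTINCT city_id) AS dups
    FROM table1
    GROUP BY name, language
    HAVING count(DISTINCT city_id) > 1
) d
  ON t.name = d.name AND t.language = d.language;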

Results:

| ID | NAME | LANGUAGE | DIALECT | CITY_ID | DUPS |
|----|--------|----------|---------|---------|------|
| 3 | London | B | B1 | 2 | 2 |
| 4 | London | B | B2 | 3 | 2 |
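
If a one-row-per-group summary is enough ("list those city_ids"), a variant along these lines might work. This is only a sketch: it reuses the Table1 setup from above and assumes PostgreSQL 8.4 or later, since array_agg is not available on the 8.3 version the fiddle runs on.

```sql
-- Sketch only: one row per suspicious (name, language) pair,
-- with the conflicting city_ids collected into an array.
-- Assumes PostgreSQL 8.4+ (array_agg) and the Table1 schema above.
SELECT name,
       language,
       array_agg(DISTINCT city_id) AS city_ids
FROM table1
GROUP BY name, language
HAVING count(DISTINCT city_id) > 1;
```

With the sample data this should return a single row for London / B containing city_ids 2 and 3. Rows with a NULL city_id are ignored by count(DISTINCT ...), so they never make a group look suspicious.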
answered Sep 29, 2014 at 22:17
  • Awesome! I ended up changing this slightly to create a temporary table suspicious_city_names for the inner query, then join to it. I think this makes the query a bit easier to follow. Thanks for your help! Commented Sep 30, 2014 at 13:36
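
For readers who want to try the temporary-table variant described in this comment, it could look roughly like the following. This is a guess at the shape, not the commenter's actual code; it assumes the real table is city_names (as in the question) and keeps the column names from the answer.

```sql
-- Sketch only: materialize the suspicious (name, language) pairs first.
-- Assumes a table city_names(id, name, language, dialect, city_id).
CREATE TEMPORARY TABLE suspicious_city_names AS
SELECT name, language, count(DISTINCT city_id) AS dups
FROM city_names
GROUP BY name, language
HAVING count(DISTINCT city_id) > 1;

-- Then join back to list the individual rows for a human to review.
SELECT c.*, s.dups
FROM city_names c
INNER JOIN suspicious_city_names s
  ON c.name = s.name AND c.language = s.language
ORDER BY c.name, c.language, c.city_id;
```

The temporary table is dropped automatically at the end of the session, so it needs to be recreated whenever the check is rerun against fresh data.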
