Top Python badged users from NYC

Question 1

I forked this StackExchange Data Explorer (SEDE) query from another one, and made it a bit more complex. I'm looking for feedback on my style, as well as answers to other questions (see below).

-- top users: NYC
-- forked from Avinash Raj's query for Chennai
select
 row_number() over(order by u.Reputation desc) as [#], 
 u.Id as [User Link], 
 u.Reputation,
 u.Location 
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
 Users u
 join 
 Badges b
 on
 u.Id = b.UserId
where
 b.name = N'python'
 and
 b.class = 3 -- 1 is gold, 2: silver, 3: bronze
 and
 u.Reputation >= 1000 
 and 
 (
 lower(u.Location) like '%nyc%'
 or upper(Location) like '%NY, NY%'
 or lower(Location) like '%staten%'
 or lower(Location) like '%bronx%'
 or lower(Location) like '%queens%'
 or lower(Location) like '%new york, ny%'
 or lower(Location) like '%manhattan%'
 or lower(Location) like '%brooklyn%'
 )
order by
 u.Reputation desc;

Questions:

Would this sql work with all relational DB's that accept SQL?
If not, what is non-standard? How would I standardize it?
I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?
Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?

I have just got some feedback on using table identifiers so that people don't have to guess, so I'll try to do that from now on. Am I doing it anywhere that's silly?

I was asked about allowing user input for the badge name - Yes, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don't have a clear idea how to implement mapping gold/silver/bronze to 1/2/3. I also have false positives and was hoping someone could point me at a more concise and/or accurate way to filter.

Question 2

Why did you make the badge name a constant, instead of allowing the user to enter whatever badge name they want?

Question 3

Yep, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don't have a clear idea how to implement mapping gold/silver/bronze to 1/2/3.

Question 4

Does your database ignore case in comparisons? I know that SQL Server by default treats "Brooklyn" and "brooklyn" as equal.

Question 5

SEDE does not ignore case, unfortunately. I'm not sure why Stack Exchange set it up like that.

Question 6

@Phrancis, I only mentioned it because one of the question tags is sql-server. I've never heard of SEDE.

Question 7

Questions

First to address some of your questions:

Would this sql work with all relational DB's that accept SQL?

No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example, row_number() over(order by u.Reputation desc) only works with Transact-SQL language, it would fail with syntax error on every other DBMS that I know of.

The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.

If not, what is non-standard? How would I standardize it?

See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.

I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?

No, it is not problematic. Just be consistent with whatever style you use.

Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?

The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to lower() and upper() are needed because SEDE has case-sensitivity enabled.

Review

from -- denormalize the Users with their Badges, i.e., flatten into a single table.
 Users u
 join 
 Badges b

This from clause is a bit ugly. The aliases u and b are not very good names. It's a good habit to use meaningful names for aliases, which are meant to be identifiers rather than just a way to shorten the references. Here with having such short table names, an alias seems like overkill. It is also a good habit to use the as keyword and to specify the schema (although in the case of SEDE they are all the same schema, some other databases have tables organized in multiple schemes and it can become ambiguous).

from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId

[Rank] would make a better column name than [#].

Your order by clause at the end is redundant, you are already sorting in your row_number() clause, so no need to sort them again. orber by is expensive so you would want to limit the amount of ordering as much as you can.

We can extract the badge name and minimum reputation into variables:

declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;

And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.

I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the min() aggregate and resulting group by clauses:

case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],

Now your result set will look like this:

Rank User Link Reputation Highest Badge Location 
---- ----------------------- ---------- ------------- --------------------------- 
1 cletus 368937 Bronze New York, NY 
2 Triptych 96145 Gold NYC 
3 Claudiu 81412 Gold New York, NY 
4 Ben Hoffstein 59650 Bronze New York, NY 
5 Yuji 'Tomita' Tomita 59318 Gold New York, NY 
6 danben 42686 Bronze New York, NY 
7 David Robinson 42489 Gold New York, NY 
8 Larry Lustig 33544 Bronze New York, NY 
9 Aaron Hall 32035 Gold New York, NY, United States 
10 chown 31348 Silver New York, NY

Finally, we can use a bit of SEDE magic to parametrize the query:

-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;

This will let you search other badges and different rep thresholds.

Everything combined (demo):

-- top users in NYC region for desired tag (default Python)
-- forked from Avinash Raj's query for Chennai
-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;
select
 row_number() over(order by usr.Reputation desc) as [Rank], 
 usr.Id as [User Link], 
 usr.Reputation,
 case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],
 usr.Location 
from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId
where
 bdg.name = @targetBadgeName
 and
 usr.Reputation >= @minimumReputation
 and 
 (
 lower(usr.Location) like '%nyc%'
 or upper(usr.Location) like '%NY, NY%'
 or lower(usr.Location) like '%staten%'
 or lower(usr.Location) like '%bronx%'
 or lower(usr.Location) like '%queens%'
 or lower(usr.Location) like '%new york, ny%'
 or lower(usr.Location) like '%manhattan%'
 or lower(usr.Location) like '%brooklyn%'
 )
group by
 usr.Id,
 usr.Reputation,
 usr.Location;

Question 8

Dang, nice answer. You didn't really seem to change my indentation, is that ok? Also, can one factor out the lower(usr.Location) part?

Question 9

Your indentation seems fine to me. As for the lower() function calls, SEDE is setup to be case-sensitive, so there is not really an efficient workaround. It's a good thing those function calls are inexpensive though.

Question 10

In my opinion, your where clauses take up too much vertical space and can be made more maintainable. Compare what you have with this:

where
 b.name = N'python'
 and b.class = 3 -- 1 is gold, 2: silver, 3: bronze
 and u.Reputation >= 1000 
 and 
 (
 lower(u.Location) like '%nyc%'
 or upper(Location) like '%NY, NY%'
 or lower(Location) like '%staten%'
 or lower(Location) like '%bronx%'
 or lower(Location) like '%queens%'
 or lower(Location) like '%new york, ny%'
 or lower(Location) like '%manhattan%'
 or lower(Location) like '%brooklyn%'
 )

Suppose you're working on this query and you want to remove the Reputation condition to see what happens. In your version, you'd have to comment out two lines. In mine you'd only have to comment out one, although this doesn't work with your large and condition, nor does it work with your first condition (the only one that doesn't start with and). Note that it's a matter of convention whether your conditions put the and at the beginning or the end of the line.

I've seen some people write their where clauses like this:

where
 1=1
 and b.name = N'python'
 ...

In this way, it's easy to comment any single-line conditions: you just put a -- at the front. This is a step too far for me. I just put the conditions I'm most sure of at the top.

Question 11

where 1=1 is OK when developing code but it infuriates me when I see that in production code!

Phrancis Phrancis 20.5k6 gold badges69 silver badges155 bronze badges · Accepted Answer · 2016-01-26 18:48:48Z

Questions

First to address some of your questions:

Would this sql work with all relational DB's that accept SQL?

No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example, row_number() over(order by u.Reputation desc) only works with Transact-SQL language, it would fail with syntax error on every other DBMS that I know of.

The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.

If not, what is non-standard? How would I standardize it?

See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.

I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?

No, it is not problematic. Just be consistent with whatever style you use.

Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?

The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to lower() and upper() are needed because SEDE has case-sensitivity enabled.

Review

from -- denormalize the Users with their Badges, i.e., flatten into a single table.
 Users u
 join 
 Badges b

This from clause is a bit ugly. The aliases u and b are not very good names. It's a good habit to use meaningful names for aliases, which are meant to be identifiers rather than just a way to shorten the references. Here with having such short table names, an alias seems like overkill. It is also a good habit to use the as keyword and to specify the schema (although in the case of SEDE they are all the same schema, some other databases have tables organized in multiple schemes and it can become ambiguous).

from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId

[Rank] would make a better column name than [#].

Your order by clause at the end is redundant, you are already sorting in your row_number() clause, so no need to sort them again. orber by is expensive so you would want to limit the amount of ordering as much as you can.

We can extract the badge name and minimum reputation into variables:

declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;

And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.

I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the min() aggregate and resulting group by clauses:

case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],

Now your result set will look like this:

Rank User Link Reputation Highest Badge Location 
---- ----------------------- ---------- ------------- --------------------------- 
1 cletus 368937 Bronze New York, NY 
2 Triptych 96145 Gold NYC 
3 Claudiu 81412 Gold New York, NY 
4 Ben Hoffstein 59650 Bronze New York, NY 
5 Yuji 'Tomita' Tomita 59318 Gold New York, NY 
6 danben 42686 Bronze New York, NY 
7 David Robinson 42489 Gold New York, NY 
8 Larry Lustig 33544 Bronze New York, NY 
9 Aaron Hall 32035 Gold New York, NY, United States 
10 chown 31348 Silver New York, NY

Finally, we can use a bit of SEDE magic to parametrize the query:

-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;

This will let you search other badges and different rep thresholds.

Everything combined (demo):

-- top users in NYC region for desired tag (default Python)
-- forked from Avinash Raj's query for Chennai
-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;
select
 row_number() over(order by usr.Reputation desc) as [Rank], 
 usr.Id as [User Link], 
 usr.Reputation,
 case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],
 usr.Location 
from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId
where
 bdg.name = @targetBadgeName
 and
 usr.Reputation >= @minimumReputation
 and 
 (
 lower(usr.Location) like '%nyc%'
 or upper(usr.Location) like '%NY, NY%'
 or lower(usr.Location) like '%staten%'
 or lower(usr.Location) like '%bronx%'
 or lower(usr.Location) like '%queens%'
 or lower(usr.Location) like '%new york, ny%'
 or lower(usr.Location) like '%manhattan%'
 or lower(usr.Location) like '%brooklyn%'
 )
group by
 usr.Id,
 usr.Reputation,
 usr.Location;

Dang, nice answer. You didn't really seem to change my indentation, is that ok? Also, can one factor out the lower(usr.Location) part?
Your indentation seems fine to me. As for the lower() function calls, SEDE is setup to be case-sensitive, so there is not really an efficient workaround. It's a good thing those function calls are inexpensive though.

Stack Exchange Network

Top Python badged users from NYC

2 Answers 2

Questions

Review

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Top Python badged users from NYC

2 Answers 2

Questions

Review

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions