7
\$\begingroup\$

I forked this StackExchange Data Explorer (SEDE) query from another one, and made it a bit more complex. I'm looking for feedback on my style, as well as answers to other questions (see below).

-- top users: NYC
-- forked from Avinash Raj's query for Chennai
select
 row_number() over(order by u.Reputation desc) as [#], 
 u.Id as [User Link], 
 u.Reputation,
 u.Location 
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
 Users u
 join 
 Badges b
 on
 u.Id = b.UserId
where
 b.name = N'python'
 and
 b.class = 3 -- 1 is gold, 2: silver, 3: bronze
 and
 u.Reputation >= 1000 
 and 
 (
 lower(u.Location) like '%nyc%'
 or upper(Location) like '%NY, NY%'
 or lower(Location) like '%staten%'
 or lower(Location) like '%bronx%'
 or lower(Location) like '%queens%'
 or lower(Location) like '%new york, ny%'
 or lower(Location) like '%manhattan%'
 or lower(Location) like '%brooklyn%'
 )
order by
 u.Reputation desc;

Questions:

  • Would this sql work with all relational DB's that accept SQL?
  • If not, what is non-standard? How would I standardize it?
  • I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?
  • Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?

I have just got some feedback on using table identifiers so that people don't have to guess, so I'll try to do that from now on. Am I doing it anywhere that's silly?

I was asked about allowing user input for the badge name - Yes, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don't have a clear idea how to implement mapping gold/silver/bronze to 1/2/3. I also have false positives and was hoping someone could point me at a more concise and/or accurate way to filter.

asked Jan 25, 2016 at 14:22
\$\endgroup\$
6
  • \$\begingroup\$ Why did you make the badge name a constant, instead of allowing the user to enter whatever badge name they want? \$\endgroup\$ Commented Jan 25, 2016 at 14:36
  • \$\begingroup\$ Yep, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don't have a clear idea how to implement mapping gold/silver/bronze to 1/2/3. \$\endgroup\$ Commented Jan 25, 2016 at 14:49
  • \$\begingroup\$ Does your database ignore case in comparisons? I know that SQL Server by default treats "Brooklyn" and "brooklyn" as equal. \$\endgroup\$ Commented Jan 26, 2016 at 20:58
  • \$\begingroup\$ SEDE does not ignore case, unfortunately. I'm not sure why Stack Exchange set it up like that. \$\endgroup\$ Commented Jan 26, 2016 at 21:22
  • \$\begingroup\$ @Phrancis, I only mentioned it because one of the question tags is sql-server. I've never heard of SEDE. \$\endgroup\$ Commented Jan 26, 2016 at 21:58

2 Answers 2

7
\$\begingroup\$

Questions

First to address some of your questions:

  1. Would this sql work with all relational DB's that accept SQL?

No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example, row_number() over(order by u.Reputation desc) only works with Transact-SQL language, it would fail with syntax error on every other DBMS that I know of.

The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.

  1. If not, what is non-standard? How would I standardize it?

See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.

  1. I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?

No, it is not problematic. Just be consistent with whatever style you use.

  1. Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?

The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to lower() and upper() are needed because SEDE has case-sensitivity enabled.


Review

from -- denormalize the Users with their Badges, i.e., flatten into a single table.
 Users u
 join 
 Badges b

This from clause is a bit ugly. The aliases u and b are not very good names. It's a good habit to use meaningful names for aliases, which are meant to be identifiers rather than just a way to shorten the references. Here with having such short table names, an alias seems like overkill. It is also a good habit to use the as keyword and to specify the schema (although in the case of SEDE they are all the same schema, some other databases have tables organized in multiple schemes and it can become ambiguous).

from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId

[Rank] would make a better column name than [#].


Your order by clause at the end is redundant, you are already sorting in your row_number() clause, so no need to sort them again. orber by is expensive so you would want to limit the amount of ordering as much as you can.


We can extract the badge name and minimum reputation into variables:

declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;

And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.

I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the min() aggregate and resulting group by clauses:

case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],

Now your result set will look like this:

Rank User Link Reputation Highest Badge Location 
---- ----------------------- ---------- ------------- --------------------------- 
1 cletus 368937 Bronze New York, NY 
2 Triptych 96145 Gold NYC 
3 Claudiu 81412 Gold New York, NY 
4 Ben Hoffstein 59650 Bronze New York, NY 
5 Yuji 'Tomita' Tomita 59318 Gold New York, NY 
6 danben 42686 Bronze New York, NY 
7 David Robinson 42489 Gold New York, NY 
8 Larry Lustig 33544 Bronze New York, NY 
9 Aaron Hall 32035 Gold New York, NY, United States 
10 chown 31348 Silver New York, NY

Finally, we can use a bit of SEDE magic to parametrize the query:

-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;

This will let you search other badges and different rep thresholds.


Everything combined (demo):

-- top users in NYC region for desired tag (default Python)
-- forked from Avinash Raj's query for Chennai
-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;
select
 row_number() over(order by usr.Reputation desc) as [Rank], 
 usr.Id as [User Link], 
 usr.Reputation,
 case min(bdg.Class)
 when 1 then 'Gold'
 when 2 then 'Silver'
 when 3 then 'Bronze'
 else 'unknown' end as [Highest Badge],
 usr.Location 
from 
 dbo.Users as usr
 -- denormalize the Users with their Badges, i.e., flatten into a single table.
 join dbo.Badges as bdg
 on usr.Id = bdg.UserId
where
 bdg.name = @targetBadgeName
 and
 usr.Reputation >= @minimumReputation
 and 
 (
 lower(usr.Location) like '%nyc%'
 or upper(usr.Location) like '%NY, NY%'
 or lower(usr.Location) like '%staten%'
 or lower(usr.Location) like '%bronx%'
 or lower(usr.Location) like '%queens%'
 or lower(usr.Location) like '%new york, ny%'
 or lower(usr.Location) like '%manhattan%'
 or lower(usr.Location) like '%brooklyn%'
 )
group by
 usr.Id,
 usr.Reputation,
 usr.Location;
answered Jan 26, 2016 at 18:48
\$\endgroup\$
2
  • \$\begingroup\$ Dang, nice answer. You didn't really seem to change my indentation, is that ok? Also, can one factor out the lower(usr.Location) part? \$\endgroup\$ Commented Jan 26, 2016 at 18:51
  • \$\begingroup\$ Your indentation seems fine to me. As for the lower() function calls, SEDE is setup to be case-sensitive, so there is not really an efficient workaround. It's a good thing those function calls are inexpensive though. \$\endgroup\$ Commented Jan 26, 2016 at 19:07
3
\$\begingroup\$

In my opinion, your where clauses take up too much vertical space and can be made more maintainable. Compare what you have with this:

where
 b.name = N'python'
 and b.class = 3 -- 1 is gold, 2: silver, 3: bronze
 and u.Reputation >= 1000 
 and 
 (
 lower(u.Location) like '%nyc%'
 or upper(Location) like '%NY, NY%'
 or lower(Location) like '%staten%'
 or lower(Location) like '%bronx%'
 or lower(Location) like '%queens%'
 or lower(Location) like '%new york, ny%'
 or lower(Location) like '%manhattan%'
 or lower(Location) like '%brooklyn%'
 )

Suppose you're working on this query and you want to remove the Reputation condition to see what happens. In your version, you'd have to comment out two lines. In mine you'd only have to comment out one, although this doesn't work with your large and condition, nor does it work with your first condition (the only one that doesn't start with and). Note that it's a matter of convention whether your conditions put the and at the beginning or the end of the line.

I've seen some people write their where clauses like this:

where
 1=1
 and b.name = N'python'
 ...

In this way, it's easy to comment any single-line conditions: you just put a -- at the front. This is a step too far for me. I just put the conditions I'm most sure of at the top.

answered Jan 26, 2016 at 20:56
\$\endgroup\$
1
  • \$\begingroup\$ where 1=1 is OK when developing code but it infuriates me when I see that in production code! \$\endgroup\$ Commented Jan 26, 2016 at 21:21

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.