I have a primary table, REP_ASSIST.ES_LOG
with multiple columns that are foreign keys to another table, REP_ASSIST.ES_DOCUMENT_STATUSES
. Here is an example of what my main table looks like:
+-------------------------+-----------+-------------------------+-----------------------+
| created_at | filing_id | prior_dd_rcvd_status_id | new_dd_rcvd_status_id |
+-------------------------+-----------+-------------------------+-----------------------+
| 2019年09月04日 10:58:48.000 | 988 | 2 | 2 |
| 2019年09月04日 10:47:03.000 | 988 | 1 | 2 |
| 2019年08月28日 23:56:47.000 | 988 | null | 1 |
+-------------------------+-----------+-------------------------+-----------------------+
Both the prior_dd_rcvd_status_id
and new_dd_rcvd_status_id
are foreign keys to the other table, a sample of which is here:
+------------------+--------------------+
| dd_doc_status_id | dd_doc_status_name |
+------------------+--------------------+
| 1 | RECEIVED |
| 2 | MISSING |
| 3 | NOT_NEEDED |
| 4 | UNKNOWN |
+------------------+--------------------+
I need to pull in the values from the lookup table for a printable report; right now I'm using a subquery to get the dd_doc_status_name
as a user-friendly name instead of the dd_doc_status_id
.
SELECT created_at,
filing_id,
(
SELECT dd_doc_status_name
FROM REP_ASSIST.ES_DOCUMENT_STATUSES
WHERE ES_DOCUMENT_STATUSES.dd_doc_status_id = prior_dd_rcvd_status_id
) AS 'prior_dd_status',
(
SELECT dd_doc_status_name
FROM REP_ASSIST.ES_DOCUMENT_STATUSES
WHERE ES_DOCUMENT_STATUSES.dd_doc_status_id = new_dd_rcvd_status_id
) AS 'new_dd_status'
FROM REP_ASSIST.ES_LOG
WHERE filing_id = 988;
However, I'm concerned that using subqueries like this may be a bit of a hack and affect performance (especially if I end up adding additional columns that need this information).
I'm getting the correct results (see below), but I'd like to know if there is a more efficient and more standard way.
+-------------------------+-----------+----------------------+--------------------+
| created_at | filing_id | prior_dd_rcvd_status | new_dd_rcvd_status |
+-------------------------+-----------+----------------------+--------------------+
| 2019年09月04日 10:58:48.000 | 988 | MISSING | MISSING |
| 2019年09月04日 10:47:03.000 | 988 | RECEIVED | MISSING |
| 2019年08月28日 23:56:47.000 | 988 | null | RECEIVED |
+-------------------------+-----------+----------------------+--------------------+
2 Answers 2
I've updated the Fiddle with your initial query and 2 alternatives.
This is the initial query plan:
Using OUTER APPLY
instead of the nested SELECT
statements.
select created_at, prior_dd_rcvd_status_id,
new_dd_rcvd_status_id, prior_dd_status, new_dd_status, filing_id
from ES_LOG
outer apply (
select dd_doc_status_name as prior_dd_status
from ES_DOCUMENT_STATUSES where dd_doc_status_id = prior_dd_rcvd_status_id
) a
outer apply (
select dd_doc_status_name as new_dd_status
from ES_DOCUMENT_STATUSES where dd_doc_status_id = new_dd_rcvd_status_id
) b
where filing_id = 988;
The query plan gets simplified:
And using LEFT OUTER JOIN
instead of the OUTER APPLY
.
select created_at, prior_dd_rcvd_status_id,
new_dd_rcvd_status_id,
a.dd_doc_status_name as prior_dd_status,
b.dd_doc_status_name as new_dd_status,
filing_id
from ES_LOG
left outer join ES_DOCUMENT_STATUSES a
on a.dd_doc_status_id = prior_dd_rcvd_status_id
left outer join ES_DOCUMENT_STATUSES b
on b.dd_doc_status_id = new_dd_rcvd_status_id
where filing_id = 988;
With an even more simplified query plan:
However, as you can see, the table scans take most of the time. So I'm not sure how much performance you could gain by picking either of the alternatives.
Further improvements require the use of indexes (Table Scan vs Index Scan). Deciding which index to provide depends not only on this query, but also the general design of how these tables will be used for other queries and CRUD operations. I would advise you learn about them.
-
1\$\begingroup\$ Great answer! One thing that is probably worth calling out is that with the volumes of these tables, SQL Server is pretty much guaranteed to nested loop and not try to do anything fancy with them. If production table sizes are different then this might be very different, and is at least a grain of salt with looking at the plans. \$\endgroup\$Dan Oberlam– Dan Oberlam2019年09月06日 15:07:42 +00:00Commented Sep 6, 2019 at 15:07
However, I'm concerned that using subqueries like this may be a bit of a hack and affect performance (especially if I end up adding additional columns that need this information).
Whenever you find yourself including a subquery as part of your SELECT
list, you're right to question whether or not you're doing the correct thing. These can often suffer from performance issues, and are much harder to read and troubleshoot.
I wouldn't worry as much about the second half of your question:
especially if I end up adding additional columns that need this information
Unless you know that you're likely to do this in the future, you're likely micro- and pre-optimizing.
That being said, ultimately your solution is very straightforward. All you want is a join from table A
to table B
on two different columns. You might recognize that what you actually need is two joins from table A
to table B
. That would just look like this:
SELECT created_at,
filing_id,
PriorStatuses.dd_doc_status_name prior_dd_status,
NewStatuses.dd_doc_status_name new_dd_status
FROM REP_ASSIST.ES_LOG
LEFT OUTER JOIN REP_ASSIST.ES_DOCUMENT_STATUSES PriorStatuses
ON ES_LOG.prior_dd_rcvd_status_id = PriorStatuses.dd_doc_status_id
LEFT OUTER JOIN REP_ASSIST.ES_DOCUMENT_STATUSES NewStatuses
ON ES_LOG.new_dd_rcvd_status_id = NewStatuses.dd_doc_status_id
WHERE filing_id = 988;
I did my best to come up with meaningful names based on context, but you should just pick some domain-relevant alias as appropriate.
If you do find yourself with a really large number of these columns, you might want to consider a new table design. For example, it might then be easier to have a table like this:
CREATE TABLE REP_ASSIST.ES_LOG_STATUSES
(
filing_id bigint NOT NULL,
doc_status_type nvarchar(50) NOT NULL,
status_id bigint NOT NULL
);
Then you can add new status types (or whatever the appropriate domain terminology would be) ad-hoc to this table, and your query then becomes something like this
SELECT ES_LOG.created_at,
ES_LOG.filing_id,
ES_LOG_STATUSES.doc_status_type,
ES_DOCUMENT_STATUSES.dd_doc_status_name
FROM REP_ASSIST.ES_LOG
INNER JOIN REP_ASSIST.ES_LOG_STATUSES
ON ES_LOG.filing_id = ES_LOG_STATUSES.filing_id
LEFT OUTER JOIN REP_ASSIST.ES_DOCUMENT_STATUSES
ON ES_LOG_STATUSES.status_id = ES_DOCUMENT_STATUSES.status_id
WHERE ES_LOG.filing_id = 988;
If its important to have the output be one row per filing_id
instead of one row per filing_id
and doc_status_type
combination, you could always PIVOT
your result set. If the set of possible status types is unknown or changes rapidly, there are plenty of ways to do that dynamically
left join
s on yourES_DOCUMENT_STATUSES
table instead of the subqueries will be fine, I don't know aboutsql-server
so forgive me from putting a complete answer. \$\endgroup\$