# Diagnosing and Optimizing Slow SQL Queries

**Source:** [https://www.youtube.com/watch?v=watwW4Hwyyw](https://www.youtube.com/watch?v=watwW4Hwyyw)
**Duration:** 00:15:51

## Summary

- Slow queries become a critical bottleneck as data volumes grow, so developers, data scientists, engineers, and DBAs must continuously tune SQL for performance and cost control.
- The first step in fixing a sluggish query is proper diagnosis using the SQL EXPLAIN command to view the detailed execution plan.
- A common red flag is a large disparity between rows scanned and rows returned, indicating unnecessary data scanning that inflates runtime.
- Excessive sorting in the plan signals high RAM consumption and should be addressed through index or query redesign.
- Full‑table scans are another major inefficiency; eliminating them with appropriate indexing or predicate refinement is a top priority before tackling more complex optimizations.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=watwW4Hwyyw&t=0s) **Standardizing Slow Query Diagnosis** - The speaker outlines a step‑by‑step, priority‑driven process—starting with using EXPLAIN—to accurately diagnose and optimize sluggish SQL queries for better performance and cost control.
- [00:03:06](https://www.youtube.com/watch?v=watwW4Hwyyw&t=186s) **Analyzing Query Plans for Optimization** - It outlines how examining query execution plans—showing costs, resource usage, and row counts—lets you pinpoint inefficiencies, beginning with syntax tweaks before moving on to larger architectural changes.
- [00:06:17](https://www.youtube.com/watch?v=watwW4Hwyyw&t=377s) **Query Optimization and Index Benefits** - The speaker details how filtering with WHERE clauses reduces scanned rows and then explains that adding an index pre‑sorts data, enabling the database to locate records quickly instead of scanning each row.
- [00:09:20](https://www.youtube.com/watch?v=watwW4Hwyyw&t=560s) **Strategic Indexing and Partitioning Guidance** - The speaker advises adding indexes only on columns used in WHERE clauses, balancing performance gains against write overhead, and suggests considering table partitioning for massive datasets when indexing alone isn’t enough.
- [00:12:28](https://www.youtube.com/watch?v=watwW4Hwyyw&t=748s) **When to Use Partitioning and Redesign** - The speaker explains that after optimizing queries and indexes, adding partitions can boost performance but demands careful planning and team consensus, and if issues remain, a complete data‑structure redesign may be required.
- [00:15:35](https://www.youtube.com/watch?v=watwW4Hwyyw&t=935s) **Optimizing Queries for AI Foundations** - The speaker emphasizes mastering techniques to ensure query performance, low runtime costs, and a solid base for AI development.

## Full Transcript
Slow running queries can be a major bottleneck for any data-driven organization.
And especially as data sets get larger to support AI and automation, this is gonna become increasingly important.
So whether you're a developer, a data scientist, a data engineer, or even a database administrator,
understanding how to tune queries to make sure that they're optimized for
performance is essential to making sure that we're controlling our runtime costs
and ensuring that we can deliver real-time data insights.
So that's what we're going to go through today,
a method of going through a standardized process to diagnose and optimize our queries to ensure best performance.
So let's walk through a scenario.
You've written a query and you got all the data to return exactly how you want it,
you deployed it, that's great, but unfortunately it's running slowly,
maybe you've gotten feedback from users or members of your team.
So let's go through the steps we need to take to fix that.
And for the sake of time and this video,
we're going to focus on things that are actually within our control and within the scope of the application.
And as we go through fixes, we're going to make sure we basically prioritize them
based on easiest to fix to most complex to actually change.
So first things first, before we even try and solve any problems
or add any optimizations, we want to make sure that we're diagnosing the problem correctly.
And this is done through a command called EXPLAIN.
EXPLAIN is part of SQL syntax.
So you'll see this literally as an EXPLAIN command that is written.
So you say explain this query,
and from there, you can actually then get a detailed query plan of exactly how your database is going to run that query.
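As a minimal sketch of what that looks like (using SQLite through Python here; the exact keyword and output format vary by database, and the table is made up for illustration):

```python
import sqlite3

# In-memory SQLite database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "USA" if i % 5 == 0 else "CAN", i * 1.5) for i in range(1000)],
)

# SQLite's flavor of the command: prefix the query with EXPLAIN QUERY PLAN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE country = 'USA'"
).fetchall()

# Each plan row ends with a human-readable detail string; with no index
# on country, it reports a full scan of the table.
for row in plan:
    print(row[-1])
```

In SQLite the prefix is `EXPLAIN QUERY PLAN`; in PostgreSQL it is simply `EXPLAIN` (or `EXPLAIN ANALYZE` to actually execute the query), and MySQL supports `EXPLAIN` as well.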
So with our explain method, we can identify a few key red flags.
Probably the thing that's going to come up the most
is showing that your scanned amount of rows
does not equal or is very, very distant from the amount of rows actually returned.
So we can actually see that the amount scanned does not equal the amount returned.
And this basically means, let's say you're looking for only a couple hundred rows,
maybe it's still querying through many, many millions.
That's what's going to result in that long run time or the inefficiency.
So seeing that in a way is good because that means you have room to improve.
The other thing that we really want to look for is basically sorts,
and basically with that we're going to see that when we're actually pulling data out in a query
that in order to sort it, you have to keep all that data in RAM. So that's going to be very memory-intensive.
That's another red flag that we can tackle with some optimizations.
But if you see that in your query plan, that's another area of optimization.
And then again, the next thing really is a full table scan.
A full table scan means that essentially the entire table is being scanned for your query.
So we wanna make sure those can be completely eliminated through some kind of optimization,
and again, I know I was talking about SQL syntax with Explain.
No matter your database, whether it's NoSQL or SQL, you're gonna be able to look at a query plan.
And a query plan, based on the syntax and the
details within your particular database, is always gonna tell you three things.
It's gonna list all these operations that it does, and then on top of that, for each one,
it's going to break it down by the amount of time it takes
and by the amount of resources it consumes, which is frequently depicted as cost,
and that will be RAM or CPU. And then it'll actually depict,
hopefully, the number of rows or transactions
that were included in that operation, and you can quickly start
to see where those areas of improvements can really be applied.
So now we can go ahead and start looking at how we can optimize our queries.
And as we go through the structured methodology,
we're going to actually start with things that are probably the most in our control.
And then the later optimizations are going to be more methods that
would probably be a larger team effort or a bigger change to your overall architecture.
So the first thing we're always really going to look at in query optimization
is the query itself, because about 80% of the time, that's why it's running slowly.
So, we're always going to look for...
making sure that our syntax is as optimized as possible.
So one of the things that we want to make sure we do as we're writing a query is that we're filtering as soon as we can.
Because what is a query, really?
It's a method of sorting through a database, so you have all your different rows.
And in most databases, that's quite a lot.
And in any kind of query where you're only pulling a little bit of information,
you wanna make sure that you're limiting the amount of information that's actually being scanned.
How this is done is really by adding a WHERE clause so that you are only scanning a smaller portion of the data.
So you're not going from one all the way to the bottom,
you're actually just gonna be querying a smaller proportion of the data,
so making sure you're taking every chance you can
to optimize and filter out any data that isn't relevant to this query is gonna help it run faster.
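A toy illustration of that (SQLite via Python; the `readings` table and its columns are invented for the example):

```python
import sqlite3

# Made-up sensor-readings table (names are invented for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(i % 10, f"2024-01-{(i % 30) + 1:02d}", float(i)) for i in range(3000)],
)

# Unfiltered: every row comes back, so every row had to be considered.
all_rows = conn.execute("SELECT * FROM readings").fetchall()

# Filtered: the WHERE clause restricts the query to the one day we care about.
one_day = conn.execute(
    "SELECT * FROM readings WHERE day = '2024-01-15'"
).fetchall()

print(len(all_rows), len(one_day))  # → 3000 100
```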
The other thing to look at is always your joins.
Joins are notoriously a problem, where we can find improvements for optimization.
And then we also want to look at, once you have your WHERE clauses, how you're actually writing those.
Are they distinct?
We want to make sure we're looking for a set list of things that are clearly defined.
So something to look out for would be an IN clause.
And this basically means that we're looking for something
where something is in a list, and you want to make sure that that list is relatively short,
because once that list becomes very long, you could start to see query inefficiencies.
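One hedged sketch of an alternative (SQLite via Python; whether this wins depends on your database and optimizer): when an IN list grows long, the same filter can be expressed as a join against a small helper table:

```python
import sqlite3

# Invented orders table; 500 distinct customers, 10 orders each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i % 500) for i in range(5000)],
)

wanted = list(range(0, 500, 7))  # imagine this list growing very long

# IN-list version: the query text itself grows with the list.
placeholders = ",".join("?" for _ in wanted)
in_rows = conn.execute(
    f"SELECT * FROM orders WHERE customer_id IN ({placeholders})", wanted
).fetchall()

# Join version: load the values into a small indexed table and join instead.
conn.execute("CREATE TEMP TABLE wanted_ids (customer_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)", [(w,) for w in wanted])
join_rows = conn.execute(
    "SELECT orders.* FROM orders JOIN wanted_ids USING (customer_id)"
).fetchall()

print(len(in_rows), len(join_rows))  # both return the same 720 rows
```

Either way, re-running EXPLAIN is the only honest way to know which form your particular optimizer handles better.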
So by taking a look at your query, you can make those changes, and then you should be rerunning your explain
so that you can actually quantify and see the improvements that these are making.
By adding that WHERE clause and by filtering more, you might see your sorts and your table scans be totally eliminated,
and you should see the value between the amount scanned and returned
get smaller and smaller, so they start to become virtually the same number.
So once we've fully optimized our query, the next step is to
decide if we want to add an index for our database to better work with this query.
So what an index is going to do is it's basically going to help the database optimize the sort
for a particular column, so that it can better navigate when it's doing some kind of lookup or search.
So first, we're going to back up and really think about what an index is.
So in a database, data is basically stored in a completely random order.
The database doesn't know the order.
It's probably the order that it originally was inserted in.
But that's basically how everything is laid out.
And from there, let's say we're looking at a list of countries.
Let's say we want to look up USA.
What we're going to do is we're first going to look in that first row.
Say, is this USA?
Maybe it is, maybe it isn't.
Then we would look at the second row, then the third, then the fourth.
That'll go all the way through until it hits every single row.
And that's a pretty inefficient query, right?
Because it has to check every single record to see if it matches that.
So what an index actually does is it presorts that information.
So that essentially the database knows where things are.
So even when it doesn't find the value right away, it can use that information to drive a more intelligent search.
So let's say we're gonna build an index on country name.
So, same use case: the data is still gonna be listed out.
And the difference here is now it's gonna be sorted in an order that the database can recognize.
And since it's a string, it's going to be from A to Z.
It's gonna be in alphabetical order.
And if this was a number, you know, you can make it in ascending or descending order, whatever you want.
But now we look up USA again.
It's actually going to start in the middle.
And it's going to say, is this USA?
Probably not.
We know USA is probably towards the bottom here.
And because it won't find it, it now knows that it doesn't even need to search this top portion.
It's just going to go about midway through.
It's going look again.
And then it's gonna look again, and it's basically going to hone in on where the value is based on the value that it actually gets.
So this is a much more intelligent search.
With this, it's probably going to find the value of USA within a few lookups, maybe a few dozen.
Whereas with this method, it is going to have to look through every single row pretty unintelligently.
So that's what an index does.
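The lookup pattern just described can be modeled in a few lines of Python (a toy model of the idea, not how a real database engine is implemented):

```python
# Toy model: comparisons needed by an unindexed scan vs. a binary search
# over presorted values (roughly what an index enables).
def linear_lookups(rows, target):
    """Check rows one by one; return how many comparisons were made."""
    for count, row in enumerate(rows, start=1):
        if row == target:
            return count
    return len(rows)

def binary_lookups(sorted_rows, target):
    """Binary search over presorted rows; return comparisons made."""
    lo, hi, count = 0, len(sorted_rows) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1
        if sorted_rows[mid] == target:
            return count
        if sorted_rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return count

countries = sorted(f"country_{i:06d}" for i in range(1_000_000))
target = countries[-1]  # worst case for the scan: the very last row

print(linear_lookups(countries, target))  # 1,000,000 comparisons
print(binary_lookups(countries, target))  # at most 20 comparisons
```

That gap, linear in the row count versus logarithmic, is why an index pays off so dramatically on large tables.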
So the reason we don't do this first is because the first thing you have to do here is optimize your query.
Because you can't just put an index on every column.
You're only going to put an index on the columns that you're searching with.
So these are the columns that are called in your WHERE clauses,
your ORDER BYs, your GROUP BYs, anything that's your lookup.
And that's where indexes really should be applied.
They're not a silver bullet.
We can't always just keep building indexes because there is an overhead cost associated with it.
Because now every time you write to your database this order has to be maintained.
So there is some backend operations in your database that happen.
So that's why there are some arbitrary rules, like you shouldn't have more than three indexes.
Now, if you have evidence with your explain that by adding an index,
it drastically improves the performance of a query, go ahead and add it.
But this, again, is just part of regular maintenance in a database.
Indexes, just like queries, should be regularly tuned and reviewed for new areas of optimization.
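To see the before-and-after in practice (again SQLite via Python as a stand-in; the CREATE INDEX syntax is similar in most SQL databases, and the table here is invented):

```python
import sqlite3

# Invented customers table, again using SQLite as a stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"country_{i % 200}") for i in range(10_000)],
)

query = "SELECT * FROM customers WHERE country = 'country_42'"

# Before the index, the planner has no choice but to scan the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

# Index the column used in the WHERE clause, then re-check the plan.
conn.execute("CREATE INDEX idx_customers_country ON customers (country)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

print(before)  # e.g. "SCAN customers"
print(after)   # e.g. "SEARCH customers USING INDEX idx_customers_country (country=?)"
```

Re-running the plan like this, before and after each change, is exactly the quantification step the speaker recommends.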
So really, with just these tools, you should see drastic improvement in your query performance.
The next measure you can consider for optimizing your query is by building
partitions on your table.
Now, partitions are much more so a database optimization.
So it's something that probably has to be taken as more of a team effort, a larger conversation.
But when working with a lot of data, sometimes that's what you need to do.
Because even if you've tuned perfectly with your query and your index,
sometimes you just can't sugarcoat querying a billion rows.
You need more optimization
and you need more filters.
So partitioning actually takes care of this more so at the storage level.
So again, we have our tables.
And in our tables, we have all of our data and all of our information.
And so we're actually going to break that table up, and horizontally is the way to think about it.
So that there are portions of the data that are kept together,
and we're actually going to completely segment them at the storage level,
so that the queries can just target one segment without even needing to look into the others.
That's really what a partition is.
So a common example with this is time series.
Let's say you have a use case where there are millions of transactions a day.
Maybe it's an Internet of Things use case with lots of sensors that are going into a database.
That's gonna result in a lot of different, you know, rows being created or records being generated.
So how can we break that up?
You can essentially partition it by day or even by hour to make sure that you're then breaking up your data into easy pieces.
So let's say like this is one bucket.
That can be your first hour.
You have another bucket.
That's gonna be your second hour.
You know, and this would go on.
And this automatically happens as the data is being inserted; this is a setting you put on your table in your database design.
So now, if your queries are going to run something like, show me all the records that have happened in the last hour,
you can scope this query, basically making sure that you're just searching this one bucket.
And from there, you're going to know that you've found it, and you never need to even look in any of these other buckets at all.
And this works the same way.
If you ever need to just look at a distinct number of buckets, it doesn't have to be just one.
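SQLite, used for the earlier sketches, has no declarative partitioning, but the bucket idea can be imitated by hand (purely an illustration of the concept; real databases like PostgreSQL handle this with PARTITION BY on the table definition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def bucket_for(day: str) -> str:
    """Map a date like '2024-01-15' to a per-day table name."""
    return "events_" + day.replace("-", "_")

def insert_event(day: str, payload: str) -> None:
    """Route each record into its day's bucket, creating it on first use."""
    table = bucket_for(day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, payload TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (day, payload))

# Sensor-style records arriving across two days land in separate buckets.
for day in ("2024-01-14", "2024-01-15"):
    for i in range(100):
        insert_event(day, f"reading-{i}")

# "Show me everything from 2024-01-15" touches exactly one bucket;
# the other bucket is never scanned at all.
rows = conn.execute(f"SELECT * FROM {bucket_for('2024-01-15')}").fetchall()
print(len(rows))  # → 100
```

Declarative partitioning does this routing automatically, but the query-side payoff is the same: the engine reads only the partitions the filter can possibly match.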
So again, partitioning is a great tool for basically breaking things up when you have really, really large data volumes.
And if queries and indexes aren't doing the trick, partitions could be the next step.
However, it is a major change.
You might have to re-org your tables.
So make sure you work with your team.
But you're going to have enough evidence from Explain and from the optimizations you've already applied
to have a very fruitful conversation, probably with your database administrator, on why partitions need to be built.
and that it could potentially lead to much better performance.
So between queries, indexes, and partitions, you should see great improvement in your query performance.
Now, the last step that really could be taken if you still are seeing query performance issues
would probably be to look into redesigning your data structure itself.
And this is no small undertaking and definitely should be a team decision.
But with all the evidence that you've gathered through this process of optimization,
you should be able to have a very fruitful discussion about how you've taken everything,
you know, every precaution, every optimization available to you, and we're still seeing query issues.
It might be time to actually change how you're storing your data.
So as you're deciding how you can better structure your data, here is a great thing to consider:
think mostly about how data is accessed,
and you want to make sure that data that is frequently accessed together is stored together.
And then you can kind of make that decision.
Do you want to denormalize tables together?
Do you wanna break them up based on relationships?
This is very easy to do if you're probably at the beginning stages of your application development,
but if you've inherited a legacy application or are working with something with
many years of historic data, this will be a larger effort.
The other thing to consider would be parallel computing frameworks like Spark or Hadoop.
This is going to add parallelism basically to your queries so that you can query large amounts of data better.
Not a small undertaking.
This will be changing probably your middleware layer or how your data is queried itself.
So both of these are great options, but they do require a large effort from our teams.
So overall...
we have found that through all these different methods, we can certainly improve the performance of our queries.
So we always want to make sure that we're starting with the simple fixes,
what's probably in our control and easiest for us to change
before we go to more complex redesign and our architectural overhauls.
And when we're also thinking about
how we're going to benchmark ourselves and how we're monitoring this, we want to leverage the EXPLAIN method,
so that we're constantly measuring our performance.
And this shouldn't just be done when something goes wrong.
We really should constantly be tuning our queries before things get to that point.
So by mastering all these techniques,
we can make sure that we are all experts
in making sure that our queries are as performant as possible, that we're keeping our runtime costs as low as possible,
and that we're providing the necessary foundation we're gonna need for AI development.