Vector Databases: Bridging the Semantic Gap

9m • databases • tutorial • beginner

Key Points

Vector databases store data as mathematical vector embeddings—arrays of numbers—that capture the semantic essence of unstructured items like images, text, and audio.
Traditional relational databases rely on structured metadata and manual tags, which creates a “semantic gap” that makes it difficult to query for nuanced concepts such as similar color palettes or scene content.
In a vector space, similar items are positioned close together and dissimilar items far apart, enabling similarity searches through simple distance calculations.
By converting complex, unstructured data into embeddings and indexing them in a vector database, you can perform efficient, semantically aware retrieval that goes beyond exact keyword matching.
This approach allows queries like “find images with similar landscapes” or “retrieve audio clips with similar tones,” which are impractical with conventional SQL‑style queries.

**Source:** [https://www.youtube.com/watch?v=gl1r1XV0SLw](https://www.youtube.com/watch?v=gl1r1XV0SLw)

**Duration:** 00:09:36

## Sections

- [00:00:00](https://www.youtube.com/watch?v=gl1r1XV0SLw&t=0s) **Bridging the Semantic Gap** - The speaker explains how relational databases store image files and basic metadata but fail to capture semantic context, highlighting the need for vector databases to enable similarity‑based queries.
- [00:03:20](https://www.youtube.com/watch?v=gl1r1XV0SLw&t=200s) **Understanding Image Vector Embeddings** - The speaker explains that image embeddings are numeric vectors whose dimensions encode learned visual features, illustrating this by comparing a mountain scene and a beach sunset and showing how similar dimensions reflect shared attributes like warm colors.
- [00:06:25](https://www.youtube.com/watch?v=gl1r1XV0SLw&t=385s) **Layered Feature Extraction & Vector Indexing** - The passage explains how embedding models progressively abstract data into high‑dimensional vectors and why vector indexing is essential for fast similarity search across massive vector databases.
- [00:09:28](https://www.youtube.com/watch?v=gl1r1XV0SLw&t=568s) **Dual Role of Vector Stores** - The passage explains that these systems act both as repositories for unstructured data and as engines for rapid, semantic retrieval.

## Full Transcript

0:00 What is a vector database? Well, they say a picture is worth a thousand words, so let's start with one. Now, in case you can't tell, this is a picture of a sunset on a mountain vista. Beautiful.

0:13 Now let's say this is a digital image and we want to store it. We want to put it into a database, and we're going to use a traditional database here, called a relational database.

0:29 Now, what can we store in that relational database about this picture? Well, we can put the actual picture's binary data into our database to start with, so this is the actual image file. But we can also store some other information as well, like some basic metadata about the picture: things like the file format and the date that it was created, stuff like that. And we can also add some manual tags. So we could say, let's have tags for sunset and landscape and orange.

1:07 That sort of gives us a basic way to retrieve this image, but it largely misses the image's overall semantic context. How would you query for images with similar color palettes using this information, or for images with mountain landscapes in the background? Those concepts aren't really represented very well in these structured fields, and that disconnect between how computers store data and how humans understand it has a name. It's called the semantic gap.

1:45 Now, a traditional database query like "select star where color equals orange" kind of falls short, because it doesn't really capture the nuanced, multi-dimensional nature of unstructured data. That's where vector databases come in, by representing data as mathematical vector embeddings. And what a vector embedding is, essentially, is an array of numbers.
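To make the limits of the tag-based query concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, file names, and tags are all hypothetical and are not from the video; they only illustrate that an exact tag match finds the image tagged "orange" and nothing else, even when another image looks similar.

```python
import sqlite3

# In-memory relational table holding basic metadata and manual tags.
# The schema and rows are illustrative assumptions, not from the talk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "id INTEGER PRIMARY KEY, filename TEXT, file_format TEXT, tags TEXT)"
)
conn.executemany(
    "INSERT INTO images (filename, file_format, tags) VALUES (?, ?, ?)",
    [
        ("mountain_sunset.jpg", "jpeg", "sunset,landscape,orange"),
        ("beach_dusk.jpg", "jpeg", "beach,evening,amber"),
        ("city_street.png", "png", "urban,night"),
    ],
)


def find_by_tag(tag):
    """Return filenames whose comma-separated tag list contains `tag` exactly."""
    cursor = conn.execute(
        "SELECT filename FROM images "
        "WHERE ',' || tags || ',' LIKE ? ORDER BY id",
        (f"%,{tag},%",),
    )
    return [row[0] for row in cursor.fetchall()]


# Only the image someone happened to tag "orange" comes back. The beach
# photo tagged "amber" is missed, and a concept nobody tagged finds nothing.
print(find_by_tag("orange"))
print(find_by_tag("warm"))
```

The query can only match what was written into the structured fields, which is the semantic gap described above.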
2:19 Now these vectors capture the semantic essence of the data, where similar items are positioned close together in vector space and dissimilar items are positioned far apart. With vector databases, we can perform similarity searches as mathematical operations, looking for vector embeddings that are close to each other, and that translates to finding semantically similar content.

2:43 Now, we can represent all sorts of unstructured data in a vector database. What could we put in here? Well, image files of course, like our mountain sunset. We could put in a text file as well, or we could even store audio files in here. This is all unstructured data, and these complex objects are transformed into vector embeddings, and those vector embeddings are then stored in the vector database.

3:21 So what do these vector embeddings look like? Well, I said they are arrays of numbers, and they are arrays of numbers where each position represents some kind of learned feature. So let's take a simplified example. Remember our mountain picture here? Yep, we can represent that as a vector embedding. Now, let's say that the vector embedding for the mountain has a first dimension of, say, 0.91, then the next one is 0.15, and then there's a third dimension of 0.83, and so forth.

3:59 What does all that mean? Well, the 0.91 in the first dimension indicates significant elevation changes because, hey, this is the mountains. The 0.15 in the second dimension shows few urban elements; we don't see many buildings here, so that's why that score is quite low. The 0.83 in the third dimension represents strong warm colors, like a sunset, and so on. All sorts of other dimensions can be added as well.

4:30 Now we could compare that to a different picture. What about this one, which is a sunset at the beach?
4:37 So let's have a look at the vector embedding for the beach example. This would also have a series of dimensions. Let's say the first one is 0.12, then we have a 0.08, and then finally we have a 0.89, with more dimensions to follow.

4:59 Now, notice how there are some similarities here. The third dimension, 0.83 and 0.89, is pretty similar. That's because they both have warm colors; they're both pictures of sunsets. But the first dimension differs quite a lot, because a beach has minimal elevation changes compared to the mountains.

5:24 Now, this is a very simplified example. In real machine learning systems, vector embeddings typically contain hundreds or even thousands of dimensions, and I should also say that individual dimensions rarely correspond to such clearly interpretable features, but you get the idea.

5:42 This all brings up the question: how are these vector embeddings actually created? Well, the answer is through embedding models that have been trained on massive datasets. Each type of data has its own specialized type of embedding model that we can use. So I'm going to give you some examples of those. You might use CLIP for images. If you're working with text, you might use GloVe, and if you're working with audio, you might use Wav2vec.

6:21 These processes are all pretty similar. Basically, you have data that passes through multiple layers, and as it goes through the layers of the embedding model, each layer extracts progressively more abstract features. So for images, the early layers might detect some pretty basic stuff, like edges, and then as we get to deeper layers, we would recognize more complex stuff, like maybe entire objects.
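Returning to the mountain and beach vectors compared earlier, the sketch below puts the three example dimensions (0.91, 0.15, 0.83 and 0.12, 0.08, 0.89) into plain Python lists and measures how far apart they are. The talk does not name a specific distance metric, so Euclidean distance and cosine similarity are used here as assumptions; both are common choices.

```python
import math

# The three simplified dimensions from the talk:
# [elevation change, urban elements, warm colors]
mountain = [0.91, 0.15, 0.83]
beach = [0.12, 0.08, 0.89]


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 means more alike)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Gap per dimension: the first (elevation) differs most, while the third
# (warm colors) is nearly the same, matching the comparison in the talk.
per_dimension_gap = [abs(x - y) for x, y in zip(mountain, beach)]

print(per_dimension_gap)
print(euclidean_distance(mountain, beach))
print(cosine_similarity(mountain, beach))
```

With real embeddings of hundreds or thousands of dimensions the same functions apply unchanged; only the list length grows.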
6:53 For text, perhaps, these early layers would figure out the individual words that we're looking at, but then later, deeper layers would be able to figure out context and meaning. How this essentially works is we take the high-dimensional vectors from this deeper layer here, and those high-dimensional vectors often have hundreds or maybe even thousands of dimensions that capture the essential characteristics of the input.

7:25 Now that we have vector embeddings created, we can perform all sorts of powerful operations that just weren't possible with those traditional relational databases, things like similarity search, where we can find items that are similar to a query item by finding the closest vectors in the space.

7:42 But when you have millions of vectors in your database, and those vectors are made up of hundreds or maybe even thousands of dimensions, you can't effectively and efficiently compare your query vector to every single vector in the database. It would just be too slow. So there is a process to address that, and it's called vector indexing.

8:09 Vector indexing uses something called approximate nearest neighbor, or ANN, algorithms. Instead of finding the exact closest match, these algorithms quickly find vectors that are very likely to be among the closest matches. Now, there are a bunch of approaches for this. For example, there's HNSW, that is Hierarchical Navigable Small World, which creates multi-layered graphs connecting similar vectors, and there's also IVF, that's Inverted File Index, which divides the vector space into clusters and only searches the most relevant of those clusters. These indexing methods are basically trading a small amount of accuracy for pretty big improvements in search speed.
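The IVF idea described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not how any particular vector database implements it: the data is random, the clustering is a few rounds of basic k-means, and the names `build_ivf`, `ivf_search`, and `n_probe` are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(vectors, query, k):
    """Exact search: compare the query against every stored vector."""
    ranked = sorted(range(len(vectors)), key=lambda i: dist(vectors[i], query))
    return ranked[:k]


def assign(vectors, centroids):
    """Build inverted lists: for each centroid, the ids of its nearest vectors."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def build_ivf(vectors, n_clusters, n_iters=5, seed=0):
    """Cluster the vectors with a few k-means rounds; return (centroids, lists)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        lists = assign(vectors, centroids)
        for c, members in enumerate(lists):
            if members:  # keep the old centroid if a cluster ends up empty
                dims = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(d) / len(members) for d in dims]
    return centroids, assign(vectors, centroids)


def ivf_search(vectors, centroids, lists, query, k, n_probe):
    """Approximate search: scan only the n_probe clusters nearest the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    candidates.sort(key=lambda i: dist(vectors[i], query))
    return candidates[:k]


# Toy data: 1,000 random 8-dimensional vectors and one random query.
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = [rng.random() for _ in range(8)]

centroids, lists = build_ivf(vectors, n_clusters=10)
exact = brute_force_search(vectors, query, k=5)
approx = ivf_search(vectors, centroids, lists, query, k=5, n_probe=2)

print("exact ids: ", exact)
print("approx ids:", approx)
print("overlap:   ", len(set(exact) & set(approx)), "of", len(exact))
```

Probing fewer clusters means fewer distance computations but a higher chance of missing a true nearest neighbor that sits just across a cluster boundary; probing every cluster reproduces the exact result. That is the accuracy-for-speed trade described above.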
8:57 Now, vector databases are a core feature of something called RAG, retrieval augmented generation, where vector databases store chunks of documents and articles and knowledge bases as embeddings. Then, when a user asks a question, the system finds the relevant text chunks by comparing vector similarity and feeds those to a large language model to generate responses using the retrieved information.

9:26 So that's vector databases. They are both a place to store unstructured data and a place to retrieve it quickly and semantically.
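The retrieval half of the RAG flow described above can be sketched as follows. Everything here is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model (so it only captures word overlap, not meaning), the chunks are made-up sentences, and `build_prompt` just assembles a string rather than calling a language model.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Vector store": each document chunk is kept alongside its embedding.
chunks = [
    "HNSW builds multi-layered graphs that connect similar vectors.",
    "IVF divides the vector space into clusters and searches only "
    "the most relevant clusters.",
    "Embedding models are trained on massive datasets.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(question, top_k=1):
    """Return the top_k chunks whose embeddings are most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(
        store, key=lambda item: cosine_similarity(q, item[1]), reverse=True
    )
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question, top_k=1):
    """Combine retrieved chunks and the question into text for a language model."""
    context = "\n".join(retrieve(question, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How does IVF use clusters?"))
```

In a real system the stand-in embedding would be replaced by a trained model and the linear scan in `retrieve` by an indexed vector database query.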