A vector DB is a fast nearest-neighbor index. It's not magic, it's not always needed, and the embedding model matters more than the DB.
11 min · Reviewed 2026
The premise
Vector databases get treated as magic AI infrastructure. In reality they are nearest-neighbor indexes over embeddings, and recall quality depends mostly on the embedding model and chunking strategy, not on the DB you pick.
What AI does well here
Return semantically similar chunks for an embedded query
Scale to millions of vectors with the right index
Combine with metadata filters when configured
What AI cannot do
Improve recall when the embeddings are bad
Tell you why a relevant document didn't come back
Replace good chunking and metadata design
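The core operation behind all of this can be shown in a minimal sketch: a brute-force nearest-neighbor search over pre-computed embeddings. The vectors and document IDs below are illustrative placeholders, and a real vector database replaces the linear scan with an approximate index; the point is that the DB only ranks what the embedding model gives it.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the vectors' magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_neighbors(query, vectors, k=2):
    # Brute-force top-k search; real vector DBs use approximate index
    # structures (e.g. HNSW) to avoid scanning every stored vector.
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" — real ones have hundreds of dimensions.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}
print(nearest_neighbors([1.0, 0.0, 0.0], docs, k=2))  # ['doc_a', 'doc_c']
```

Note that if the embeddings place a relevant document far from the query, no amount of database tuning brings it back — which is the "garbage in, garbage out" point above.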
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-tools-vector-database-fundamentals-r7a1-creators
What is the fundamental function of a vector database?
It stores original documents and their metadata
It performs nearest-neighbor search over embedded data
It generates embeddings from raw text
It trains machine learning models on your data
Which factor has the greatest impact on recall quality in a vector search system?
The hardware infrastructure (GPU vs CPU)
The specific vector database product chosen
The embedding model and chunking strategy
The amount of metadata stored
Before adopting a dedicated vector database, what does the lesson recommend trying first?
Amazon DynamoDB with vector support
MongoDB with Atlas search
PostgreSQL with the pgvector extension
Elasticsearch with vector plugins
What does 'recall@k' measure?
The percentage of relevant items found within the top k results
The number of vectors the system can store
The time it takes to return k results
The similarity score threshold used for filtering
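For reference, recall@k has a one-line definition that is easy to sketch in code. The retrieved and relevant IDs below are made-up examples:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant items that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# 2 of the 3 relevant docs appear in the top 5 retrieved results.
retrieved = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d4"}
print(round(recall_at_k(retrieved, relevant, k=5), 2))  # 0.67
```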
Why can't a vector database improve recall when the underlying embeddings are low quality?
Vector databases require perfect embeddings to function
The search algorithm automatically corrects errors in embeddings
Vector databases have strict size limits on embeddings
The database can only work with what the embeddings represent — garbage in, garbage out
What problem occurs when key sentences are split across multiple chunks using fixed token count chunking?
The chunks become too large to process efficiently
The embedding model ignores short chunks
The database automatically merges small chunks
Those sentences become unfindable because context is lost
What chunking approach does the lesson recommend for better recall?
Chunk by sentence boundaries only
Fixed token count (256 tokens each)
Chunk by semantic units like paragraphs or sections
Fixed character count (500 characters each)
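Chunking by semantic units can be sketched minimally as splitting on paragraph boundaries and packing whole paragraphs into size-limited chunks. The `max_chars` budget is an illustrative parameter, not a recommended value:

```python
def chunk_by_paragraphs(text, max_chars=500):
    # Split on blank lines (paragraph boundaries) and pack paragraphs
    # into chunks up to max_chars, keeping each semantic unit whole.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Unlike fixed token-count splitting, this never cuts a paragraph in half, so a key sentence always lands in exactly one chunk with its surrounding context.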
What is semantic similarity search?
Finding documents created in the same time period
Finding documents that share the same authors
Finding results based on meaning rather than exact wording
Finding exact keyword matches in text
A relevant document fails to appear in search results. What is a likely cause that the vector database itself cannot explain?
The document was added after the index was built
The query used the wrong HTTP method
The embedding model failed to capture the document's relevance to that query
The database ran out of storage space
What capability do vector databases offer when combined with metadata filters?
They can narrow results by both semantic similarity and structured criteria
They can guarantee perfect recall
They can convert structured filters into embeddings
They can automatically generate metadata
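The metadata-plus-similarity combination can be sketched as a pre-filter followed by a ranked scan. The corpus, metadata keys, and vectors below are invented for illustration, and the dot-product score assumes unit-normalized embeddings:

```python
def dot(a, b):
    # Similarity score; assumes unit-normalized embedding vectors.
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, docs, filters, k=3):
    # Narrow by structured metadata first, then rank the survivors
    # by semantic similarity — the combination described above.
    candidates = [
        d for d in docs
        if all(d["meta"].get(key) == val for key, val in filters.items())
    ]
    candidates.sort(key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:k]]

corpus = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vec": [0.0, 1.0], "meta": {"lang": "en"}},
    {"id": "c", "vec": [1.0, 0.0], "meta": {"lang": "de"}},
]
print(filtered_search([1.0, 0.0], corpus, {"lang": "en"}, k=1))  # ['a']
```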
What is the primary purpose of converting text into embeddings?
To format text for display on websites
To remove personally identifiable information
To compress text for storage efficiency
To enable mathematical similarity comparisons
How do modern vector databases handle scaling to millions of vectors?
They require manual partitioning by the user
They use specialized index structures for approximate nearest-neighbor search
They switch to exact matching at scale
They automatically delete older vectors
What is the benefit of overlapping chunks when chunking documents for vector search?
It reduces storage requirements
It eliminates the need for embedding models
It automatically generates better embeddings
It ensures key ideas aren't split across non-matching chunks
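Overlapping chunks are a simple sliding window over the token sequence. The `size` and `overlap` values below are illustrative, not recommendations:

```python
def sliding_chunks(tokens, size=256, overlap=64):
    # Each chunk repeats the last `overlap` tokens of its predecessor,
    # so a sentence straddling a boundary still appears whole somewhere.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

The trade-off is storage: overlapping windows embed some tokens twice, in exchange for not losing ideas that sit on a chunk boundary.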
Why is the embedding model considered more important than the database choice?
The database only indexes what the embedding model creates — poor embeddings can't be rescued by database features
Embedding models require less maintenance
Databases have no impact on results
Embedding models are more expensive to run
What does 'nearest-neighbor search' mean in the context of vector databases?
Finding the closest document by word count
Finding documents created by the same author
Finding the most recently added documents
Finding vectors that are mathematically closest in multi-dimensional space