In the world of machine learning and GenAI, vectors and embeddings are fundamental concepts that enable sophisticated data analysis and retrieval techniques such as vector search. This article aims to demystify these concepts and provide a practical demonstration using TiDB’s vector search capabilities.
Understanding Vector Embeddings
What Are Vectors?
In the simplest terms, a vector is an array of numbers that represents a point in space. In the context of machine learning, vectors are used to represent data in a way that captures its essential features in a numerical format. For example, a vector can represent words, images, videos, or any other type of data by encoding relevant features into a numerical array.
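As a toy sketch (the feature labels in the comments are invented purely for illustration), a vector in TiDB's literal syntax is simply a bracketed list of numbers:
-- A toy 3-dimensional vector for the word "apple".
-- You can read each position as a feature, e.g. [fruitiness, animal-ness, sweetness];
-- real embedding models learn such dimensions automatically and typically
-- use hundreds or thousands of them.
SELECT '[0.9, 0.0, 0.7]' AS apple_vector;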
What Are Embeddings?
Embeddings are a specific type of vector representation where the data is mapped into a lower-dimensional space. The primary goal of embeddings is to place similar items closer together in this space, allowing for meaningful distance measurements. For instance, in natural language processing (NLP), words with similar meanings are represented by embeddings that are close to each other in the vector space.
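As a rough illustration (the 3-dimensional vectors below are made up; real embeddings come from a trained model), related words end up a small distance apart, while unrelated words end up far apart:
-- 'apple' and 'banana' point in nearly the same direction; 'car' does not.
SELECT
    vec_cosine_distance('[0.9, 0.1, 0.2]', '[0.8, 0.2, 0.3]') AS apple_vs_banana,  -- about 0.02
    vec_cosine_distance('[0.9, 0.1, 0.2]', '[0.1, 0.9, 0.8]') AS apple_vs_car;     -- about 0.70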
How TiDB Uses Vector Search
TiDB’s vector search feature allows you to perform semantic searches and similarity searches on various types of data such as text, images, videos, and more. Instead of searching based on the raw data, vector search operates on the vector representations (embeddings) of the data, making it possible to find semantically similar items efficiently.
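To make that concrete, here is a minimal sketch of the usual flow; the table and column names are illustrative, the 3-dimensional vectors are toy values (real models produce far more dimensions), and the query vector would be generated by an embedding model in your application before being sent to TiDB:
-- Store each document together with the embedding produced for it.
CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT,
    embedding VECTOR(3)  -- toy dimension; use your embedding model's output size in practice
);

-- At query time, embed the user's question with the same model and ask
-- TiDB for the documents whose embeddings are closest to it.
SELECT id, content
FROM documents
ORDER BY vec_cosine_distance(embedding, '[0.1, 0.7, 0.2]')  -- the query embedding
LIMIT 5;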
Key Concepts in TiDB Vector Search
- Vector Embeddings: Representations of data as vectors in a multi-dimensional space.
- Nearest Neighbor Search: Finding vectors that are closest to a given vector in terms of a defined distance metric.
- Distance Metrics: Methods to measure similarity between vectors. Common metrics include Cosine Distance, Euclidean Distance (L2), and Manhattan Distance (L1); a short worked example follows this list.
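Here is a quick worked example of how the metrics differ, using the distance functions described later in this article (and assuming vec_l2_distance returns the standard, non-squared Euclidean distance). Note that [2,4,6] points in exactly the same direction as [1,2,3], so its cosine distance is 0 even though the L1 and L2 distances are not:
SELECT
    vec_l1_distance('[1,2,3]', '[2,4,6]')     AS manhattan,  -- |1-2| + |2-4| + |3-6| = 6
    vec_l2_distance('[1,2,3]', '[2,4,6]')     AS euclidean,  -- sqrt(1 + 4 + 9), about 3.74
    vec_cosine_distance('[1,2,3]', '[2,4,6]') AS cosine;     -- 0: same direction, only magnitude differs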
Real Demo: Using TiDB for Vector Search
Setting Up TiDB for Vector Search
To begin with TiDB vector search, you need to create a TiDB Serverless cluster with vector search support. Follow these steps:
- Request access: Request a TiDB Serverless cluster with vector storage support at https://tidb.cloud/ai.
- Sign up: Register at tidbcloud.com and select the eu-central-1 region, where vector search is currently available.
- Create a Cluster: Follow the tutorial to create a TiDB Serverless cluster with vector support enabled.
Creating and Querying a Table with Vectors
Here’s a practical example of how to create a table with vector fields, insert data, and perform a vector search.
1. Create a Table:
CREATE TABLE vector_table (
id INT PRIMARY KEY,
doc TEXT,
embedding VECTOR(3)  -- a 3-dimensional vector column; match the dimension to your embedding model's output
);
2. Insert Data:
INSERT INTO vector_table VALUES
(1, 'apple', '[1,1,1]'),
(2, 'banana', '[1,1,2]'),
(3, 'dog', '[2,2,2]');
3. Query Data:
To retrieve the nearest neighbors to a given vector, use the following query:
SELECT * FROM vector_table
ORDER BY vec_cosine_distance(embedding, '[1,1,3]')
LIMIT 3;
This query orders the records by their cosine distance to the vector [1,1,3], so the most similar items come first. With the sample data above, 'banana' ([1,1,2]) ranks first because its direction is closest to the query vector.
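If you also want to see the scores, select the distance expression itself; with the sample rows above, the result looks roughly like this:
SELECT id, doc, vec_cosine_distance(embedding, '[1,1,3]') AS distance
FROM vector_table
ORDER BY distance
LIMIT 3;
-- 'banana' [1,1,2] is closest (distance about 0.015); 'apple' [1,1,1] and
-- 'dog' [2,2,2] point in the same direction as each other, so both land
-- at a distance of about 0.13 from the query vector.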
Advanced Usage: Creating an Index for Faster Searches
To optimize vector search performance, you can create an HNSW (Hierarchical Navigable Small World) index:
CREATE TABLE vector_table_with_index (
id INT PRIMARY KEY,
doc TEXT,
embedding VECTOR(3) COMMENT 'hnsw(distance=cosine)'
);
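With the index defined on cosine distance, the same top-k pattern as before, ORDER BY the indexed distance function plus a LIMIT, is what lets TiDB use the HNSW index rather than scanning every row. A sketch, assuming vector_table_with_index has been populated like the earlier table:
SELECT id, doc
FROM vector_table_with_index
ORDER BY vec_cosine_distance(embedding, '[1,1,3]')
LIMIT 3;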
Vector Functions Supported by TiDB
TiDB supports various vector functions, including:
- Vec_L1_Distance: Manhattan Distance
- Vec_L2_Distance: Euclidean Distance (L2)
- Vec_Cosine_Distance: Cosine Distance
- Vec_Negative_Inner_Product: Negative Inner Product
These functions allow for flexible and powerful similarity searches across different types of data.
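For example, with the vector_table from the demo above, you can compare how each metric scores the same rows against the query vector [1,1,3]:
SELECT
    doc,
    vec_l1_distance(embedding, '[1,1,3]')            AS l1,
    vec_l2_distance(embedding, '[1,1,3]')            AS l2,
    vec_cosine_distance(embedding, '[1,1,3]')        AS cosine,
    vec_negative_inner_product(embedding, '[1,1,3]') AS neg_inner_product
FROM vector_table;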
Conclusion
Understanding vectors and embeddings is crucial for leveraging advanced AI and machine learning techniques. TiDB’s vector search feature simplifies the process of implementing semantic search and similarity search on large datasets. By representing data as vectors and using embeddings, you can perform meaningful searches that go beyond simple keyword matching, enabling more sophisticated data retrieval and analysis.
For a detailed guide and more examples on using TiDB vector search, check out the full documentation and join the private beta to explore its capabilities firsthand.
Learn More
Here are some real apps and demos that show how to use vector storage to store embeddings:
- OpenAI Embedding: use the OpenAI embedding model to generate vectors for text data.
- Image Search: use the OpenAI CLIP model to generate vectors for images and text.
- LlamaIndex RAG with UI: use LlamaIndex to build a RAG (Retrieval-Augmented Generation) application.
- Chat with URL: use LlamaIndex to build a RAG application that can chat with the contents of a URL.
- GraphRAG: about 20 lines of code using TiDB Serverless to build a Knowledge Graph-based RAG application.
- GraphRAG Step by Step Tutorial: a step-by-step tutorial for building a Knowledge Graph-based RAG application with a Colab notebook. In this tutorial, you will learn how to extract knowledge from a text corpus, build a Knowledge Graph, store it in TiDB Serverless, and search it.