TiDB, a MySQL-compatible database, has introduced a powerful feature for handling high-dimensional data: Vector Search Indexes. This post will explore how TiDB implements these indexes using the Hierarchical Navigable Small World (HNSW) method, and how they can be utilized for efficient nearest neighbor searches.
What are Vector Search Indexes?
Vector Search Indexes are designed to facilitate efficient approximate nearest neighbor (ANN) searches in a vector space. This is particularly useful for applications involving high-dimensional data like image recognition, recommendation systems, and natural language processing. TiDB’s implementation allows such queries to be completed in milliseconds, vastly improving performance over traditional brute force methods.
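For intuition about what the index avoids, a brute-force nearest-neighbor search must compute the distance from the query to every stored vector. A minimal Python sketch (illustrative only; `cosine_distance` and `brute_force_knn` are our own names, not TiDB APIs):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; the same metric
    # TiDB's Vec_Cosine_Distance is built around.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k):
    # O(N * d): scans every vector. An ANN index such as HNSW
    # answers the same query while visiting only a small fraction of N.
    return sorted(range(len(vectors)),
                  key=lambda i: cosine_distance(query, vectors[i]))[:k]

vectors = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [2, 0, 0]]
# [1,0,0] and [2,0,0] are parallel to the query, so both have distance 0.
print(brute_force_knn([1, 0, 0], vectors, 2))  # → [0, 3]
```

Note that cosine distance ignores vector magnitude, which is why `[2, 0, 0]` ties with an exact match; choose L2 distance instead when magnitude matters.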
Creating an HNSW Vector Index in TiDB
TiDB supports the creation of HNSW Vector Indexes using the following SQL syntax:
CREATE TABLE vector_table_with_index (
id INT PRIMARY KEY,
doc TEXT,
embedding VECTOR(3) COMMENT "hnsw(distance=cosine)"
);
Note: The syntax for creating the HNSW index may change in future releases. You must specify the distance metric (e.g., cosine or L2) when creating the vector index.
Limitations and Compatibility
Currently, TiDB only supports creating vector indexes with the cosine and L2 distance metrics, and only at table creation time. Adding or dropping vector indexes with DDL statements after the table is created is not yet supported, but is planned for a future release.
Utilizing Vector Indexes
Vector Indexes can be used in SQL queries to perform k-nearest neighbor searches. Here’s an example:
SELECT *
FROM vector_table_with_index
ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]')
LIMIT 10;
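Semantically, this is a top-k query: rank every row by its distance to the query vector and keep the k closest. A hedged Python sketch of the same semantics, using made-up table data (the index lets TiDB skip the full scan this sketch performs):

```python
import heapq
import math

def vec_cosine_distance(a, b):
    # Illustrative stand-in for TiDB's Vec_Cosine_Distance.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

# Rows standing in for vector_table_with_index: (id, doc, embedding).
rows = [
    (1, "cat",  [1.0, 2.0, 3.0]),
    (2, "dog",  [-1.0, -2.0, -3.0]),
    (3, "fish", [3.0, 2.0, 1.0]),
]
query = [1.0, 2.0, 3.0]

# Equivalent of:
#   SELECT * FROM vector_table_with_index
#   ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]') LIMIT 2;
top2 = heapq.nsmallest(2, rows,
                       key=lambda r: vec_cosine_distance(r[2], query))
print([r[0] for r in top2])  # → [1, 3]
```

`heapq.nsmallest` mirrors the `ORDER BY ... LIMIT` pattern: it keeps only k candidates in memory rather than sorting the whole result set.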
To benefit from the index, the query must use the same distance function that was specified when the index was created; otherwise the optimizer cannot use the index and falls back to scanning every row.
Integration with ORMs
TiDB provides support for various Python ORMs, enabling easier integration into applications:
- TiDB Vector Client for Python: GitHub Link
- SQLAlchemy: GitHub Link
- Peewee: GitHub Link
- Django: GitHub Link
Performance Analysis
To analyze the performance and ensure the Vector Index is being used, you can use the EXPLAIN or EXPLAIN ANALYZE statements in TiDB:
EXPLAIN SELECT * FROM vector_table_with_index
ORDER BY Vec_Cosine_Distance(embedding, '[1, 2, 3]')
LIMIT 10;
Best Practices
To ensure optimal performance, especially when an index is “cold” (not recently accessed), it is recommended to warm it up by running representative queries beforehand. Keeping vectors compact, by using fewer dimensions or by compressing them, also helps maintain high performance as the data set grows.
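On the compression point, one widely used technique is scalar quantization: mapping each float32 component to an int8, cutting vector storage roughly 4x at the cost of some precision. This is a general sketch of the idea, not a TiDB feature:

```python
def quantize_int8(vec, lo, hi):
    # Map each float in [lo, hi] to an integer in [-127, 127].
    scale = 254.0 / (hi - lo)
    return [round((x - lo) * scale) - 127 for x in vec]

def dequantize_int8(qvec, lo, hi):
    # Invert the mapping; the result differs from the original by at
    # most half a quantization step.
    scale = (hi - lo) / 254.0
    return [(q + 127) * scale + lo for q in qvec]

v = [0.1, -0.5, 0.9]
q = quantize_int8(v, -1.0, 1.0)         # 3 small ints instead of 3 floats
restored = dequantize_int8(q, -1.0, 1.0)
# Each restored component stays within one quantization step (~0.008 here).
assert all(abs(a - b) < 0.01 for a, b in zip(v, restored))
print(q)  # → [13, -63, 114]
```

The precision loss is bounded by the step size `(hi - lo) / 254`, so quantization works best when embedding components fall in a known, narrow range (e.g., normalized vectors).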
Conclusion
Vector Search Indexes in TiDB offer a robust solution for efficiently handling complex queries involving high-dimensional data. By leveraging these indexes, developers can significantly enhance the performance of their applications, making real-time data interaction more feasible.
Real Demos of TiDB Vector Search
- OpenAI Embedding: use the OpenAI embedding model to generate vectors for text data.
- Image Search: use the OpenAI CLIP model to generate vectors for images and text.
- LlamaIndex RAG with UI: use LlamaIndex to build a RAG (Retrieval-Augmented Generation) application.
- Chat with URL: use LlamaIndex to build a RAG application that can chat with the contents of a URL.
- GraphRAG: build a Knowledge Graph-based RAG application on TiDB Serverless in about 20 lines of code.
- GraphRAG Step-by-Step Tutorial: a step-by-step tutorial, with a Colab notebook, for building a Knowledge Graph-based RAG application: extract knowledge from a text corpus, build a Knowledge Graph, store it in TiDB Serverless, and search it.