Introduction

As AI and machine learning applications have grown, so has the need to search data by meaning rather than by exact keywords. TiDB, a MySQL-compatible distributed SQL database, addresses this need with vector search. This article shows how to add vector columns to a MySQL-compatible database using TiDB and how to query them.

Understanding Vector Search

Vector search, also known as similarity search or semantic search, lets you search by the meaning of the data rather than by its exact contents. This is particularly useful for text, images, video, and other types of data where traditional keyword-based search falls short.

In vector search, data is represented as embeddings—numerical vectors that capture the semantic essence of the data. TiDB’s vector search uses these embeddings to perform searches based on their semantic similarity, enabling powerful applications in generative AI, semantic search, and more.
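To make the idea concrete, here is a minimal sketch of comparing embeddings by direction. The 2-dimensional vectors and their labels are made up for illustration (real embedding models produce hundreds or thousands of dimensions), and cosine_similarity is a hypothetical helper written for this example, not a TiDB function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy "embeddings"; a real model would generate these.
toy_embeddings = {
    "kitten": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "truck": [0.1, 0.9],
}

# Rank the other items by how closely they point in the same direction as "kitten".
query = toy_embeddings["kitten"]
ranked = sorted(
    (label for label in toy_embeddings if label != "kitten"),
    key=lambda label: cosine_similarity(query, toy_embeddings[label]),
    reverse=True,
)
print(ranked)
```

With these toy values, "cat" ranks ahead of "truck" even though the two words share no characters with "kitten" in any meaningful way, which is the behavior keyword matching cannot provide.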

Creating a Vector Column in TiDB

TiDB makes it straightforward to integrate vector search capabilities into your database. Here’s a step-by-step guide to creating a table with vector columns and performing basic operations.

Step 1: Create a TiDB Serverless Cluster

To get started, create a TiDB Serverless cluster with vector support enabled:

  1. Sign up at tidbcloud.com.
  2. Select the eu-central-1 region, as vector search is currently available only in this region.
  3. Follow the tutorial to create a TiDB Serverless Cluster with vector support.

Step 2: Create a Table with a Vector Column

Once your cluster is set up, you can create a table with a vector column. Here’s an example of creating a table with a 3-dimensional vector field:

CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    doc TEXT,
    embedding VECTOR(3)
);

You can then insert records into this table:

INSERT INTO vector_table VALUES 
(1, 'apple', '[1,1,1]'),
(2, 'banana', '[1,1,2]'),
(3, 'dog', '[2,2,2]');
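When rows come from application code, each vector has to be turned into the bracketed text form used in the INSERT above. The following is a minimal sketch, assuming that text format; to_vector_literal is a hypothetical helper name, and its dimension check simply mirrors the VECTOR(3) column definition.

```python
def to_vector_literal(values, dimensions=3):
    """Format numbers as the bracketed text form used in the INSERT above."""
    values = list(values)
    if len(values) != dimensions:
        raise ValueError(f"expected {dimensions} values, got {len(values)}")
    return "[" + ",".join(str(float(v)) for v in values) + "]"


rows = [
    (1, "apple", to_vector_literal([1, 1, 1])),
    (2, "banana", to_vector_literal([1, 1, 2])),
    (3, "dog", to_vector_literal([2, 2, 2])),
]

# Parameterized statement; %s is the placeholder style used by common MySQL drivers.
insert_sql = "INSERT INTO vector_table (id, doc, embedding) VALUES (%s, %s, %s)"
for row in rows:
    print(insert_sql, row)
```

With a MySQL driver cursor, the same statement and rows could be passed to executemany instead of being printed. Checking the length before sending the row surfaces a dimension mismatch in the application rather than at the database.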

Step 3: Perform Vector Search Queries

TiDB allows you to search for nearest neighbors based on vector similarity. For example, to find the nearest neighbors to a given vector using cosine distance, you can run:

SELECT * FROM vector_table 
ORDER BY vec_cosine_distance(embedding, '[1,1,3]') 
LIMIT 3;

This query returns up to three records ordered by their cosine distance to the vector [1,1,3], closest first. Because the ordering is by distance, smaller values mean more similar rows.
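As a sanity check on that ordering, the distances for the three sample rows can be worked out by hand. This sketch assumes cosine distance is defined as 1 minus cosine similarity; the helper is illustrative Python, not TiDB's implementation.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# The rows inserted into vector_table and the query vector from the SELECT.
rows = {"apple": [1, 1, 1], "banana": [1, 1, 2], "dog": [2, 2, 2]}
query = [1, 1, 3]

distances = {doc: cosine_distance(vec, query) for doc, vec in rows.items()}
for doc, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{doc}: {dist:.4f}")
```

Under that definition, banana is closest at roughly 0.0153. Apple and dog both come to roughly 0.1296 because [1,1,1] and [2,2,2] point in the same direction, so their relative order is not determined by the distance alone.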

Enhancing Performance with Vector Indexes

To speed up vector search queries, you can create an HNSW (Hierarchical Navigable Small World) index, which is optimized for vector searches. Here’s how to create a table with an HNSW index:

CREATE TABLE vector_table_with_index (
    id INT PRIMARY KEY,
    doc TEXT,
    embedding VECTOR(3) COMMENT 'hnsw(distance=cosine)'
);
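The distance metric named in the index definition is a real design choice, because different metrics can rank the same rows differently. The sketch below compares cosine distance with Euclidean (L2) distance on the sample rows from earlier; both functions are plain-Python illustrations, not TiDB functions.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance: compares direction and magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rows = {"apple": [1, 1, 1], "banana": [1, 1, 2], "dog": [2, 2, 2]}
query = [1, 1, 3]

l2_ranking = sorted(rows, key=lambda doc: l2_distance(rows[doc], query))
print("L2 ranking:", l2_ranking)
print("cosine(apple, dog):", round(cosine_distance(rows["apple"], rows["dog"]), 6))
print("L2(apple, dog):", round(l2_distance(rows["apple"], rows["dog"]), 6))
```

Cosine distance treats apple and dog as identical because they differ only in length, while L2 distance separates them and places dog ahead of apple for this query. Which behavior is preferable depends on whether vector magnitude carries meaning for your embeddings.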

Advanced Usage: Integrating with AI Frameworks

TiDB’s vector search can be combined with AI frameworks and embedding providers. For instance, you can use OpenAI’s embedding models for semantic search. Here’s a Python example that uses the tidb-vector library with the peewee ORM and the OpenAI client:

import os
from openai import OpenAI
from peewee import Model, MySQLDatabase, TextField, SQL
from tidb_vector.peewee import VectorField

# Set up database connection
db = MySQLDatabase(
    'test',
    user=os.environ.get('TIDB_USERNAME'),
    password=os.environ.get('TIDB_PASSWORD'),
    host=os.environ.get('TIDB_HOST'),
    port=4000,
    ssl_verify_cert=True,
    ssl_verify_identity=True
)

# Define model
class DocModel(Model):
    text = TextField()
    # text-embedding-3-small returns 1536-dimensional vectors by default
    embedding = VectorField(dimensions=1536)

    class Meta:
        database = db
        table_name = "doc_test"

# Connect to database and create table
db.connect()
db.create_tables([DocModel])

# Example documents
documents = [
    "TiDB is an open-source distributed SQL database...",
    "TiFlash is the key component...",
    "TiKV is a distributed and transactional key-value database..."
]

# Generate embeddings using OpenAI (one request per document)
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
embeddings = [
    client.embeddings.create(input=doc, model="text-embedding-3-small").data[0].embedding
    for doc in documents
]

# Insert documents into database
data_source = [{"text": doc, "embedding": emb} for doc, emb in zip(documents, embeddings)]
DocModel.insert_many(data_source).execute()

# Query for similar documents, closest (smallest cosine distance) first
question = "what is TiKV?"
question_embedding = client.embeddings.create(input=question, model="text-embedding-3-small").data[0].embedding
related_docs = (
    DocModel
    .select(DocModel.text, DocModel.embedding.cosine_distance(question_embedding).alias("distance"))
    .order_by(SQL("distance"))
    .limit(3)
)

# Display results
for doc in related_docs:
    print(doc.distance, doc.text)

db.close()
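The example above sends one embeddings request per document. The OpenAI embeddings endpoint also accepts a list of inputs, so a batch can go out in a single call. The following is a minimal sketch: embed_batch is a hypothetical helper, and the fake client is an offline stand-in that only mimics the response shape so the snippet runs without an API key. With a real OpenAI client you would pass that object instead.

```python
from types import SimpleNamespace


def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts with a single embeddings request."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


class _FakeEmbeddings:
    """Offline stand-in that mimics the response shape; not a real model."""

    def create(self, input, model):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = embed_batch(fake_client, ["TiDB", "TiKV is a key-value store"])
print(vectors)
```

Sorting by the returned index keeps each embedding aligned with its source text, which matters when the results are zipped back together with the documents before insertion.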

Conclusion

TiDB’s vector search capabilities empower you to build advanced search applications by leveraging the semantic meaning of your data. Whether you’re working with text, images, or other data types, integrating vector columns in TiDB can significantly enhance your database’s search functionality.

Start exploring TiDB’s vector search today and take your applications to the next level. Sign up for TiDB Serverless at tidb.cloud.

Call to Action

Ready to unlock the potential of vector search? Get started with TiDB Serverless and experience the future of database technology. Try TiDB Serverless now.


Last updated June 26, 2024
