AI-Powered Search with TiDB Vector

Introduction

The evolution of AI has brought significant advancements in search technologies. Traditional keyword-based search is being increasingly replaced by AI-powered search, which leverages machine learning models to understand the semantic meaning of queries and data. TiDB Vector, a feature of TiDB, offers a robust solution for implementing AI-powered search, enabling semantic search and similarity search for various data types such as text, images, and more.

What is TiDB Vector Search?

TiDB Vector Search is a powerful tool that allows you to perform searches based on the semantic meaning of data rather than just keywords. This is achieved by converting data into vector embeddings, which are then used to measure the similarity between different pieces of data.
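
To make "similarity between embeddings" concrete, here is a minimal sketch of cosine distance in plain Python. The three-dimensional vectors are made-up values for illustration only; real embeddings come from a model and usually have hundreds or thousands of dimensions. The function name `cosine_distance` is chosen for this sketch.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 means same direction, 2 means opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_distance(cat, kitten))   # small: the vectors point the same way
print(cosine_distance(cat, invoice))  # large: the vectors are nearly orthogonal
```

A smaller distance means the two items are closer in meaning, which is why search results are ordered by ascending distance.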

Key Features:

Semantic Search: Find data that has a similar meaning to your query.
Similarity Search: Identify similar items in large datasets, useful for recommendations and content discovery.
Versatile Data Support: Works with text, images, video, audio, and more.

How TiDB Vector Search Works

TiDB Vector Search involves the following steps:

Embedding Generation: Convert data into vectors using an embedding model. Each embedding represents a data point in a multi-dimensional space.
Vector Storage: Store these vectors in TiDB, allowing for efficient querying and retrieval.
Similarity Search: Use vector distance metrics to find the nearest neighbors (most similar data points) to a query vector.
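
The three steps can be sketched end to end without any external service. This is a toy illustration under stated assumptions, not how TiDB implements search: `toy_embed` is a hypothetical word-count embedding over a fixed vocabulary, an in-memory list stands in for a TiDB table, and the search is a brute-force scan using Euclidean (L2) distance.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; a real system would use a learned embedding model.
VOCAB = ["database", "sql", "key", "value", "image", "search"]

def toy_embed(text):
    """Step 1: turn text into a vector by counting vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def nearest(store, query_vector, k=1):
    """Step 3: brute-force nearest neighbors by Euclidean (L2) distance."""
    ranked = sorted(store, key=lambda row: math.dist(row["embedding"], query_vector))
    return ranked[:k]

# Step 2: "store" the vectors; this list stands in for a database table.
texts = [
    "a distributed sql database",
    "a key value store",
    "image search by example",
]
store = [{"text": t, "embedding": toy_embed(t)} for t in texts]

best = nearest(store, toy_embed("which sql database should I use"), k=1)
print(best[0]["text"])
```

In a real deployment, the embedding model replaces `toy_embed` and the database performs the distance computation and ordering.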

Setting Up TiDB Vector Search

To get started with TiDB Vector Search, follow these steps:

Create a TiDB Serverless Cluster

Sign Up: Join TiDB Cloud.
Select Region: Choose the eu-central-1 region, which currently supports vector search.
Create Cluster: Follow the quickstart guide to set up your cluster.

Enable Vector Search

If the vector search feature is not visible, contact the support team at xin.shi@pingcap.com to enable it for your account.

Connect to TiDB

Navigate to Clusters: Go to the Clusters page and select your cluster.
Connect: Click on “Connect” and select “General” from the dropdown. Keep the endpoint type as Public.
Set Password: If not already set, create a password for your cluster.
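
Once you have the endpoint, user name, and password, a small helper can collect them from environment variables and fail early if one is missing. This is a sketch under assumptions: the variable names `TIDB_HOST`, `TIDB_USERNAME`, and `TIDB_PASSWORD` match the example later in this article, while `TIDB_PORT`, `TIDB_DATABASE`, and both function names are hypothetical additions for illustration. The connection check assumes the `pymysql` package is installed.

```python
import os

def tidb_connection_kwargs(env=None):
    """Collect connection settings from environment variables, failing early if one is missing."""
    env = os.environ if env is None else env
    required = ["TIDB_HOST", "TIDB_USERNAME", "TIDB_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),      # hypothetical override
        "user": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env.get("TIDB_DATABASE", "test"),   # hypothetical override
        "ssl_verify_cert": True,
        "ssl_verify_identity": True,
    }

def check_connection():
    """Open a connection and run a trivial query; requires the pymysql package."""
    import pymysql  # imported lazily so the helper above works without it
    connection = pymysql.connect(**tidb_connection_kwargs())
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone()[0] == 1
    finally:
        connection.close()
```

Calling `check_connection()` after exporting the three variables is a quick way to confirm the credentials before running the full example.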

Example: Semantic Search with OpenAI and TiDB

Here’s a practical example of using OpenAI’s embeddings for semantic search with TiDB Vector.

Prerequisites:

Python >= 3.8 (the OpenAI client used below does not support Python 3.6)
TiDB Serverless Cluster with vector support

Setting Up Environment

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install openai peewee pymysql tidb_vector

Example Code

import os
from openai import OpenAI
from peewee import Model, MySQLDatabase, SQL, TextField
from tidb_vector.peewee import VectorField

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
embedding_model = "text-embedding-ada-002"
embedding_dimensions = 1536

# Connect to TiDB
db = MySQLDatabase(
    'test',
    user=os.environ.get('TIDB_USERNAME'),
    password=os.environ.get('TIDB_PASSWORD'),
    host=os.environ.get('TIDB_HOST'),
    port=4000,
    ssl_verify_cert=True,
    ssl_verify_identity=True
)

# Sample documents
documents = [
    "TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.",
    "TiFlash is the key component that makes TiDB essentially an HTAP database.",
    "TiKV is a distributed and transactional key-value database, providing transactional APIs with ACID compliance."
]

# Define model
class DocModel(Model):
    text = TextField()
    embedding = VectorField(dimensions=embedding_dimensions)

    class Meta:
        database = db
        table_name = "doc_test"

# Setup database
db.connect()
db.drop_tables([DocModel])
db.create_tables([DocModel])

# Generate embeddings
embeddings = [r.embedding for r in client.embeddings.create(input=documents, model=embedding_model).data]
data_source = [{"text": doc, "embedding": emb} for doc, emb in zip(documents, embeddings)]
DocModel.insert_many(data_source).execute()

# Query example
question = "What is TiKV?"
question_embedding = client.embeddings.create(input=question, model=embedding_model).data[0].embedding
related_docs = (
    DocModel.select(
        DocModel.text,
        DocModel.embedding.cosine_distance(question_embedding).alias("distance"),
    )
    .order_by(SQL("distance"))  # smallest distance (most similar) first
    .limit(3)
)

print("Question:", question)
print("Related documents:")
for doc in related_docs:
    print(doc.distance, doc.text)

db.close()
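
The distance printed for each document is a cosine distance, so smaller values mean closer matches. A common follow-up step is to drop weak matches and report a similarity score instead. The sketch below assumes results are available as (distance, text) pairs; the function name `filter_by_distance`, the 0.25 threshold, and the sample distances are all made up for illustration and would need tuning for a real embedding model.

```python
def filter_by_distance(rows, max_distance=0.25):
    """Keep (distance, text) pairs within max_distance and add a similarity score."""
    kept = []
    for distance, text in rows:
        if distance <= max_distance:
            kept.append({
                "text": text,
                "distance": distance,
                "similarity": 1.0 - distance,  # cosine similarity = 1 - cosine distance
            })
    return sorted(kept, key=lambda row: row["distance"])

# Made-up distances for illustration; real values depend on the embedding model.
sample_rows = [
    (0.31, "doc about TiFlash"),
    (0.12, "doc about TiKV"),
    (0.22, "doc about TiDB"),
]

for row in filter_by_distance(sample_rows):
    print(round(row["similarity"], 2), row["text"])
```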

Conclusion

TiDB Vector Search provides a powerful platform for building AI-powered search applications. By leveraging vector embeddings and similarity search, you can implement advanced search capabilities that go beyond traditional keyword-based methods. Whether you’re dealing with text, images, or other types of data, TiDB Vector Search can help you unlock new possibilities for your applications.

Call to Action

Ready to explore TiDB Serverless and build your own AI-powered search applications? Get started with TiDB and discover the power of semantic search today!

Last updated June 26, 2024
