Book a Demo Start Instantly

In this tutorial, we will walk through building a Retrieval-Augmented Generation (RAG) system using the Jina.AI embeddings API and TiDB’s Vector Storage. This combination allows for efficient handling of vectorized data and powerful retrieval capabilities. We will cover the following steps:

  1. Setting up your environment.
  2. Generating embeddings using Jina.AI.
  3. Connecting to TiDB and creating a vector-enabled table.
  4. Inserting and querying data in TiDB.

Prerequisites

Before we begin, ensure you have the following:

  • Python installed on your machine.
  • Access to Jina.AI API and TiDB serverless instance.
    • Please make sure you have created a TiDB Serverless cluster with vector support enabled. Join the waitlist for the private beta at tidb.cloud/ai.
      • Sign up TiDB Cloud
      • Follow this tutorial to create a TiDB Serverless cluster with vector support enabled
      • Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page
      • Click Connect in the upper-right corner.
      • In the connection dialog, select General from the Connect With dropdown and keep the default setting of the Endpoint Type as Public.
      • If you have not set a password yet, click Create password to generate a random password.
      • Save the connection parameters to a safe place. You will need them to connect to the TiDB Serverless cluster in the following steps.
    • Generate an API token for Jina.AI embeddings API
      • Go to https://jina.ai/embeddings/
      • Scroll down to the API part, you will get a token for free with 1million tokens:
      • Save this token for future use. If you exceed the free quota of 1 million tokens, you will need this token to charge for additional usage.
Embedding API
  • Required Python libraries installed (requests, sqlalchemy, pymysql, dotenv).

Step 1: Setting Up Your Environment

First, let’s set up our environment by loading the necessary credentials and libraries.

import os
import requests
import dotenv
from sqlalchemy import Column, Integer, String, create_engine, URL
from sqlalchemy.orm import Session, declarative_base
from tidb_vector.sqlalchemy import VectorType

dotenv.load_dotenv()

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
TIDB_USERNAME = os.getenv('TIDB_USERNAME')
TIDB_PASSWORD = os.getenv('TIDB_PASSWORD')
TIDB_HOST = os.getenv('TIDB_HOST')
TIDB_PORT = os.getenv('TIDB_PORT')
TIDB_DATABASE = os.getenv('TIDB_DATABASE')

Step 2: Generating Embeddings with Jina.AI

We will generate text embeddings using the Jina.AI API. The embeddings will be stored in TiDB for efficient querying.

Define the Function to Generate Embeddings

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v2-base-en'  # Dimensions: 768
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']

TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.',
]

data = []
for text in TEXTS:
    embedding = generate_embeddings(text)
    data.append({
        'text': text,
        'embedding': embedding
    })

Step 3: Connecting to TiDB and Creating a Vector-Enabled Table

Next, we’ll connect to TiDB and create a table that can store text data and their corresponding embeddings.

Connect to TiDB and Define the Table

url = URL(
    drivername="mysql+pymysql",
    username=TIDB_USERNAME,
    password=TIDB_PASSWORD,
    host=TIDB_HOST,
    port=int(TIDB_PORT),
    database=TIDB_DATABASE,
    query={"ssl_verify_cert": True, "ssl_verify_identity": True},
)
engine = create_engine(url, pool_recycle=300)
Base = declarative_base()

class Document(Base):
    __tablename__ = "jinaai_tidb_demo_documents"

    id = Column(Integer, primary_key=True)
    content = Column(String(255), nullable=False)
    content_vec = Column(
        VectorType(dim=768),
        comment="hnsw(distance=l2)"
    )
Base.metadata.create_all(engine)

Step 4: Inserting Data into TiDB

Now, we will insert the generated embeddings and their corresponding texts into the TiDB table.

Insert Data

with Session(engine) as session:
    print('- Inserting Data to TiDB...')
    for item in data:
        print(f'  - Inserting: {item["text"]}')
        session.add(Document(
            content=item['text'],
            content_vec=item['embedding']
        ))
    session.commit()

Step 5: Querying Data from TiDB

Finally, let’s query the data from TiDB to find the most relevant document based on a sample query.

Query the Data

query = 'What is TiDB?'
query_embedding = generate_embeddings(query)
with Session(engine) as session:
    print('- List All Documents and Their Distances to the Query:')
    for doc, distance in session.query(
        Document,
        Document.content_vec.cosine_distance(query_embedding).label('distance')
    ).all():
        print(f'  - {doc.content}: {distance}')

    print('- The Most Relevant Document and Its Distance to the Query:')
    doc, distance = session.query(
        Document,
        Document.content_vec.cosine_distance(query_embedding).label('distance')
    ).order_by(
        'distance'
    ).limit(1).first()
    print(f'  - {doc.content}: {distance}')

Output

When you run the above script, you should see output similar to this:

- Inserting Data to TiDB...
  - Inserting: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
  - Inserting: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
- List All Documents and Their Distances to the Query:
  - Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.: 0.3585317326132522
  - TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.: 0.10858658947444844
- The Most Relevant Document and Its Distance to the Query:
  - TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.: 0.10858658947444844

Conclusion

Congratulations! You have successfully built a RAG system using Jina.AI embeddings API and TiDB’s Vector Storage. This setup allows you to efficiently store and query vectorized data, enabling advanced search capabilities in your applications.

Feel free to expand this tutorial by adding more functionalities, such as handling larger datasets, optimizing queries, and integrating with other machine learning models.

Full demo: https://github.com/pingcap/tidb-vector-python/tree/main/examples/jina-ai-embeddings-demo

There are also more demos to demonstrate how to build AI apps, how to use TiDB as vector storage:


Last updated June 23, 2024

Spin up a Serverless database with 25GiB free resources.

Start Now