Build RAG with Jina.AI Embeddings API and TiDB Vector Storage

In this tutorial, we will walk through building a Retrieval-Augmented Generation (RAG) system using the Jina.AI embeddings API and TiDB’s Vector Storage. This combination allows for efficient handling of vectorized data and powerful retrieval capabilities. We will cover the following steps:

Setting up your environment.
Generating embeddings using Jina.AI.
Connecting to TiDB and creating a vector-enabled table.
Inserting and querying data in TiDB.

Prerequisites

Before we begin, ensure you have the following:

Python installed on your machine.
Access to Jina.AI API and TiDB serverless instance.
- Please make sure you have created a TiDB Serverless cluster with vector support enabled. Join the waitlist for the private beta at tidb.cloud/ai.
  - Sign up TiDB Cloud
  - Follow this tutorial to create a TiDB Serverless cluster with vector support enabled
  - Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page
  - Click Connect in the upper-right corner.
  - In the connection dialog, select General from the Connect With dropdown and keep the default setting of the Endpoint Type as Public.
  - If you have not set a password yet, click Create password to generate a random password.
  - Save the connection parameters to a safe place. You will need them to connect to the TiDB Serverless cluster in the following steps.
- Generate an API token for Jina.AI embeddings API
  - Go to https://jina.ai/embeddings/
  - Scroll down to the API part, you will get a token for free with 1million tokens:
  - Save this token for future use. If you exceed the free quota of 1 million tokens, you will need this token to charge for additional usage.

Required Python libraries installed (requests, sqlalchemy, pymysql, dotenv).

Step 1: Setting Up Your Environment

First, let’s set up our environment by loading the necessary credentials and libraries.

import os
import requests
import dotenv
from sqlalchemy import Column, Integer, String, create_engine, URL
from sqlalchemy.orm import Session, declarative_base
from tidb_vector.sqlalchemy import VectorType

dotenv.load_dotenv()

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
TIDB_USERNAME = os.getenv('TIDB_USERNAME')
TIDB_PASSWORD = os.getenv('TIDB_PASSWORD')
TIDB_HOST = os.getenv('TIDB_HOST')
TIDB_PORT = os.getenv('TIDB_PORT')
TIDB_DATABASE = os.getenv('TIDB_DATABASE')

Step 2: Generating Embeddings with Jina.AI

We will generate text embeddings using the Jina.AI API. The embeddings will be stored in TiDB for efficient querying.

Define the Function to Generate Embeddings

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v2-base-en'  # Dimensions: 768
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']

TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.',
]

data = []
for text in TEXTS:
    embedding = generate_embeddings(text)
    data.append({
        'text': text,
        'embedding': embedding
    })

Step 3: Connecting to TiDB and Creating a Vector-Enabled Table

Next, we’ll connect to TiDB and create a table that can store text data and their corresponding embeddings.

Connect to TiDB and Define the Table

url = URL(
    drivername="mysql+pymysql",
    username=TIDB_USERNAME,
    password=TIDB_PASSWORD,
    host=TIDB_HOST,
    port=int(TIDB_PORT),
    database=TIDB_DATABASE,
    query={"ssl_verify_cert": True, "ssl_verify_identity": True},
)
engine = create_engine(url, pool_recycle=300)
Base = declarative_base()

class Document(Base):
    __tablename__ = "jinaai_tidb_demo_documents"

    id = Column(Integer, primary_key=True)
    content = Column(String(255), nullable=False)
    content_vec = Column(
        VectorType(dim=768),
        comment="hnsw(distance=l2)"
    )
Base.metadata.create_all(engine)

Step 4: Inserting Data into TiDB

Now, we will insert the generated embeddings and their corresponding texts into the TiDB table.

Insert Data

with Session(engine) as session:
    print('- Inserting Data to TiDB...')
    for item in data:
        print(f'  - Inserting: {item["text"]}')
        session.add(Document(
            content=item['text'],
            content_vec=item['embedding']
        ))
    session.commit()

Step 5: Querying Data from TiDB

Finally, let’s query the data from TiDB to find the most relevant document based on a sample query.

Query the Data

query = 'What is TiDB?'
query_embedding = generate_embeddings(query)
with Session(engine) as session:
    print('- List All Documents and Their Distances to the Query:')
    for doc, distance in session.query(
        Document,
        Document.content_vec.cosine_distance(query_embedding).label('distance')
    ).all():
        print(f'  - {doc.content}: {distance}')

    print('- The Most Relevant Document and Its Distance to the Query:')
    doc, distance = session.query(
        Document,
        Document.content_vec.cosine_distance(query_embedding).label('distance')
    ).order_by(
        'distance'
    ).limit(1).first()
    print(f'  - {doc.content}: {distance}')

Output

When you run the above script, you should see output similar to this:

- Inserting Data to TiDB...
  - Inserting: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
  - Inserting: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
- List All Documents and Their Distances to the Query:
  - Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.: 0.3585317326132522
  - TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.: 0.10858658947444844
- The Most Relevant Document and Its Distance to the Query:
  - TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.: 0.10858658947444844

Conclusion

Congratulations! You have successfully built a RAG system using Jina.AI embeddings API and TiDB’s Vector Storage. This setup allows you to efficiently store and query vectorized data, enabling advanced search capabilities in your applications.

Feel free to expand this tutorial by adding more functionalities, such as handling larger datasets, optimizing queries, and integrating with other machine learning models.

Full demo: https://github.com/pingcap/tidb-vector-python/tree/main/examples/jina-ai-embeddings-demo

There are also more demos to demonstrate how to build AI apps, how to use TiDB as vector storage:

OpenAI Embedding: use the OpenAI embedding model to generate vectors for text data.
Image Search: use the OpenAI CLIP model to generate vectors for image and text.
LlamaIndex RAG with UI: use the LlamaIndex to build an RAG(Retrieval-Augmented Generation) application.
Chat with URL: use LlamaIndex to build an RAG(Retrieval-Augmented Generation) application that can chat with a URL.
GraphRAG: 20 lines code of using TiDB Serverless to build a Knowledge Graph based RAG application.
GraphRAG Step by Step Tutorial: Step by step tutorial to build a Knowledge Graph based RAG application with Colab notebook. In this tutorial, you will learn how to extract knowledge from a text corpus, build a Knowledge Graph, store the Knowledge Graph in TiDB Serverless, and search from the Knowledge Graph.
Vector Search Notebook with SQLAlchemy: use SQLAlchemy to interact with TiDB Serverless: connect db, index&store data and then search vectors.

Last updated June 23, 2024

Table of Contents

Spin up a Serverless database with 25GiB free resources.

Start Right Away