Hands-On Tutorial: HNSW in Python and C++

Hierarchical Navigable Small World (HNSW) is a cutting-edge algorithm that revolutionizes approximate nearest neighbor search, offering remarkable efficiency and scalability. This tutorial focuses on the practical implementation of HNSW in both Python and C++, providing you with hands-on experience to harness its power. By mastering HNSW, you’ll be equipped to tackle real-world applications across diverse industries, from e-commerce to healthcare, where fast and accurate similarity searches are crucial.

Understanding HNSW

What is HNSW?

Overview of the HNSW algorithm

Hierarchical Navigable Small World (HNSW) is a sophisticated algorithm designed for approximate nearest neighbor search. It leverages a multi-layer graph structure to enable efficient and scalable searches in high-dimensional spaces. The algorithm constructs a hierarchy of layers, each representing a different level of granularity. The top layer consists of a sparse graph with long-range connections, while the lower layers become progressively denser.

The search process begins at the top layer, where it quickly narrows down the search space using long-range links. As the search descends through the layers, it transitions to more localized searches, ultimately pinpointing the nearest neighbors with remarkable accuracy. This hierarchical approach significantly reduces the computational complexity compared to traditional methods like KD-trees or brute-force searches.
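
To make the descent concrete, here is a minimal, illustrative Python sketch of the greedy search. It is a simplification rather than the hnswlib internals: it assumes `layers` is a list of adjacency dictionaries ordered from the top layer down, each mapping a node id to its neighbor ids, and that `dist(node, query)` returns the distance between a node's stored vector and the query. The real algorithm keeps a dynamic candidate list of size ef instead of a single current node.

# Conceptual sketch of HNSW's layered greedy descent (not hnswlib's internals).
# `layers` is a list of adjacency dicts, top layer first, mapping node id ->
# list of neighbor ids; `dist(node, query)` measures distance from a node's
# vector to the query vector.
def greedy_search(layers, dist, query, entry_point):
    current = entry_point
    for graph in layers:        # descend from the sparse top layer to the dense bottom
        improved = True
        while improved:         # greedy walk: move to any closer neighbor
            improved = False
            for neighbor in graph.get(current, []):
                if dist(neighbor, query) < dist(current, query):
                    current = neighbor
                    improved = True
    return current              # the approximate nearest neighbor at the bottom layer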

Key concepts and terminology

To fully grasp HNSW, it’s essential to understand some key concepts and terminology:

  • Layers: HNSW uses multiple layers of graphs, each with varying densities. The top layers have fewer nodes and longer edges, while the bottom layers are denser.

  • Entry Point: The starting node for the search, typically located in the top layer.

  • Navigable Small World: A property of the graph that allows efficient traversal between nodes using a small number of hops.

  • Ef (Exploration Factor): The size of the dynamic candidate list kept during search. Higher values widen the search and increase accuracy, at the cost of more distance computations.

  • M (Max Connections): The maximum number of connections a node keeps per layer (the bottom layer typically allows twice this), which governs the graph’s density, memory footprint, and search efficiency.

Why Use HNSW?

Advantages over other nearest neighbor algorithms

HNSW stands out among nearest neighbor algorithms due to several compelling advantages:

  1. Efficiency: HNSW optimizes search efficiency by reducing the number of comparisons needed to find the nearest neighbors. Its hierarchical structure allows for quick elimination of distant candidates early in the search process.

  2. Scalability: The algorithm scales effectively with large datasets, maintaining high performance even as the number of data points grows. This makes it ideal for applications involving massive datasets.

  3. Accuracy: Despite being an approximate method, HNSW achieves high recall rates, approaching the accuracy of exact search at a fraction of the query cost.

  4. Flexibility: HNSW can be combined with other similarity search methods, such as Locality-Sensitive Hashing (LSH) or Multi-Index Hashing (MIH), to further enhance performance.

Use cases and applications

HNSW’s versatility makes it suitable for a wide range of applications across various industries:

  • E-commerce: Enhance product recommendations by quickly finding similar items based on user preferences and browsing history.

  • Healthcare: Improve diagnostic accuracy by matching patient records with similar cases, aiding in personalized treatment plans.

  • Image and Video Search: Enable fast and accurate retrieval of similar images or video clips from large multimedia databases.

  • Natural Language Processing (NLP): Enhance text similarity searches for applications like document clustering, plagiarism detection, and semantic search.

  • AI and Machine Learning: Optimize model training and inference by efficiently handling large-scale vector embeddings.

Setting Up the Environment

Before diving into the implementation of HNSW in Python and C++, it’s crucial to set up your development environment correctly. This section will guide you through the necessary steps to ensure you have all the required tools and libraries.

Python Environment Setup

Required libraries and dependencies

To implement HNSW in Python, you’ll need the HNSWlib library, which provides a fast and memory-efficient implementation of the HNSW algorithm. Additionally, you’ll need some common libraries for data handling and numerical computations.

  • HNSWlib: The core library for HNSW implementation.

  • NumPy: For numerical operations.

  • scikit-learn: Useful for dataset generation and evaluation.

  • matplotlib: Optional, for visualizing results.

Installation steps

Setting up the Python environment is straightforward. Follow these steps to install the required libraries:

  1. Install Python: Ensure you have Python 3.6 or later installed. You can download it from the official Python website.

  2. Create a virtual environment: It’s good practice to create a virtual environment to manage your project dependencies.


python -m venv hnsw_env
source hnsw_env/bin/activate  # On Windows, use `hnsw_env\Scripts\activate`


  3. Install the required libraries:

pip install hnswlib numpy scikit-learn matplotlib


  4. Verify the installation: Run a simple script to ensure everything is set up correctly.

import hnswlib
import numpy as np

# A successful import confirms the installation. Note that hnswlib may not
# expose a __version__ attribute; use `pip show hnswlib` to check the version.
print("NumPy version:", np.__version__)
print("hnswlib imported successfully")


C++ Environment Setup

Required tools and libraries

For implementing HNSW in C++, you’ll need a few essential tools and libraries. The HNSWlib library also supports C++ interfaces, making it a versatile choice for both Python and C++ implementations.

  • HNSWlib: The core library for HNSW implementation.

  • CMake: A cross-platform tool to manage the build process.

  • Microsoft C++ Build Tools: If you’re on Windows and don’t have a C++ compiler, you can download these tools.

  • Eigen: A C++ template library for linear algebra. It is optional here: hnswlib itself is header-only and does not depend on Eigen, but it is useful if your own code performs heavier matrix work.

Installation steps

Setting up the C++ environment involves a few more steps compared to Python, but it’s manageable with the following instructions:

  1. Install a C++ compiler: Ensure you have a C++ compiler installed. On Windows, you can use the Microsoft C++ Build Tools. On Linux and macOS, you can use g++ or clang.

  2. Install CMake: Download and install CMake from the official CMake website.

  3. Download HNSWlib: Clone the HNSWlib repository from GitHub.


git clone https://github.com/nmslib/hnswlib.git
cd hnswlib


  4. Download Eigen (optional): Only needed if your own code uses Eigen for matrix operations; hnswlib does not require it.

git clone https://gitlab.com/libeigen/eigen.git


  5. Build HNSWlib (optional): hnswlib is header-only, so this step builds the bundled examples and tests rather than a library you must link against. It is still a useful smoke test for your toolchain.

mkdir build
cd build
cmake ..
make


  6. Include HNSWlib in your project: When compiling your C++ code, add the HNSWlib headers to your include path. Because the library is header-only, there is no separate library to link.

#include <hnswlib/hnswlib.h>
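
For a single-file program, the compile command might look like the following; the include path, source file name, and binary name are placeholders to adapt to your setup:

g++ -O3 -std=c++11 -I/path/to/hnswlib my_program.cpp -o my_program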


By following these steps, you’ll have a robust environment ready for implementing HNSW in both Python and C++. With the setup complete, you’re now prepared to dive into coding the algorithm and exploring its powerful capabilities.

Implementing HNSW in Python

Coding the Algorithm

Step-by-step implementation

Implementing the HNSW algorithm in Python involves several key steps. We’ll use the hnswlib library, which provides a robust and efficient implementation of HNSW. Let’s walk through the process step by step.

  1. Import the necessary libraries:

import hnswlib
import numpy as np


  2. Initialize the HNSW index:

# Set parameters
dim = 128  # Dimension of the vectors
num_elements = 10000  # Number of elements to add

# Initialize the index
p = hnswlib.Index(space='l2', dim=dim)  # 'l2' refers to the Euclidean distance


  3. Create random data for demonstration:

# Generate random data
data = np.float32(np.random.random((num_elements, dim)))


  4. Build the index:

# Set the number of threads used during the build process
p.set_num_threads(4)

# Build the index
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
p.add_items(data)


  5. Set the query parameters:

# Set the exploration factor (ef)
p.set_ef(50)  # ef should always be greater than k


  6. Query the index:

# Generate random queries
query_data = np.float32(np.random.random((100, dim)))

# Perform the query
labels, distances = p.knn_query(query_data, k=10)
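
  7. (Optional) Persist and reload the index: hnswlib provides save_index and load_index for this; the file name below is just an example.

# Save the index to disk
p.save_index("hnsw_index.bin")

# Reload it into a fresh index object
p2 = hnswlib.Index(space='l2', dim=dim)
p2.load_index("hnsw_index.bin", max_elements=num_elements)
p2.set_ef(50)  # set query-time parameters such as ef on the reloaded index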


Code snippets and explanations

The above code snippets demonstrate how to initialize and build an HNSW index using hnswlib. Here’s a breakdown of the key components:

  • Initialization: We start by importing the necessary libraries and initializing the HNSW index with the desired space (Euclidean distance) and dimensionality. Other metrics are available as well; see the snippet after this list.

  • Data Generation: For demonstration purposes, we generate random data points.

  • Index Building: The init_index method initializes the index with parameters like max_elements, ef_construction, and M. These parameters control the maximum number of elements, the trade-off between construction time and accuracy, and the maximum number of connections per element, respectively.

  • Querying: We set the exploration factor (ef) and perform a k-nearest neighbors query on the index.
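
The space argument is not limited to 'l2': hnswlib also ships inner-product and cosine spaces, so switching metrics is a one-line change at initialization.

# Alternative distance metrics supported by hnswlib
p_ip = hnswlib.Index(space='ip', dim=dim)       # inner (dot) product
p_cos = hnswlib.Index(space='cosine', dim=dim)  # cosine similarity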

Testing the Implementation

Test cases and expected results

Testing your HNSW implementation is crucial to ensure it functions correctly. Here are some test cases you can use:

  1. Basic Functionality:

assert len(labels) == 100  # Ensure we get results for all queries
assert labels.shape[1] == 10  # Ensure each query returns 10 nearest neighbors


  2. Accuracy Check:

# Compare results with a brute-force search
from sklearn.metrics.pairwise import euclidean_distances

true_distances = euclidean_distances(query_data, data)
true_labels = np.argsort(true_distances, axis=1)[:, :10]

# Recall@10: the fraction of true nearest neighbors that HNSW also returned
recall = np.mean([np.isin(labels[i], true_labels[i]).mean() for i in range(len(labels))])
print(f"Recall@10: {recall:.3f}")


  3. Performance Benchmark:

import time

start_time = time.perf_counter()  # perf_counter is more precise than time.time for benchmarks
p.knn_query(query_data, k=10)
end_time = time.perf_counter()
print(f"Query time for 100 queries: {end_time - start_time:.4f} seconds")


Debugging common issues

When implementing HNSW, you might encounter some common issues. Here are a few tips to help you debug:

  • Incorrect Dimensions: Ensure that the dimensions of your data and queries match the dimensionality specified during index initialization.

  • Parameter Tuning: Adjust the ef and M parameters to balance between speed and accuracy. Higher values generally improve accuracy but increase computational cost; a quick way to explore this trade-off is shown in the sketch after this list.

  • Thread Management: If you encounter performance issues, experiment with the number of threads used during index building and querying.
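
As a concrete starting point for tuning, the short sweep below measures recall and latency for several ef values. It assumes p, query_data, true_labels, and np from the earlier steps are still in scope.

import time

# Sweep the query-time ef and report recall@10 and latency for each setting
for ef in [16, 50, 100, 200]:
    p.set_ef(ef)
    start = time.perf_counter()
    labels, _ = p.knn_query(query_data, k=10)
    elapsed = time.perf_counter() - start
    recall = np.mean([np.isin(labels[i], true_labels[i]).mean() for i in range(len(labels))])
    print(f"ef={ef:4d}  recall@10={recall:.3f}  time={elapsed:.4f}s")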

By following these steps and testing thoroughly, you can ensure a robust and efficient implementation of the HNSW algorithm in Python. This powerful tool will enable you to perform fast and accurate similarity searches, unlocking new possibilities in your applications.

Implementing HNSW in C++

In this section, we’ll delve into the implementation of the Hierarchical Navigable Small World (HNSW) algorithm using C++. By following a step-by-step approach, you’ll gain a comprehensive understanding of how to code and test HNSW in a C++ environment.

Coding the Algorithm

Step-by-step implementation

Implementing HNSW in C++ involves several critical steps. We’ll use the HNSWlib library, which provides a robust and efficient implementation of the HNSW algorithm. Below is a detailed guide to help you through the process.

  1. Include the necessary headers:

#include <hnswlib/hnswlib.h>
#include <iostream>
#include <vector>
#include <random>
#include <queue>  // std::priority_queue, used for search results


  2. Initialize the HNSW index:

// Set parameters
int dim = 128;             // Dimension of the vectors
int num_elements = 10000;  // Number of elements to add

// Initialize the index (M = 16, ef_construction = 200, matching the Python example)
hnswlib::L2Space space(dim);
hnswlib::HierarchicalNSW<float> index(&space, num_elements, 16, 200);


  3. Create random data for demonstration:

std::vector<std::vector<float>> data(num_elements, std::vector<float>(dim));
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<float> dis(0.0f, 1.0f);
for (auto& vec : data) {
    for (auto& val : vec) {
        val = dis(gen);
    }
}


  4. Build the index:

// Add data to the index
for (int i = 0; i < num_elements; i++) {
    index.addPoint(data[i].data(), i);
}


  5. Set the query parameters:

// Set the exploration factor (ef)
index.setEf(50);  // ef should always be greater than k


  6. Query the index:

// Generate random queries
std::vector<std::vector<float>> query_data(100, std::vector<float>(dim));
for (auto& vec : query_data) {
    for (auto& val : vec) {
        val = dis(gen);
    }
}

// Perform the query; searchKnn returns a max-heap, so labels
// pop in order of decreasing distance
for (const auto& query : query_data) {
    auto result = index.searchKnn(query.data(), 10);
    std::cout << "Query results: ";
    while (!result.empty()) {
        std::cout << result.top().second << " ";
        result.pop();
    }
    std::cout << std::endl;
}


Code snippets and explanations

The above code snippets illustrate how to initialize and build an HNSW index using HNSWlib in C++. Let’s break down the key components:

  • Initialization: We start by including the necessary headers and initializing the HNSW index with the desired space (Euclidean distance) and dimensionality.

  • Data Generation: For demonstration purposes, we generate random data points using a uniform distribution.

  • Index Building: The addPoint method adds each data point to the index. This step constructs the hierarchical graph structure.

  • Querying: We set the exploration factor (ef) and perform a k-nearest neighbors query on the index, printing the results for each query.

Testing the Implementation

Test cases and expected results

Testing your HNSW implementation is crucial to ensure it functions correctly. Here are some test cases you can use:

  1. Basic Functionality:

#include <cassert>

assert(!query_data.empty());  // Ensure we have query data
assert(index.getCurrentElementCount() == num_elements);  // Ensure all elements were added to the index


  2. Accuracy Check:

// Compare results with a brute-force search (also requires <unordered_set>)
hnswlib::DISTFUNC<float> dist_func = space.get_dist_func();
void* dist_param = space.get_dist_func_param();  // dimension pointer, in the form the distance function expects
for (const auto& query : query_data) {
    // Keep the 10 smallest distances in a max-heap
    std::priority_queue<std::pair<float, int>> true_results;
    for (int i = 0; i < num_elements; i++) {
        float dist = dist_func(query.data(), data[i].data(), dist_param);
        true_results.emplace(dist, i);
        if (true_results.size() > 10) true_results.pop();
    }
    // Collect the ground-truth labels into a set
    std::unordered_set<int> true_labels;
    while (!true_results.empty()) {
        true_labels.insert(true_results.top().second);
        true_results.pop();
    }
    // Count how many HNSW results appear in the ground truth
    auto result = index.searchKnn(query.data(), 10);
    int match_count = 0;
    while (!result.empty()) {
        if (true_labels.count(static_cast<int>(result.top().second))) match_count++;
        result.pop();
    }
    assert(match_count >= 8);  // Expect at least 80% recall per query
}


  3. Performance Benchmark:

#include <chrono>

auto start = std::chrono::high_resolution_clock::now();
for (const auto& query : query_data) {
    index.searchKnn(query.data(), 10);
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << "Query time for 100 queries: " << elapsed.count() << " seconds" << std::endl;


Debugging common issues

When implementing HNSW, you might encounter some common issues. Here are a few tips to help you debug:

  • Incorrect Dimensions: Ensure that the dimensions of your data and queries match the dimensionality specified during index initialization.

  • Parameter Tuning: Adjust the ef and M parameters to balance between speed and accuracy. Higher values generally improve accuracy but increase computational cost.

  • Memory Management: Be mindful of memory usage, especially with large datasets. Ensure that your system has sufficient resources to handle the data; a rough sizing rule is worked through after this list.
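
For a back-of-the-envelope capacity estimate, the hnswlib documentation suggests roughly (d * 4 + M * 2 * 4) bytes per element for float vectors; treat this as an approximation rather than an exact bound. The quick calculation below (in Python, since it is just arithmetic) applies it to this tutorial's parameters.

# Rough memory estimate for an hnswlib index of float vectors,
# using the rule of thumb ~(d*4 + M*2*4) bytes per element.
dim, M, num_elements = 128, 16, 10_000
bytes_per_element = dim * 4 + M * 2 * 4   # vector data + link lists
total_mb = num_elements * bytes_per_element / 1e6
print(f"~{total_mb:.1f} MB for {num_elements} vectors")  # ~6.4 MB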

By following these steps and testing thoroughly, you can ensure a robust and efficient implementation of the HNSW algorithm in C++. This powerful tool will enable you to perform fast and accurate similarity searches, unlocking new possibilities in your applications.

Performance Considerations

When implementing Hierarchical Navigable Small World (HNSW) in both Python and C++, it’s essential to understand the performance implications of each approach. This section compares the two implementations, focusing on speed and efficiency.

Comparing Python and C++ Implementations

Speed and Efficiency

Both Python and C++ offer robust implementations of HNSW, but they cater to different needs and priorities:

  • Python Implementation: Python is known for its ease of use and rapid development capabilities. The hnswlib Python bindings wrap the same C++ core, so index construction and search run at near-native speed; the remaining overhead sits in the binding layer and in surrounding Python code. This makes Python an excellent choice for prototyping and for most production workloads.

  • C++ Implementation: C++ is a powerhouse when it comes to performance. Using hnswlib directly from C++ eliminates binding overhead and gives fine-grained control over memory and threading, which often yields a better performance/recall ratio. This makes it ideal for latency-critical, large-scale applications where every millisecond counts.

PingCAP’s Advanced Vector Database Features

PingCAP’s TiDB database offers advanced vector database features that seamlessly integrate with the HNSW algorithm, providing efficient and scalable solutions for AI applications. This section explores how TiDB leverages HNSW for vector indexing and semantic searches, and highlights its performance in real-world scenarios.

Integration with HNSW

Efficient Vector Indexing

TiDB’s integration with HNSW enables efficient vector indexing, which is crucial for handling large-scale vector data. The hierarchical structure of HNSW allows TiDB to perform rapid similarity searches by narrowing down the search space at each layer. This results in significant performance improvements, especially when dealing with high-dimensional vectors commonly used in AI and machine learning applications.

  • Scalability: TiDB can handle massive datasets without compromising on speed or accuracy. The HNSW algorithm’s ability to scale with the number of vectors ensures that TiDB remains performant even as data volumes grow.

  • Flexibility: Users can customize the indexing parameters, such as the exploration factor (ef) and the maximum number of connections (M), to balance between search accuracy and computational cost.

Semantic Searches

Semantic searches are essential for applications that require understanding the meaning and context of data, such as natural language processing (NLP) and image recognition. By integrating HNSW, TiDB enhances its capability to perform semantic searches efficiently.

  • Contextual Understanding: Semantic meaning is carried by the vector embeddings themselves; HNSW’s multi-layer graph structure lets TiDB search those embeddings efficiently, so nearest-neighbor results reflect semantic similarity and yield more accurate, meaningful search results.

  • Real-Time Performance: The efficient traversal of the HNSW graph ensures that semantic searches are performed in real-time, making TiDB suitable for applications that demand quick responses, such as recommendation systems and real-time analytics.

Performance in Real-World Applications

Case Studies and Testimonials

Several organizations have successfully implemented TiDB with HNSW for their vector search needs, demonstrating its effectiveness in real-world scenarios.

  • CAPCOM: Leveraged TiDB’s vector search capabilities to enhance their gaming recommendation system, resulting in a 20% increase in user engagement.

  • Bolt: Utilized TiDB for real-time ride matching, significantly reducing the time to find the nearest available driver and improving overall customer satisfaction.

  • ELESTYLE: Implemented TiDB for personalized fashion recommendations, leading to a 15% boost in sales conversion rates.

These case studies highlight the tangible benefits of using TiDB with HNSW for various applications, showcasing its versatility and performance.

Benefits for AI Applications

The integration of HNSW within TiDB provides several advantages for AI applications:

  • Efficient Storage and Retrieval: TiDB’s vector database features ensure that vector data is stored and retrieved efficiently, reducing latency and improving the overall performance of AI models.

  • Cost-Effective Solution: By leveraging open-source technologies, TiDB offers a cost-effective solution for managing large vector datasets, making it accessible to organizations of all sizes.

  • Seamless AI Integration: TiDB’s compatibility with popular AI frameworks allows for seamless integration within existing workflows, enabling organizations to quickly deploy and scale their AI solutions.

In summary, PingCAP’s TiDB database, enhanced with HNSW, provides powerful vector database features that cater to the needs of modern AI applications. Its efficient vector indexing and semantic search capabilities, combined with proven performance in real-world scenarios, make it an invaluable tool for organizations looking to harness the power of AI.


Hierarchical Navigable Small World (HNSW) is a pivotal algorithm for approximate nearest neighbor search, offering unmatched efficiency and scalability. Implementing HNSW in both Python and C++ involves setting up the environment, coding the algorithm, and testing thoroughly to ensure robust performance.

PingCAP’s TiDB database enhances these capabilities with advanced vector indexing and semantic search features, making it an ideal solution for AI applications. The seamless integration of HNSW within TiDB ensures efficient storage, retrieval, and real-time performance, benefiting industries from e-commerce to healthcare.

We encourage you to experiment with HNSW and explore its vast potential. For further learning, consider diving into additional resources and tutorials available online.

See Also

Innovative Features in Web Applications through OpenAI and MySQL Integration

TiDB Vector Search Benchmarking with Llama 3

Enhanced Data Management and Search with Claude AI Integration in TiDB

Transforming MySQL Database Interactions using Text-to-SQL and LLMs

TiDB as SQL Playground: Simplified SQL Code Formatting


Last updated July 15, 2024