The Rakuten Journey: Overcoming Limitations with Distributed SQL

Achieving a balance between scalability and operational efficiency has become a defining challenge for data-driven businesses. At HTAP Summit 2024, Alex Bai, Engineering Manager at Rakuten, and Tim Liu, Engineering Lead, shared their journey of overcoming the limitations of traditional databases like Apache Cassandra and MySQL to meet the demands of high-traffic platforms and critical APIs.

This blog recaps their session, diving into the real-world obstacles they encountered, the innovative solutions they deployed, and the technical benchmarks that validated distributed SQL as a robust, future-proof database solution for data-intensive applications.

The Rakuten Loyalty Ecosystem

Founded in 1997 with just six employees, Rakuten has grown into a global powerhouse, offering over 70 services across 30 countries and serving 1.8 billion members worldwide. Its U.S. offerings include:

Rakuten Rewards: An affiliate marketing service.
Rakuten Viki: An OTT streaming platform.
Rakuten Kobo: An e-book and audiobook service.

At the heart of these services is Rakuten Points, a unifying loyalty ecosystem that fosters customer engagement across its diverse portfolio.

The Evolution of Rakuten Points

Since its launch in 2002, Rakuten Points has become Japan’s premier loyalty program, achieving:

Cumulative Points Issued: 4 trillion points by 2023, projected to reach 5 trillion in early 2024.
Transaction Volume: In 2023 alone, the platform processed 650 billion points via 8 billion transaction requests and 50 billion annual API reference calls.

At this volume, Rakuten Points is a critical system for the group: it has to process transactions reliably while keeping users engaged across services.

The Challenges with Legacy Systems

As the backbone of a vast digital ecosystem, Rakuten Points needed a robust and scalable database solution to meet explosive growth and engagement demands. By 2020, Rakuten’s legacy infrastructure faced mounting pressure with challenges like:

Scalability Constraints: Single-DC licensed databases struggled to handle increasing traffic, data volume, and feature complexity.
Operational Inefficiencies: Tightly coupled schemas limited flexibility, while vendor-dependent upgrades delayed progress.
High Costs: Proprietary database licensing and maintenance added significant overhead.
Downtime Risks: Updates often required service interruptions, which were unacceptable for a platform integral to financial transactions.

Transitioning to a Microservices Architecture

To overcome these limitations, Rakuten transitioned to a microservices-based architecture in 2020. The new system featured:

Core Functions Layer: Powered by Apache Cassandra, managing real-time point granting and redemption.
Aggregation Layer: Designed for advanced querying and scalability, supporting user rank calculations, point history, and accounting summaries.

This separation ensured core functionalities remained unaffected by downstream aggregation failures, improving overall resilience and performance.
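To make the separation concrete, here is a minimal sketch of the idea. It is not Rakuten's implementation: the class, the in-memory buffer, and the `forward` callback are all hypothetical stand-ins. The point is only that the core write completes first and a downstream failure leaves events queued for a later retry instead of failing the request.

```python
from collections import deque


class AggregationUnavailable(Exception):
    """Raised by the (hypothetical) aggregation layer when it cannot accept events."""


class PointsCore:
    """Toy stand-in for the core layer: update the balance first, forward events second."""

    def __init__(self, forward):
        self.balances = {}       # user_id -> current point balance
        self.pending = deque()   # events the aggregation layer has not accepted yet
        self._forward = forward  # hypothetical callable that delivers one event downstream

    def grant(self, user_id, points):
        # The core write is applied unconditionally; it never waits on aggregation.
        self.balances[user_id] = self.balances.get(user_id, 0) + points
        self.pending.append({"user_id": user_id, "points": points})
        self.flush()
        return self.balances[user_id]

    def flush(self):
        # Deliver buffered events in order; stop at the first failure and retry later.
        while self.pending:
            try:
                self._forward(self.pending[0])
            except AggregationUnavailable:
                return
            self.pending.popleft()
```

If `forward` raises `AggregationUnavailable`, `grant` still returns the updated balance, and a later call to `flush` drains the backlog once the downstream layer recovers. A production system would use a durable queue or change stream rather than an in-process buffer.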

Rakuten Senior Manager Mundhra Rohit delivering his session at HTAP Summit 2024.

Rakuten’s engineering team explored various solutions to support the aggregation layer’s demanding requirements:

Apache Cassandra + Apache Spark: Reliable but unsuitable for real-time APIs due to latency.
MySQL with Sharding: Developer-friendly but challenged by scaling massive datasets and ensuring high availability.
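One reason application-level sharding gets harder as data grows can be shown with a small sketch. The session did not describe which sharding scheme Rakuten evaluated, so the modulo routing below is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
def shard_for(user_id, shard_count):
    """Route a numeric user ID to a shard with simple modulo hashing (illustrative only)."""
    return user_id % shard_count


def moved_fraction(user_ids, old_count, new_count):
    """Fraction of users whose shard changes when the shard count changes."""
    moved = sum(
        1 for uid in user_ids
        if shard_for(uid, old_count) != shard_for(uid, new_count)
    )
    return moved / len(user_ids)


if __name__ == "__main__":
    # Growing from 4 to 5 shards relocates most users under modulo routing.
    print(moved_fraction(range(1000), 4, 5))  # prints 0.8
```

With this naive scheme, adding a single shard forces most rows to move, which is the kind of operational work a distributed SQL database handles internally.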

Why TiDB?

TiDB, an open-source distributed SQL database, emerged as the ideal choice for its combination of RDBMS strengths (ACID compliance, SQL support) and NoSQL scalability. Key benefits included:

Real-Time Querying: Crucial for user activity history and data aggregation.
Horizontal Scalability: Seamlessly accommodates growing data volumes.
Operational Simplicity: Reduces maintenance complexity.

Rakuten implemented a two-layer architecture:

Core System: Cassandra handles real-time transactions.
Aggregation System: TiDB powers flexible querying for user history, ranks, and accounting.

Figure 1: A diagram depicting Rakuten Points’ new two-layer architecture

This design ensured uninterrupted core services during high-traffic events while enabling advanced querying capabilities through TiDB.

Validating TiDB’s Performance: Rakuten Group’s POC

To evaluate TiDB’s capabilities under high-demand scenarios, Rakuten conducted a rigorous Proof of Concept (POC). The testing environment closely simulated their production ecosystem, ensuring reliable and actionable results.

The POC focused on two key APIs: the Point History API and the Statistics API.

Point History API Test

This test measured TiDB’s performance in handling high-frequency writes and concurrent reads. The setup involved continually synchronizing data between Rakuten’s Cassandra cluster and TiDB through batch processing.
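The session did not detail how the batch synchronization was built, so the following is only a sketch of the general pattern under stated assumptions: records are read from a source iterable, grouped into fixed-size batches, and handed to a writer. The `write_batch` callable and the default batch size are hypothetical; in a real pipeline the writer would issue a multi-row insert against TiDB over its MySQL-compatible protocol.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def sync_batches(source_records, write_batch, batch_size=1000):
    """Copy records downstream in batches; return (batches_written, records_written)."""
    batches = 0
    total = 0
    for batch in batched(source_records, batch_size):
        write_batch(batch)  # hypothetical writer, e.g. one multi-row INSERT per batch
        batches += 1
        total += len(batch)
    return batches, total
```

Batching keeps the number of round trips low, which matters when the target is a sustained write rate measured in tens of thousands of rows per second.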

Simulated Scenarios:

Case 1: Writing 15 million records every 10 minutes, equivalent to 25,000 writes per second.
Case 2: Performing 5,000 queries per second (QPS) simultaneously with the writes to simulate real-world workloads.
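The write target is stated both as a batch size per window and as a per-second rate. A small helper, offered here as an illustrative sketch and not as something from the session, converts between the two so the figures can be cross-checked:

```python
WINDOW_SECONDS = 10 * 60  # the 10-minute batch window from Case 1


def writes_per_second(records, window_seconds=WINDOW_SECONDS):
    """Average write rate needed to land `records` rows within one window."""
    return records / window_seconds


def records_per_window(rate_per_second, window_seconds=WINDOW_SECONDS):
    """Rows written in one window at a sustained per-second rate."""
    return rate_per_second * window_seconds


if __name__ == "__main__":
    # Rows per 10-minute window at the 25,000 writes/s target.
    print(records_per_window(25_000))
```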

Results:

Write Throughput: Achieved 25,000 writes per second with only a 10% drop in throughput when adding 5,000 QPS reads.
Response Times: Maintained an average response time of 17 ms for the reference API.
Resource Utilization: CPU and memory usage remained efficient, ensuring stability under the combined load.

Statistics API Test

The Statistics API is critical for providing users with historical and aggregated data insights, such as point distribution and activity summaries. The POC tested its ability to handle large-scale queries and heavy read loads.

Simulated Scenarios:

Dataset size: Over 100 million records in a single table.
User patterns: Simulated different user activity levels, from high-activity users with 15,000 records to low-activity users with 500 records.
Query load: 5,000 QPS across user groups, querying six months’ historical data.
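To give a feel for the query shape in this scenario, the sketch below runs a per-user summary over a six-month window. Everything here is an assumption for illustration: the table and column names are hypothetical, the six-month window is approximated as 183 days, and Python's built-in `sqlite3` stands in for TiDB so the example runs anywhere. The actual Statistics API queries were not shown in the session.

```python
import sqlite3
from datetime import date, timedelta


def make_db():
    """Create an in-memory stand-in for a (hypothetical) point history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE point_history (
               user_id    TEXT    NOT NULL,
               event_date TEXT    NOT NULL,  -- ISO 8601 date string
               points     INTEGER NOT NULL
           )"""
    )
    # A composite index lets the per-user, date-bounded lookup avoid a full scan.
    conn.execute("CREATE INDEX idx_user_date ON point_history (user_id, event_date)")
    return conn


def six_month_summary(conn, user_id, today):
    """Count events and sum points for one user over roughly the last six months."""
    cutoff = (today - timedelta(days=183)).isoformat()
    events, points = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(points), 0) "
        "FROM point_history WHERE user_id = ? AND event_date >= ?",
        (user_id, cutoff),
    ).fetchone()
    return {"events": events, "points": points}
```

The cost of such a query grows with the number of rows per user in the window, which is why the test mixed high-activity and low-activity users.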

Results:

Query Performance:
- Response Times: Delivered responses within 2 ms (80th percentile) and 30 ms (99th percentile).
- Steady Operation: TiDB cluster handled 5,000 QPS efficiently with only 17% CPU and memory utilization.
Potential for Further Optimization: TiFlash’s columnar storage showed potential for improving performance in similar workloads, and its evaluation is planned for future use.
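For readers less familiar with percentile response times, the sketch below shows one common way to derive them from raw latency samples, the nearest-rank method. This is an assumption for illustration only; the session did not say which method or tooling produced the reported percentiles.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [5, 1, 3, 2, 4]  # made-up sample data, not POC results
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # prints 3 5
```

A high percentile such as the 99th captures the slow tail that an average hides, which is why it is usually reported alongside a lower one.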

Operational Efficiency and Monitoring

Rakuten leveraged TiDB’s built-in tools to streamline database operations:

Dashboards: Simplified slow query analysis for faster issue resolution.
Prometheus and Grafana Integration: Enabled real-time monitoring and alerting, ensuring system stability during high-traffic periods.

The POC demonstrated TiDB’s ability to handle high-write and high-read workloads simultaneously with minimal performance degradation. Whether it was managing massive historical data queries or statistical aggregation, TiDB consistently exceeded expectations.

These results validated its suitability as the database behind Rakuten’s new aggregation-layer APIs and reinforced its role in Rakuten’s journey to modernize its data infrastructure.

Conclusion: Rakuten’s Scalability and Innovation in Action

Rakuten’s adoption of TiDB showcases how embracing modern, distributed solutions can transform data architecture to meet the demands of a fast-evolving digital ecosystem. With enhanced scalability, performance, and operational efficiency, TiDB has empowered Rakuten to deliver seamless user experiences and maintain system resilience under high demand.

Curious to dive deeper into the strategies and insights behind this transformation? Watch the full session from HTAP Summit 2024 to explore Rakuten’s journey firsthand and discover actionable takeaways for modernizing your own data infrastructure.
