
It’s that time of year again for a look back at the database industry. 2024 wasn’t so much about groundbreaking database technologies, but rather the explosive growth of AI-plus-data applications. That said, core database tech hasn’t remained idle either, as cloud infrastructure became the de facto standard for deployment. 

So, without further ado, here are some of my observations from the past year along with where I think the industry is heading in 2025.

1. AI Application Competitive Advantage will Come from the Data and How It’s Retrieved

It’s been two years since ChatGPT stunned the world. However, the developer community now seems divided into two camps on GenAI: 

  1. The devoted believers in the Scaling Law, pushing for ever-stronger models and aiming for AGI.
  2. The pragmatists, who accept current LLM capabilities and are figuring out how to use data and agents to build practical stuff. 

Here’s what’s fascinating: When everyone has access to the same models (whether it’s GPT-4 or open-source alternatives like Llama 3), what really differentiates one AI application from another? It’s the data. Your proprietary data, your business knowledge, and how effectively you can leverage them. This creates genuine competitive advantage.

That’s not surprising, given LLMs’ current limitations:

  • Limited context windows make it impossible to feed entire knowledge bases into LLMs for each query.
  • The attention mechanism inherent in Transformers can lead to hallucinations, requiring additional verification mechanisms.
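Because of those limited context windows, the standard workaround is to split the knowledge base into pieces and feed the model only the relevant ones per query. A minimal sketch of word-based chunking (names are illustrative; a real pipeline would count tokens with the model's own tokenizer):

```python
def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-based chunks that fit a token budget.

    Words stand in for tokens here; a production pipeline would use the
    target model's tokenizer to measure chunk sizes exactly.
    """
    words = text.split()
    step = max_tokens - overlap  # advance less than a full chunk so context overlaps
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + max_tokens]
        chunks.append(" ".join(chunk))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 500-"token" document becomes three overlapping chunks of at most 200.
doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc, max_tokens=200, overlap=20)
print(len(chunks))  # → 3
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides, at the cost of some duplicated storage.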

In the early days, fine-tuning was the go-to solution for domain-specific tasks. But fine-tuning is like tuning a musical instrument – it’s hard to get the exact note right. How much data do you need? How do you ensure precise control over specific responses? Not to mention the expensive hardware and days of training time required, making iteration prohibitively time-consuming.

Fortunately, base models (especially open source ones) have made remarkable progress this year. Models like Llama 3, Qwen, and Mistral have surpassed GPT-3.5 and are approaching GPT-4 in capabilities. With context windows expanding to 32,000 tokens and beyond, RAG has become increasingly compelling.

The Continued Evolution of Retrieval-Augmented Generation (RAG)

RAG itself has evolved significantly – from simple vector search + LLM combinations to sophisticated approaches like GraphRAG and dynamic section retrieval. The core principle remains the same: For any given question, retrieve the most relevant context from your database.

The quality of that retrieved context drives the quality of RAG’s answer. The LLM is mainly just a reading comprehension engine here. The real key is the data quality and retrieval precision. This means that RAG is more of a data processing/database play than an AI one.
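The retrieve-then-read loop above can be sketched in a few lines. The embeddings here are toy hand-made vectors and the final LLM call is left as a prompt string; in practice both come from a model API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy corpus of (text, embedding) pairs; real embeddings come from a model.
corpus = [
    ("TiDB is a distributed SQL database.", [0.9, 0.1, 0.0]),
    ("S3 offers eleven nines of durability.", [0.1, 0.9, 0.1]),
    ("Python is popular for AI applications.", [0.0, 0.2, 0.9]),
]

def retrieve(query_embedding: list[float], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda d: cosine(query_embedding, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query_embedding: list[float], question: str) -> str:
    """Assemble the retrieved context into a prompt; an LLM would answer it."""
    context = "\n".join(retrieve(query_embedding))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(retrieve([0.85, 0.15, 0.05]))  # → ['TiDB is a distributed SQL database.']
```

Everything that determines answer quality here happens before the model is ever called, which is the point: the retrieval step is a database problem.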

I’ve been saying since last year that vector databases are a curious phenomenon. From a database developer’s perspective, vectors are just another index type – what you really need is to retrieve your data based on vector similarity. Traditional databases didn’t have vector indexes because there wasn’t much demand, until AI workloads exploded. Vector search became a crucial requirement, and here came the “vector database” workaround.

The real challenge isn’t vector search – it’s building a robust, unified database system.

Unified data solutions are gaining traction because vector search alone isn’t enough. As RAG practices become more sophisticated, people realize that just retrieving embeddings doesn’t give you good enough recall or accuracy. You need other retrieval methods, too. 
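One way to combine retrieval methods is hybrid ranking: blend vector similarity with a lexical score. The sketch below uses simple keyword overlap as a stand-in for BM25, with an arbitrary blending weight:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the document (stand-in for BM25)."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_rank(query: str, query_emb: list[float], docs, alpha: float = 0.5):
    """Rank docs by a blend of vector and keyword scores; alpha weights vectors."""
    scored = [
        (alpha * cosine(query_emb, emb) + (1 - alpha) * keyword_score(query, text), text)
        for text, emb in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [
    ("durable object storage", [0.1, 0.9]),
    ("distributed sql database", [0.9, 0.1]),
]
print(hybrid_rank("distributed database", [0.2, 0.8], docs))
```

With this toy query embedding, vector similarity alone would rank the wrong document first; the exact-keyword match corrects it, which is exactly why embedding-only recall disappoints in practice.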

2. S3 will Become the New Disk, and Every Modern Database will Need to Be S3 Native

The transformation of data infrastructure isn’t just about AI capabilities – it’s about fundamentally rethinking storage architecture. Since we started building TiDB Cloud Serverless three years ago, Amazon S3’s elegant simplicity and scalability has consistently impressed us. What began as a hypothesis – “S3 is the new disk” – has evolved into an industry consensus. 

S3 is not just an AWS product line, but a key infrastructure component for society as it’s “too big to fail.” Most of the new databases built over the last couple of years have been based on S3: Neon, Rockset, Supabase, Databend, and TiDB Cloud Serverless. S3 is also the go-to storage for various data lake implementations. Many leading infrastructure companies are starting to acquire startups in their respective fields that are building on S3 – Confluent’s acquisition of WarpStream is a good recent example.

Why is S3 becoming the foundation for modern databases? It has four key advantages:

  • True elasticity: Scale from zero to petabytes without infrastructure changes
  • Cost efficiency: Pay only for actual storage used, with automatic tiering
  • Linear throughput scaling: Performance grows with request parallelism, and S3 Express One Zone delivers consistent single-digit millisecond latency
  • 99.999999999% durability: Essentially eliminating data loss concerns

AWS’s 2024 S3 enhancements have made it an even more compelling offering:

  • S3 Metadata: Enhanced querying capabilities directly from object metadata
  • Conditional writes: Enabling atomic operations crucial for database consistency

The second enhancement means S3 now supports safe concurrent writes, an important capability for building databases. There are many more improvements like this. AWS is making it easier to build complex distributed databases. A lot of code in distributed databases deals with the state of the underlying storage, especially in exception cases (like distribution, data replication, and recovery). Using S3 as an always-on, scalable foundation really simplifies the logic of the upper layers.
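Concretely, S3's conditional write is a `PutObject` with an `If-None-Match: *` header (the `IfNoneMatch` parameter in boto3), which fails with a 412 if the key already exists. The in-memory sketch below illustrates the semantics, and why they matter when two database nodes race to create the same manifest (the store class and key names are hypothetical, not any real API):

```python
class FakeObjectStore:
    """In-memory stand-in for S3 that models conditional-write semantics."""

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put_if_absent(self, key: str, body: bytes) -> bool:
        """Mimic PutObject with If-None-Match: '*' — succeed only if key is new."""
        if key in self._objects:
            return False  # real S3 returns 412 Precondition Failed
        self._objects[key] = body
        return True

store = FakeObjectStore()
# Two nodes race to claim the same manifest for epoch 1: exactly one wins,
# so the storage layer itself arbitrates leadership without extra coordination.
first = store.put_if_absent("manifest/epoch-1", b"node-a")
second = store.put_if_absent("manifest/epoch-1", b"node-b")
print(first, second)  # → True False
```

Before this, builders needed an external lock service (or DynamoDB-style conditional writes) next to S3 to get the same guarantee.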

3. Databases will Become “Database Services”, with Distributed Databases as the Foundation

It’s fascinating to observe how databases are transforming from mere storage systems into comprehensive services. This shift isn’t just a technical evolution – it’s fundamentally changing how we think about data infrastructure.

Looking at TiDB Cloud’s journey over the past two years, we’ve seen a tenfold increase in cloud clusters. This growth tells an interesting story: Except for highly-regulated sectors like banking, organizations are increasingly comfortable with cloud-native database services. The question isn’t whether to move to the cloud anymore – it’s how to do it most effectively.

A chart showing TiDB Cloud’s 10x cluster growth since the start of 2023.

A chart showing TiDB Cloud’s 10x cluster growth since the start of 2023.

From a user’s perspective, the appeal is simple: Focus on your application while the platform handles the complexity. But for vendors, building a cloud service brings challenges that traditional database development never had to consider. Take TiDB Cloud’s architecture: We’ve had to tackle everything from billing systems to tenant management, from metadata handling to comprehensive observability. It’s a whole new game.

A detailed overview of the complexity TiDB Cloud’s architecture handles under the hood for users.

The cloud platform model brings distinct advantages. Standardized environments make management more predictable, and automation becomes more reliable. We can build sophisticated systems for fault analysis and handling that would be impossible in traditional on-premises deployments. We’re even exploring how GenAI could enhance these capabilities further.

Multi-Cloud Deployments and Managing Data Complexity at Scale

With that said, multi-cloud deployments present unique challenges. While we’ve worked hard to minimize platform dependencies, the differences between cloud providers run deep. Even seemingly similar capabilities – like object storage or distributed block devices – perform differently across platforms. Security models vary significantly too. But this is precisely where independent database vendors can demonstrate their value.

The most interesting challenges we’re seeing aren’t just about handling more data – they’re about managing complexity at scale. When customers ask if we can support millions of tables and databases, it’s not just about storage capacity. It’s about maintaining performance, ensuring observability, and providing efficient operations at that scale.

The future belongs to platforms that can deliver not just storage, but comprehensive data services. As one cloud executive recently noted, “The winners will be those who can manage complexity at scale while maintaining simplicity for users.”

4. SQL will Remain King of Data Processing, with Python Becoming Queen

This shift toward simplicity isn’t about developers becoming less capable; it’s about focusing on what truly matters. Python, which I once viewed with some skepticism, has proven remarkably effective for building AI applications. When your real bottleneck is AI model performance, Python’s supposed inefficiencies become irrelevant. What matters more is its rich ecosystem, AI framework compatibility, and developer productivity benefits.

Modern Python has evolved thoughtfully. Type hinting, though optional, provides just enough structure to keep medium-sized applications maintainable. Package management has matured significantly, addressing many of the traditional pain points in Python development.

Perhaps the most profound change is happening in how we interact with data. The days of hand-crafting SQL queries and managing database connections are giving way to higher-level abstractions. Frameworks like Pydantic are becoming the standard, while ORMs handle the database complexity behind the scenes.
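The same pattern can be shown with stdlib dataclasses standing in for Pydantic models (field names and the row-mapping helper are illustrative, not from any particular ORM):

```python
from dataclasses import dataclass, fields

@dataclass
class User:
    """A typed record; Pydantic would add validation and (de)serialization."""
    id: int
    name: str
    email: str

def from_row(row: dict) -> User:
    """Map a raw database row to the typed model, ignoring unknown columns."""
    names = {f.name for f in fields(User)}
    return User(**{k: v for k, v in row.items() if k in names})

# A row as a driver might return it, with an extra column the model ignores.
row = {"id": 7, "name": "Ada", "email": "ada@example.com", "created_at": "2024-01-01"}
user = from_row(row)
print(user.name)  # → Ada
```

The application code downstream works with `User`, never with cursors or column indexes, which is the abstraction shift the paragraph above describes.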

This evolution makes perfect sense for AI applications. Developers today care about data utility, not data storage. They want to focus on building compelling user experiences, implementing business logic, and optimizing AI model interactions, rather than wrestling with database internals.

The industry is responding with increasingly sophisticated data APIs. We’re seeing specialized endpoints emerge across different domains. Legal document retrieval comes with built-in compliance optimization. Medical data APIs incorporate privacy controls by default. Financial services endpoints deliver real-time analysis alongside raw data.

These aren’t just simple CRUD operations. Modern data APIs are becoming increasingly intelligent, offering context-aware data retrieval, automated optimization, and even integrated inference results. The complexity is abstracted away, but the power remains accessible when needed.

This trend toward simplification and abstraction isn’t about dumbing things down – it’s about smart defaults and purposeful constraints that help developers build better applications faster. The future belongs to platforms that can hide complexity while maintaining flexibility where it matters.

Closing Thoughts

As I wrap up this reflection on the evolution of developer experience and data infrastructure, I can’t help but feel a sense of perspective. Writing this in 2025, as we mark TiDB’s tenth anniversary, I’m struck by how far we’ve come – not just as a company, but as an industry.

In the software infrastructure business, we’ve come to understand that this is a marathon, not a sprint. The current landscape often pushes us toward quick wins and rapid results, but the real innovations – the ones that truly transform how developers work – require a longer view. Detours and mistakes aren’t just inevitable; they’re essential parts of the journey. That’s the responsibility we carry as we build for the future.

Looking ahead, I maintain a deep conviction: Simplicity will triumph over complexity, innovative approaches will supersede the outdated, and elegant solutions will prevail over the convoluted. In this era of AI and cloud computing, this principle feels more relevant than ever.

If you want to explore TiDB on your own terms, sign up for a free trial of TiDB Cloud Serverless and see how easy it is to create and connect to an auto-scaling database in minutes.

