Authors: Chenhao Huang (Developer Relations Specialist), Phoebe He (Technical Marketing Manager)
Editor: Tom Dewan
Real-time insight is vitally important for companies in many industries, and especially for those in logistics. The reasons are two-fold: to increase customer satisfaction and to reduce operating costs.
On-time performance (OTP) is one of the most important key performance indicators (KPIs) that logistics companies pursue. OTP refers to the percentage of packages delivered to customers on time. A higher percentage usually means better performance, and thus higher customer satisfaction.
The overall cost for logistics companies includes transportation costs, inventory carrying costs, and administration costs. Big data and a secondary distribution logistics system with true real-time dashboards and instant scheduling play important roles, because they reduce all three costs.
Bottlenecks in the logistics industry can come from many places, such as a highway blocked by a traffic accident, delayed flights, or overloaded couriers. In each case, the company must respond swiftly by, for example, changing routes or bringing more couriers online. Real-time insights can alert the company as soon as the bottleneck occurs, so it can take actions like last-minute rerouting.
A happy problem: real-time insights at scale
Providing real-time insights is difficult, mostly due to huge dataset sizes and the limited processing capability of traditional single-node databases. According to Pitney Bowes Inc.'s Parcel Shipping Index featuring 2020 data from 13 major markets around the world:
- In 2020, parcel volume reached 131.2 billion, up 27% from 103.2 billion. Total US volume reached 20 billion parcels.
- 4,160 parcels were shipped per second, compared to 3,248 in 2019.
- Parcel revenue reached $429.5 billion, up 22% from 2019. The top three markets by revenue (the US at $171.4 billion, China at $127.5 billion, and Japan at $33.3 billion) represented 77% of global parcel revenue in 2020.
- China remains the largest market in terms of parcel volume, reaching 83.4 billion, up 31% from 63.5 billion. The US is second at 20.2 billion, followed by Japan at 9.1 billion.
- The US remained the market with the highest carrier revenue, reaching $171.4 billion, an increase of 29% year over year.
Traditionally, a single-node database is designed to handle data volumes of up to a few hundred gigabytes. To cope with larger amounts of data, companies commonly turn to sharding, which splits the dataset across multiple tables or even multiple databases. Sharding partially fulfills daily operating requirements; however, it makes real-time analytics almost impossible. The sharded data, stored in row format, must go through extract, transform, load (ETL) processes before it can be loaded into a data warehouse and analyzed. To avoid impacting foreground Online Transactional Processing (OLTP) requests, the ETL is usually done at midnight. Even worse, some companies can only perform ETL on weekends.
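To make this concrete, here is a minimal sketch of what a sharded layout might look like. The table names and the 64-way split are hypothetical, not taken from any specific system; the point is that even a simple per-state count has to fan out across every shard and be merged afterwards:

-- Hypothetical sharded layout: one logical "packages" table split
-- into 64 physical tables by a hash of package_id:
-- packages_00, packages_01, ..., packages_63

-- A simple analytical question must be asked shard by shard:
SELECT start_state, COUNT(*) FROM packages_00 GROUP BY start_state;
SELECT start_state, COUNT(*) FROM packages_01 GROUP BY start_state;
-- ...repeated for all 64 shards, with the partial counts summed up
-- in application code or in a nightly ETL job.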
Without fresh insights, the management team or the resource allocation team are "flying blind." They cannot respond promptly when unexpected situations occur, such as traffic jams or order surges. Couriers have to deliver packages using inefficient routes, and they must work overtime to get packages delivered. Customers are also unhappy: without the effective planning that real-time analytics enables, they may suffer from package delays or losses.
Enabling real-time insights for parcel logistics with TiDB Cloud
Business requirements from the logistics industry
Customers need a personal page that shows their order history and tracks their parcels. In addition, the software system needs to answer questions like: Where are a particular customer's packages? How many packages does a customer receive each day?
Sample dashboard of an end user (built using Metabase)
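As a sketch of the first question, the following query locates all of one customer's undelivered packages. It assumes the wide packages table and the stage codes introduced later in this post, and the receiver value 'Alice' is purely illustrative:

-- Where are Alice's packages right now?
-- Lists each undelivered package together with its current stage.
SELECT package_id, transaction_kind, update_time
FROM packages
WHERE receiver = 'Alice'
  AND transaction_kind != '4_pkg_out';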
Logistics companies need to compute and monitor critical business metrics, and receive alerts when a metric exceeds a certain threshold so the team can take action. For example, in the following dashboard, the number of packages in processing for courier_4 is above the threshold. The dashboard highlights the metric in color to alert the ops team, so they can assign additional couriers to share the workload. Another metric could track the number of packages in a particular state. For example, if Texas has a large number of packages in process, distribution centers in Texas can be prepared in advance.
A logistics company’s monitoring dashboard (built using Metabase)
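A metric such as "packages in processing per courier" could be backed by a query along these lines. This is a sketch against the same wide table, again assuming the stage codes defined just below; the threshold check itself would live in the BI or alerting tool:

-- Packages currently in processing, grouped by courier.
-- The dashboard can highlight a courier whose count passes the threshold.
SELECT courier, COUNT(package_id) AS in_processing
FROM packages
WHERE transaction_kind IN ('1_pkg_in', '2_ori', '3_des')
GROUP BY courier
ORDER BY in_processing DESC;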
To simplify, we divide the life cycle of a package delivery into 4 stages:
- Stage 1 (“1_pkg_in”): a package is received by the logistics company.
- Stage 2 (“2_ori”): a package arrives at the processing center in the origin state.
- Stage 3 (“3_des”): a package reaches the processing center in the destination state.
- Stage 4 (“4_pkg_out”): a package is delivered to the customer.
As shown below, we use a single wide table for data collection. It includes fields for the package_id, start_time, update_time, sender, courier, receiver, start_state, destination_state, and transaction_kind. The transaction_kind field describes which stage the package is in.
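A minimal schema for this table might look like the following. The column types are assumptions; only the field names come from the description above:

-- Hypothetical schema for the wide packages table.
CREATE TABLE packages (
    package_id        BIGINT PRIMARY KEY,
    start_time        DATETIME,
    update_time       DATETIME,
    sender            VARCHAR(64),
    courier           VARCHAR(64),
    receiver          VARCHAR(64),
    start_state       VARCHAR(32),
    destination_state VARCHAR(32),
    transaction_kind  VARCHAR(16)  -- '1_pkg_in', '2_ori', '3_des', or '4_pkg_out'
);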
All of the requirements mentioned above, from both the end user's and the logistics company's perspective, involve large aggregations and sometimes full table scans. For example, to find the number of packages in each state, you have to group the records by state and count the records in each group, possibly filtering out the packages that are already delivered. To find the number of new packages per day, you have to group all the records by starting date and then count them.
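Here is roughly what those two queries could look like; a sketch, assuming the schema above:

-- Undelivered packages per destination state (scan + group + count).
SELECT destination_state, COUNT(package_id) AS undelivered
FROM packages
WHERE transaction_kind != '4_pkg_out'
GROUP BY destination_state;

-- New packages per day, grouped by the date part of start_time.
SELECT DATE(start_time) AS start_date, COUNT(package_id) AS new_packages
FROM packages
GROUP BY DATE(start_time);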
As you can see, these analytical queries consume huge amounts of resources and are very likely to impact OLTP queries. This is not what most logistics companies want: they expect real-time insights into various situations without degrading the customer experience.
Architecture of the solution
In this blog post, we present a reference architecture that provides real-time logistics insights with TiDB. The overall architecture is shown below.
High-level design: all parties of interest talk to a single database
For the database, we use TiDB Cloud with TiFlash enabled. Everyone talks to a single database: TiDB. End users can submit orders or track their parcels from web pages or mobile apps, and these requests are sent to TiDB. Couriers deliver packages and update the status of the records in TiDB via the couriers' portal. Managers on the ops team get real-time insights from the business intelligence (BI) dashboard, which fetches up-to-date data from TiDB.
The system architecture using TiDB is straightforward. Without a Hybrid Transactional and Analytical Processing (HTAP) database, the logistics company would need at least one operational database and one data warehouse, with an ETL system in between; that design would be much more complicated.
How does TiDB achieve real-time query performance?
Under the hood of TiDB Cloud, there are two storage engines: TiKV (row storage) and TiFlash (column storage). TiKV serves transactional processing, while TiFlash is dedicated to analytical processing. TiKV divides data into Regions. To achieve high availability (HA), each Region is replicated three times, and the replicas form a Raft group. TiFlash joins this Raft group as a learner, which has two major benefits:
- The data in TiFlash is always kept up-to-date, which enables real-time analytics.
- TiFlash and TiKV have separate hardware resources, so that the analytical queries never impact the performance of the transactional requests.
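Enabling a TiFlash replica for a table takes a single DDL statement; the syntax below is the documented TiDB command, and the replica count of 1 is just an example:

-- Create a TiFlash (columnar) replica of the packages table.
ALTER TABLE packages SET TIFLASH REPLICA 1;

-- Check replication progress (AVAILABLE = 1 means the replica is ready).
SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'packages';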
When a query comes in, whether from an end user, a courier, or a manager on the ops team, the Smart Selection algorithm automatically decides where to send the request: to TiKV (row store) or to TiFlash (column store). OLTP queries, such as the one shown below, are sent to TiKV. In this query, a courier delivers the package (package_id = 12345) to the end user, and thus updates the transaction_kind to 4_pkg_out.
UPDATE packages SET transaction_kind="4_pkg_out" WHERE package_id=12345;
The analytical queries from the ops team are usually sent to the column store (TiFlash), because those queries usually aggregate and/or group by a field (column). For instance, the following query counts the number of packages in the processing center of each state. A package appears in a state's processing center in two situations:
- The package has been sent to the processing center of the origin state, but has not been sent out yet (transaction_kind = "1_pkg_in").
- The package has arrived at the processing center of the destination state, but has not yet been sent to the end user (transaction_kind = "3_des").
This query contains both aggregate and GROUP BY operations, so its performance is much better when it is sent to TiFlash.
SELECT start_state, COUNT(package_id)
FROM packages
WHERE transaction_kind = "1_pkg_in"
GROUP BY start_state
UNION
SELECT destination_state, COUNT(package_id)
FROM packages
WHERE transaction_kind = "3_des"
GROUP BY destination_state;
The following table shows part of the execution plan for the query above. When TiFlash is enabled, the heavy operations (scanning, filtering, and aggregation) are pushed down to TiFlash.
ID | Task |
---|---|
Projection_5 | root |
└─TableReader_38 | root |
└─ExchangeSender_37 | batchCop[tiflash] |
└─Projection_33 | batchCop[tiflash] |
└─HashAgg_34 | batchCop[tiflash] |
└─ExchangeReceiver_36 | batchCop[tiflash] |
└─ExchangeSender_35 | batchCop[tiflash] |
└─HashAgg_10 | batchCop[tiflash] |
└─Selection_32 | batchCop[tiflash] |
└─TableFullScan_31 | batchCop[tiflash] |
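To reproduce a plan like this, prefix the query with EXPLAIN, which is standard TiDB syntax (the operator IDs will vary between runs):

EXPLAIN
SELECT start_state, COUNT(package_id)
FROM packages
WHERE transaction_kind = "1_pkg_in"
GROUP BY start_state
UNION
SELECT destination_state, COUNT(package_id)
FROM packages
WHERE transaction_kind = "3_des"
GROUP BY destination_state;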
The following example shows why the column store (TiFlash) serves OLAP queries better. Suppose we have the table shown below, and we would like to find out the number of packages sent to California (SELECT COUNT(*) FROM packages WHERE destination = "California").
ID | Start_time | Sender | Destination |
---|---|---|---|
01 | 2022-Jan-3 | Alice | California |
05 | 2022-Jan-5 | Bob | Washington |
In the row store, the data is organized row-by-row on the storage system.
01 | 2022-Jan-3 | Alice | California | 05 | 2022-Jan-5 | Bob | Washington |
In the column store, however, values from the same field are stored together on the storage system.
01 | 05 | 2022-Jan-3 | 2022-Jan-5 | Alice | Bob | California | Washington |
When a storage engine executes a query, it has to fetch data from the lower tiers of the storage hierarchy to the upper tiers; for example, from the hard disk to memory, and then to the cache. Nowadays, I/O is the bottleneck. If an analytical query is sent to a row store, the whole dataset might need to be fetched. In a column store, by contrast, only the relevant columns need to be fetched; in our example, only the rightmost part of the layout above, the Destination values. This saves a huge amount of time.
Conclusion
To sum up, with TiDB Cloud, the ops team in a logistics company can understand their business moment by moment and take early action when needed. The team can also give couriers clear instructions on where to go, saving fuel and, more importantly, ensuring that every courier gets a lunch break.