The Data Migration tool uses a sharding DDL lock to ensure operations are applied in the correct order. This locking mechanism works automatically, but in some abnormal conditions you might need to perform manual operations such as force-releasing the lock.
This document shows how to troubleshoot sharding DDL locks in different abnormal conditions.
The possible causes of an abnormal condition include:
Do not use break-ddl-lock unless you fully understand the possible impacts of this command and can accept them.
Before the DM-master tries to automatically unlock the sharding DDL lock, all the DM-workers need to receive the sharding DDL event. If the sharding DDL operation is already being replicated, and some DM-workers have gone offline and will not be restarted, then the sharding DDL lock cannot be automatically replicated and unlocked, because not all the DM-workers can receive the DDL event.
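The rule above can be illustrated with a minimal sketch. This is a toy model, not DM's actual implementation: the class name, field names, and worker names are all hypothetical, and the only behavior it encodes is that the lock can be resolved automatically once every DM-worker in the task has reported the DDL event.

```python
from dataclasses import dataclass, field


@dataclass
class ShardingDDLLock:
    """Toy model of a sharding DDL lock kept by DM-master (hypothetical names)."""

    owner: str            # the DM-worker that executes the DDL statement
    workers: frozenset    # every DM-worker listed in the task configuration
    ready: set = field(default_factory=set)  # workers that have reported the DDL event

    def report(self, worker: str) -> None:
        """Record that a DM-worker has received the sharding DDL event."""
        if worker not in self.workers:
            raise ValueError(f"{worker} is not part of this task")
        self.ready.add(worker)

    def pending(self) -> set:
        """Workers that have not reported the DDL event yet."""
        return set(self.workers) - self.ready

    def can_auto_unlock(self) -> bool:
        """The lock resolves automatically only when no worker is pending."""
        return not self.pending()


lock = ShardingDDLLock(
    owner="worker-1",
    workers=frozenset({"worker-1", "worker-2", "worker-3"}),
)
lock.report("worker-1")
lock.report("worker-2")
# worker-3 went offline before receiving the DDL event, so the lock stays pending.
print(lock.can_auto_unlock(), sorted(lock.pending()))
```

In this model, an offline worker that never reports keeps `pending()` non-empty indefinitely, which is why manual intervention is needed.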
If you need to take some DM-workers offline while no sharding DDL statement is being replicated, a better solution is to use stop-task to stop the running task first, then take the DM-workers offline, and finally use start-task with a new task configuration that does not contain the offline DM-workers to restart the task.
If the owner goes offline after it has finished executing the DDL statement but before the other DM-workers have skipped this DDL statement, see Condition two: a DM-worker restarts, for the solution.
Run show-ddl-locks to obtain the information of the sharding DDL lock that is currently pending replication.
Run the unlock-ddl-lock command to specify the information of the lock to be unlocked manually.
Use the --owner parameter to specify another DM-worker as the new owner to execute the DDL statement.
Run show-ddl-locks to check whether this lock has been successfully unlocked.
After you have manually unlocked the lock, the lock still might not be replicated automatically when the next sharding DDL event is received, because the offline DM-workers are still included in the task configuration information.
Therefore, after you have manually unlocked the lock, you need to use start-task with the updated task configuration that does not include the offline DM-workers to restart the task.
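As a rough illustration of preparing the updated task configuration, the sketch below removes the offline instances from a configuration that has already been loaded into a dictionary. It assumes the upstream instances are listed under a `mysql-instances` key and identified by `source-id`; check these key names against the task configuration file of your DM version. The function name and sample values are hypothetical.

```python
def drop_offline_sources(task_config: dict, offline_source_ids: set) -> dict:
    """Return a copy of the task configuration without the offline instances.

    Assumes instances live under "mysql-instances" and carry a "source-id".
    """
    updated = dict(task_config)  # shallow copy; the input is left unchanged
    updated["mysql-instances"] = [
        inst
        for inst in task_config.get("mysql-instances", [])
        if inst.get("source-id") not in offline_source_ids
    ]
    return updated


# Hypothetical task configuration with three upstream instances.
config = {
    "name": "test",
    "mysql-instances": [
        {"source-id": "mysql-replica-01"},
        {"source-id": "mysql-replica-02"},
        {"source-id": "mysql-replica-03"},
    ],
}
new_config = drop_offline_sources(config, {"mysql-replica-03"})
print([inst["source-id"] for inst in new_config["mysql-instances"]])
```

The resulting configuration would then be written back to a file and passed to start-task.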
If the DM-workers that went offline come back online after you run unlock-ddl-lock, the following happens:
These DM-workers replicate the unlocked DDL operation again. (The other DM-workers that were not offline have already replicated the DDL statement.)
The DDL operation of these DM-workers tries to match the DDL statements that the other DM-workers replicate subsequently.
A match error might occur when the sharding DDL statements of different DM-workers are replicated.
Currently, the DDL unlocking process is not atomic: the DM-master schedules multiple DM-workers to execute or skip the sharding DDL statement and to update the checkpoint. Therefore, after the owner finishes executing the DDL statement, a non-owner might restart before it skips this DDL statement and updates the checkpoint. At this point, the lock information on the DM-master has been removed, but the restarted DM-worker has failed to skip this DDL statement and update the checkpoint.
After the DM-worker restarts and runs start-task, it retries replicating the sharding DDL statement. But because the other DM-workers have finished replicating this DDL statement, the restarted DM-worker can neither replicate nor skip it.
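The failure window described above can be sketched as a small simulation. It is a toy model under stated assumptions, not DM's scheduling code: the function name, the `crash_before_skip` parameter, and the worker names are hypothetical, and a checkpoint is reduced to a boolean meaning "this worker has moved past the DDL statement".

```python
def unlock(workers, owner, crash_before_skip=None):
    """Toy, non-atomic unlock: the owner executes, then the others skip one by one.

    Returns (lock_on_master, checkpoints). A checkpoint value of True means
    the worker has moved past the DDL statement.
    """
    checkpoints = {w: False for w in workers}
    checkpoints[owner] = True  # the owner executes the DDL statement downstream
    for w in workers:
        if w == owner:
            continue
        if w == crash_before_skip:
            continue  # this worker restarted before it could skip the statement
        checkpoints[w] = True  # a non-owner skips the statement and saves its checkpoint
    lock_on_master = False  # DM-master removes the lock information regardless
    return lock_on_master, checkpoints


lock_on_master, checkpoints = unlock(
    ["worker-1", "worker-2", "worker-3"],
    owner="worker-1",
    crash_before_skip="worker-2",
)
# Workers that still need to skip the statement but have no lock left to join.
stuck = [w for w, done in checkpoints.items() if not done and not lock_on_master]
print(stuck)
```

In this model, `worker-2` is left behind with no lock on the master side, which corresponds to the state that has to be resolved manually.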
Run query-status to check the information of the sharding DDL statement on which the restarted DM-worker is currently blocked.
Run break-ddl-lock to specify the DM-worker that is to break the lock forcefully.
Use skip to skip the sharding DDL statement.
Run query-status to check whether the lock has been successfully broken.
No bad impact. After you have manually broken the lock, the subsequent sharding DDL statements can be automatically replicated normally.
After a DM-worker sends the sharding DDL information to DM-master, this DM-worker blocks and waits for the message from DM-master before it decides whether to execute or skip this DDL statement.
Because the state of DM-master is not persistent, the lock information that a DM-worker sends to DM-master will be lost if DM-master restarts.
Therefore, DM-master cannot schedule the DM-worker to execute or skip the DDL statement after DM-master restarts due to lock information loss.
Run show-ddl-locks to verify whether the sharding DDL lock information is lost.
Run query-status to verify whether the DM-worker is blocked because it is waiting for replication of the sharding DDL lock.
Run pause-task to pause the blocked task.
Run resume-task to resume the blocked task and restart replicating the sharding DDL lock.
No bad impact. After you have manually paused and resumed the task, the DM-worker resumes replicating the sharding DDL lock and sends the lost lock information to DM-master. The subsequent sharding DDL statements can be replicated normally.
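The loss and recovery of lock information can be sketched as follows. This is a simplified model, not DM's code: the `Master` and `Worker` classes, their methods, and the lock ID are hypothetical. It assumes only what is described above, namely that DM-master keeps lock information in memory and that resuming a paused task makes the DM-worker send its pending lock information again.

```python
class Master:
    """Toy DM-master that keeps lock information in memory only."""

    def __init__(self):
        self.locks = {}  # lock ID -> set of workers that reported the DDL

    def receive(self, lock_id, worker):
        self.locks.setdefault(lock_id, set()).add(worker)

    def restart(self):
        self.locks = {}  # nothing is persisted, so a restart loses every lock


class Worker:
    """Toy DM-worker that blocks on a sharding DDL until the master answers."""

    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.blocked_on = None

    def hit_sharding_ddl(self, lock_id):
        self.blocked_on = lock_id
        self.master.receive(lock_id, self.name)

    def pause_and_resume(self):
        # Resuming the task re-sends the pending lock information.
        if self.blocked_on is not None:
            self.master.receive(self.blocked_on, self.name)


master = Master()
worker = Worker("worker-1", master)
worker.hit_sharding_ddl("lock-1")
master.restart()
print(master.locks)  # the master no longer knows about the lock
worker.pause_and_resume()
print(master.locks)  # the lock information is back on the master
```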
skip at the same time.