Wednesday, July 28, 2021

[389-users] DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

Hi,

This is in an IPA deployment. We have three masters/replicas in a triangular topology, A-B, B-C, C-A.
The systems are called: rotte, linge and iparep4.

rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
linge and iparep4 are CentOS 8 Stream, with 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64

Yesterday I removed some members from a user group on rotte. This caused the following errors
on linge (and on iparep4):

Jul 26 11:44:37 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (60fe8535001000030000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid: 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn 60fe8535001000030000
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (60fe8535001000030000) error (1).  Aborting replication session(conn=53596 op=65)

On rotte:

jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com" (linge:389): Consumer failed to replay change (uniqueid 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000): Operations error (1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com" (linge:389): Consumer failed to replay change (uniqueid 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000): Operations error(1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable to send endReplication extended operation (Operations error)
jul 26 11:44:46 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389): Replication bind with GSSAPI auth resumed
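
To line up the messages from both hosts, it can help to pull the CSNs out of the log lines and decode them. The following is only a minimal sketch, not a 389-ds tool: it assumes the CSN is 20 hex digits laid out as an 8-digit timestamp (seconds since the epoch), a 4-digit sequence number, a 4-digit replica ID and a 4-digit sub-sequence number. The names `find_csns` and `decode_csn` are made up for this example.

```python
import re
from datetime import datetime, timezone

# Assumed CSN layout: 8 hex (time) + 4 hex (seqnum) + 4 hex (replica id) + 4 hex (subseq).
CSN_TOKEN = re.compile(r"\b[0-9a-f]{20}\b")


def find_csns(line):
    """Return every 20-hex-digit token in a log line (hypothetical helper)."""
    return CSN_TOKEN.findall(line)


def decode_csn(csn):
    """Split a CSN string into its assumed fields (hypothetical helper)."""
    if not CSN_TOKEN.fullmatch(csn):
        raise ValueError(f"not a 20-hex-digit CSN: {csn!r}")
    timestamp = int(csn[0:8], 16)
    return {
        "timestamp": timestamp,
        "time_utc": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "seqnum": int(csn[8:12], 16),
        "replica_id": int(csn[12:16], 16),
        "subseqnum": int(csn[16:20], 16),
    }


if __name__ == "__main__":
    line = (
        "Consumer failed to replay change (uniqueid "
        "31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000): "
        "Operations error (1). Will retry later."
    )
    for csn in find_csns(line):
        print(csn, decode_csn(csn))
```

If the assumed layout is right, both CSNs in the messages above carry the same replica ID, so they come from the same originating replica and differ only in sequence number.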

As far as I can see, the user group was modified correctly on all replicas, but these errors
don't look healthy to me.

Is there anything I can do to see what went wrong? Is there something to improve
in the configuration?
--
Kees