Wednesday, July 28, 2021

[389-users] Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

Hi Kees,

Rotte successfully processed the problematic update
(60fe8535001000030000), updating the database and recording the update
in the changelog.

Later, Rotte tried to replicate the update to linge, but the update
failed on linge:

[26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
changelog program - _cl5WriteOperationTxn - retry (49) the transaction
(csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK:
Locker killed to resolve a deadlock))

Rotte noticed this failure:

[26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin -
repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
(linge:389): Consumer failed to replay change (uniqueid
31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
Operations error (1). Will retry later

As the log says, rotte retried the update later, and this time it
succeeded. You said the value was correct on all replicas. You can
confirm that with 'grep 60fe8535001000030000
<rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
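
For anyone who prefers a script to grep, here is a minimal sketch of the
same check. It assumes that the access log result lines carry both an
"err=N" field and the "csn=..." value on the same line; the helper names
(err_codes_for_csn, scan_access_logs) are made up for this example and
are not part of 389-ds.

```python
import glob
import re
from collections import Counter

# Matches the LDAP result code field, e.g. "err=1" (assumed log layout).
ERR_RE = re.compile(r"\berr=(\d+)\b")


def err_codes_for_csn(lines, csn):
    """Count the err=N codes found on lines that mention the given CSN."""
    counts = Counter()
    for line in lines:
        if csn not in line:
            continue
        match = ERR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def scan_access_logs(pattern, csn):
    """Apply err_codes_for_csn to every file matching a glob pattern."""
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            counts.update(err_codes_for_csn(handle, csn))
    return counts
```

Run scan_access_logs() on each replica with that host's own access log
path and the CSN above. A non-zero count for code 1 corresponds to the
failed attempt, a count for code 0 to a successful one.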

The original replication failure (on linge) is possibly related to the
deadlock policy. By default, when a DB deadlock occurs, DS gives
priority to the youngest transaction and aborts the other transactions
to resolve the deadlock. This default works fine in general, but in IPA,
where updates are very often nested (because of the many plugin calls),
it is not optimal. You may try nsslapd-db-deadlock-policy: 6 (priority
to writers).
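
As a rough sketch of how that setting could be applied, the snippet
below renders an LDIF modify that can be passed to ldapmodify. The
function name is made up for this example, and the DN of the config
entry is deliberately left as a parameter: where the attribute lives
depends on the 389-ds version, so the DN shown in the comment is an
assumption to verify against your own instance.

```python
def deadlock_policy_ldif(config_dn, policy=6):
    """Render an LDIF modify that sets nsslapd-db-deadlock-policy."""
    return "\n".join([
        f"dn: {config_dn}",
        "changetype: modify",
        "replace: nsslapd-db-deadlock-policy",
        f"nsslapd-db-deadlock-policy: {policy}",
        "",
    ])


# Assumed DN, check it on your version before using it:
#   cn=config,cn=ldbm database,cn=plugins,cn=config
# print(deadlock_policy_ldif("cn=config,cn=ldbm database,cn=plugins,cn=config"))
```

Since each server keeps its own configuration, the change would have to
be made on every replica where you want the new policy.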

DB_LOCK_DEADLOCK is a normal event. The server just retries. After too
many retries, the operation itself fails, and replication simply sends
the failing operation again. At the moment your topology looks healthy,
but you may try updating the deadlock policy.
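
To illustrate the retry behaviour described above, here is a simplified
sketch of a bounded retry loop. It is not the server's actual code: the
exception classes and run_with_retries are invented for this example,
and the limit of 50 attempts is only an assumption based on the
"retry (49)" seen in the log.

```python
class DeadlockError(Exception):
    """Stand-in for a DB_LOCK_DEADLOCK result from the database layer."""


class OperationFailed(Exception):
    """Raised when the transaction still deadlocks after every attempt."""


def run_with_retries(txn, max_retries=50):
    """Call txn() until it succeeds or max_retries attempts are used up."""
    for attempt in range(max_retries):
        try:
            return txn()
        except DeadlockError:
            # This locker was chosen as the deadlock victim: the work is
            # rolled back and simply attempted again.
            continue
    raise OperationFailed(f"still deadlocked after {max_retries} attempts")
```

When OperationFailed reaches the caller, the operation is reported as
failed (the "Operations error (1)" above), and the supplier replays the
same change in a later replication session.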

Regards
thierry


On 7/28/21 2:10 PM, Kees Bakker wrote:
> Hi,
>
> This is in an IPA deployment. We have three masters/replicas in a
> triangular topology, A-B, B-C, C-A.
> The systems are called: rotte, linge and iparep4.
>
> rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
> linge and iparep4 are CentOS 8 Stream, with
> 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64
>
> Yesterday I removed some members from a user group on rotte. This
> caused the following errors on linge (and on iparep4).
>
> Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
> [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin -
> changelog program - _cl5WriteOperationTxn - Failed to write entry with
> csn (60fe8535001000030000); db error - -30993 BDB0068
> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
> [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin -
> write_changelog_and_ruv - Can't add a change for
> cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
> 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
> 60fe8535001000030000
> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
> [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin -
> process_postop - Failed to apply update (60fe8535001000030000) error
> (1).  Aborting replication session(conn=53596 op=65)
>
> On rotte
>
> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
> (linge:389): Consumer failed to replay change (uniqueid
> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
> Operations error (1). Will retry later.
> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
> [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
> (linge:389): Consumer failed to replay change (uniqueid
> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
> Operations error(1). Will retry later.
> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
> [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin -
> release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
> to send endReplication extended operation (Operations error)
> jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
> [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
> - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
> Replication bind with GSSAPI auth resumed
>
> As far as I can see the user group is correctly modified on all
> replicas. But it doesn't look healthy to me.
>
> Is there anything I can do to see what went wrong? Is there something
> to improve in the configuration?
