Fedora Info: [389-users] Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

When you said:
> You may confirm that with a 'grep 60fe8535001000030000 <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1

On linge there is one hit with err=1, quickly followed by a hit with err=0.
Is that a confirmation that replication succeeded after a retry?
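
For reference, here is a minimal sketch of the check that grep performs, written in Python so the err= codes for one CSN can be collected in log order. The layout of the sample lines (a RESULT record carrying err= and csn= fields) is my assumption about the access log format, and the sample lines themselves are made up for illustration; they are not copied from linge.

```python
import re

# Assumption: each matching access-log line carries an "err=<n>" field.
ERR_RE = re.compile(r"\berr=(\d+)\b")


def results_for_csn(lines, csn):
    """Return the err= codes, in log order, of the lines that mention csn."""
    codes = []
    for line in lines:
        if csn not in line:
            continue
        match = ERR_RE.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


def replayed_ok(codes):
    """True when the last recorded result for the CSN is err=0."""
    return bool(codes) and codes[-1] == 0


# Hypothetical sample lines (timestamps and the second conn/op are invented).
SAMPLE = [
    "[26/Jul/2021:11:44:38 +0200] conn=53596 op=65 RESULT err=1 tag=103 "
    "nentries=0 csn=60fe8535001000030000",
    "[26/Jul/2021:11:44:47 +0200] conn=53600 op=5 RESULT err=0 tag=103 "
    "nentries=0 csn=60fe8535001000030000",
    "[26/Jul/2021:11:44:48 +0200] conn=53600 op=6 RESULT err=0 tag=103 "
    "nentries=0 csn=60fe8535003300030000",
]

if __name__ == "__main__":
    codes = results_for_csn(SAMPLE, "60fe8535001000030000")
    print(codes, "replayed ok" if replayed_ok(codes) else "not replayed")
```

An err=1 followed by a later err=0 for the same CSN would match the "failed once, then succeeded on retry" reading.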

On 28-07-2021 14:36, Thierry Bordaz wrote:
> Hi Kees,
>
> Rotte successfully processed the problematic update
> (60fe8535001000030000), updating the database and recording the update
> in the changelog.
>
> Later Rotte tried to replicate the update to linge, but the update
> failed on linge:
>
> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK:
> Locker killed to resolve a deadlock))
>
> Rotte noticed this failure:
>
> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin -
> repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
> (linge:389): Consumer failed to replay change (uniqueid
> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
> Operations error (1). Will retry later
>
> As mentioned in the log, it retried later to replicate the update,
> and this time it succeeded. You said the value was correct on all
> replicas. You may confirm that with a 'grep 60fe8535001000030000
> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>
> The reason for the original replication failure (on linge) is possibly
> related to the deadlock policy. By default, in case of a DB deadlock, DS
> gives priority to the youngest transaction and aborts the other
> transactions to resolve the deadlock. This default works fine, but in
> the case of IPA, where updates are very often nested (because of many
> plugin calls), it is not optimal. You may try
> nsslapd-db-deadlock-policy: 6 (priority to writers).
>
> DB_LOCK_DEADLOCK is a normal event. The server just retries. In case of
> too many retries, the operation itself fails. Replication just sends
> the failing operation again. At the moment your topology looks healthy;
> you may try to update the deadlock policy.
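
To make the suggested change concrete, here is a small sketch that renders an LDIF modify record setting nsslapd-db-deadlock-policy to 6. The attribute name and value come from the advice above; the entry DN is an assumption on my part (the location of the database configuration entry can differ between 389-ds-base versions), so it should be checked against the documentation for the installed version before feeding the output to ldapmodify.

```python
# Assumption: DN of the entry holding the database settings. Verify it for
# your 389-ds-base version; newer releases may keep it under another entry.
ASSUMED_CONFIG_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"


def deadlock_policy_ldif(policy, config_dn=ASSUMED_CONFIG_DN):
    """Render an LDIF modify record that replaces the deadlock policy."""
    value = int(policy)  # raises ValueError on non-numeric input
    return "\n".join(
        [
            f"dn: {config_dn}",
            "changetype: modify",
            "replace: nsslapd-db-deadlock-policy",
            f"nsslapd-db-deadlock-policy: {value}",
            "",  # blank line terminates the LDIF record
        ]
    )


if __name__ == "__main__":
    print(deadlock_policy_ldif(6))
```

The printed record can be saved to a file and reviewed before it is applied on each replica; whether a restart is needed afterwards is something to confirm in the documentation.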
>
> Regards
> thierry
>
>
> On 7/28/21 2:10 PM, Kees Bakker wrote:
>> Hi,
>>
>> This is in an IPA deployment. We have three masters/replicas in a
>> triangular topology, A-B, B-C, C-A.
>> The systems are called: rotte, linge and iparep4.
>>
>> rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
>> linge and iparep4 are CentOS 8 Stream, with
>> 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64
>>
>> Yesterday I removed some members from a user group on rotte. This
>> caused the following errors
>> on linge (and on iparep4):
>>
>> Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>> [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin -
>> changelog program - _cl5WriteOperationTxn - Failed to write entry with
>> csn (60fe8535001000030000); db error - -30993 BDB0068
>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>> [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin -
>> write_changelog_and_ruv - Can't add a change for
>> cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
>> 60fe8535001000030000
>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>> [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin -
>> process_postop - Failed to apply update (60fe8535001000030000) error
>> (1). Aborting replication session(conn=53596 op=65)
>>
>> On rotte
>>
>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>> (linge:389): Consumer failed to replay change (uniqueid
>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>> Operations error (1). Will retry later.
>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>> [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>> (linge:389): Consumer failed to replay change (uniqueid
>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
>> Operations error(1). Will retry later.
>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>> [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin -
>> release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
>> to send endReplication extended operation (Operations error)
>> jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
>> [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
>> - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
>> Replication bind with GSSAPI auth resumed
>>
>> As far as I can see the user group is correctly modified on all
>> replicas. But it doesn't
>> look healthy to me.
>>
>> Is there anything I can do to see what went wrong? Is there something
>> to improve
>> in the configuration?
> _______________________________________________
> 389-users mailing list -- 389-users@lists.fedoraproject.org
> To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
Wednesday, July 28, 2021
