Wednesday, July 28, 2021

[389-users] Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

On 7/28/21 3:47 PM, Kees Bakker wrote:
> When you said:
> > You may confirm that with a 'grep 60fe8535001000030000
> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>
> On linge there is one hit with err=1, quickly followed by a hit with
> err=0.
> Is that a confirmation that replication succeeded after a retry?
>
Yes, that was a typo. The update completed successfully everywhere with err=0.
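
If you would rather script that check than eyeball the grep output, here is a minimal sketch in Python. It only assumes what the grep above already relies on: that the matching access log lines contain the CSN and an err=<number> field. The function names are made up for illustration and are not part of 389-ds.

```python
import re

# Matches the numeric result code in an access-log line, e.g. "err=0".
ERR_RE = re.compile(r"\berr=(\d+)\b")


def result_codes_for_csn(lines, csn):
    """Return the err= codes, in log order, of the lines that mention csn."""
    codes = []
    for line in lines:
        if csn not in line:
            continue  # line belongs to some other operation
        match = ERR_RE.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


def succeeded_in_the_end(codes):
    """True when the update was seen at least once and its last result is err=0."""
    return bool(codes) and codes[-1] == 0


def had_failed_attempt(codes):
    """True when at least one attempt ended with a non-zero result code."""
    return any(code != 0 for code in codes)
```

Feed it the lines of the access log files on each replica. For the hit you describe on linge (err=1 quickly followed by err=0), result_codes_for_csn should return [1, 0], so both succeeded_in_the_end and had_failed_attempt are true: one failed attempt, then a successful retry.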


> On 28-07-2021 14:36, Thierry Bordaz wrote:
>> Hi Kees,
>>
>> Rotte successfully processed the problematic update
>> (60fe8535001000030000), updating the database and recording the update
>> in the changelog.
>>
>> Later, Rotte tried to replicate the update to linge, but the update
>> failed on linge:
>>
>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK:
>> Locker killed to resolve a deadlock))
>>
>> Rotte noticed this failure
>>
>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin -
>> repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>> (linge:389): Consumer failed to replay change (uniqueid
>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>> Operations error (1). Will retry later
>>
>> As mentioned in the log, it retried later to replicate the update
>> and this time it succeeded. You said the value was correct on all
>> replicas. You may confirm that with a 'grep 60fe8535001000030000
>> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>>
>> The reason for the original replication failure (on linge) is possibly
>> related to the deadlock policy. By default, in case of a DB deadlock, DS
>> gives priority to the youngest transaction and aborts the other txns
>> to resolve the deadlock. This default value works fine, but in the case
>> of IPA, where updates are very often nested (because of many plugin
>> calls), it is not optimal. You may try nsslapd-db-deadlock-policy: 6
>> (priority to writers).
>>
>> DB_LOCK_DEADLOCK is a normal event. The server just retries. In case of
>> too many retries, the operation itself fails. Replication just sends
>> the failing operation again. ATM your topology looks healthy. You may
>> try to update the deadlock policy.
>>
>> Regards
>> thierry
>>
>>
>> On 7/28/21 2:10 PM, Kees Bakker wrote:
>>> Hi,
>>>
>>> This is in an IPA deployment. We have three masters/replicas in a
>>> triangular topology, A-B, B-C, C-A.
>>> The systems are called: rotte, linge and iparep4.
>>>
>>> rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
>>> linge and iparep4 are CentOS 8 Stream, with
>>> 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64
>>>
>>> Yesterday I removed some members from a user group on rotte. This
>>> caused the following errors
>>> on linge (and on iparep4).
>>>
>>> Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
>>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>> [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin -
>>> changelog program - _cl5WriteOperationTxn - Failed to write entry with
>>> csn (60fe8535001000030000); db error - -30993 BDB0068
>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>> [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin -
>>> write_changelog_and_ruv - Can't add a change for
>>> cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
>>> 60fe8535001000030000
>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>> [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin -
>>> process_postop - Failed to apply update (60fe8535001000030000) error
>>> (1).  Aborting replication session(conn=53596 op=65)
>>>
>>> On rotte
>>>
>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>> (linge:389): Consumer failed to replay change (uniqueid
>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>>> Operations error (1). Will retry later.
>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>> [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>> (linge:389): Consumer failed to replay change (uniqueid
>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
>>> Operations error(1). Will retry later.
>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>> [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin -
>>> release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
>>> to send endReplication extended operation (Operations error)
>>> jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
>>> [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
>>> - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
>>> Replication bind with GSSAPI auth resumed
>>>
>>> As far as I can see the user group is correctly modified on all
>>> replicas. But it doesn't
>>> look healthy to me.
>>>
>>> Is there anything I can do to see what went wrong? Is there something
>>> to improve
>>> in the configuration?
>> _______________________________________________
>> 389-users mailing list -- 389-users@lists.fedoraproject.org
>> To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
>> Fedora Code of Conduct:
>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives:
>> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
>> Do not reply to spam on the list, report it:
>> https://pagure.io/fedora-infrastructure
