Wednesday, July 28, 2021

[389-users] Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

So, how do I change nsslapd-db-deadlock-policy? Is this a "local" setting, or
do I need to change it on all replicas?

On rotte (CentOS7) it is in
   cn=config,cn=ldbm database,cn=plugins,cn=config

On linge and iparep4 (CentOS 8 Stream) it is in
   cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
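
For reference, here is a minimal Python sketch of the change I have in mind. It only prints the LDIF that could be fed to ldapmodify on each host. It assumes the setting has to be applied per server (which is exactly what I am asking) and uses the value 6 suggested below. The DNs are the ones listed above; the helper name deadlock_policy_ldif and the host table are made up for illustration.

```python
# Sketch only: build the LDIF for setting nsslapd-db-deadlock-policy.
# The host-to-DN table mirrors the locations listed in this message;
# deadlock_policy_ldif is a hypothetical helper, not part of any 389-ds tool.

CONFIG_DN_BY_HOST = {
    # CentOS 7, 389-ds-base 1.3.x
    "rotte": "cn=config,cn=ldbm database,cn=plugins,cn=config",
    # CentOS 8 Stream, 389-ds-base 1.4.x
    "linge": "cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config",
    "iparep4": "cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config",
}


def deadlock_policy_ldif(host: str, policy: int = 6) -> str:
    """Return an LDIF modify record for the given host's config entry."""
    dn = CONFIG_DN_BY_HOST[host]
    return "\n".join([
        f"dn: {dn}",
        "changetype: modify",
        "replace: nsslapd-db-deadlock-policy",
        f"nsslapd-db-deadlock-policy: {policy}",
        "",  # trailing newline terminates the record
    ])


if __name__ == "__main__":
    for host in CONFIG_DN_BY_HOST:
        print(f"# {host}")
        print(deadlock_policy_ldif(host))
```

The output for each host would then be applied with ldapmodify against that host; whether a restart is needed afterwards is something I have not verified.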

On 28-07-2021 16:19, Thierry Bordaz wrote:
> On 7/28/21 3:47 PM, Kees Bakker wrote:
>> When you said:
>> > You may confirm that with a 'grep 60fe8535001000030000
>> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>>
>> On linge there is one hit with err=1, quickly followed by a hit with
>> err=0.
>> Is that a confirmation that replication succeeded after a retry?
>>
> Yes, that was a typo: the update completed successfully everywhere with err=0.
>
>
>> On 28-07-2021 14:36, Thierry Bordaz wrote:
>>> Hi Kees,
>>>
>>> Rotte successfully processed the problematic update
>>> (60fe8535001000030000), updating the database and recording the update
>>> in the changelog.
>>>
>>> Later, Rotte tried to replicate the update to linge, but the update
>>> failed on linge:
>>>
>>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK:
>>> Locker killed to resolve a deadlock))
>>>
>>> Rotte noticed this failure
>>>
>>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin -
>>> repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>> (linge:389): Consumer failed to replay change (uniqueid
>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>>> Operations error (1). Will retry later
>>>
>>> As the log says, it retried replicating the update later, and this time
>>> it succeeded. You said the value was correct on all
>>> replicas. You may confirm that with a 'grep 60fe8535001000030000
>>> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>>>
>>> The original replication failure (on linge) is possibly related to the
>>> deadlock policy. By default, in case of a DB deadlock, DS gives priority
>>> to the youngest transaction and aborts the other transactions to resolve
>>> the deadlock. This default works fine, but in the case of IPA, where
>>> updates are very often nested (because of the many plugin calls), it is
>>> not optimal. You may try nsslapd-db-deadlock-policy: 6 (priority to
>>> writers).
>>>
>>> DB_LOCK_DEADLOCK is a normal event. The server just retries. After too
>>> many retries, the operation itself fails. Replication then simply sends
>>> the failing operation again. ATM your topology looks healthy; you may try
>>> to update the deadlock policy.
>>>
>>> Regards
>>> thierry
>>>
>>>
>>> On 7/28/21 2:10 PM, Kees Bakker wrote:
>>>> Hi,
>>>>
>>>> This is in an IPA deployment. We have three masters/replicas in a
>>>> triangular topology, A-B, B-C, C-A.
>>>> The systems are called: rotte, linge and iparep4.
>>>>
>>>> rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
>>>> linge and iparep4 are CentOS 8 Stream, with
>>>> 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64
>>>>
>>>> Yesterday I removed some members from a user group on rotte. This
>>>> caused the following errors
>>>> on linge (and on iparep4).
>>>>
>>>> Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
>>>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>>>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>>>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
>>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>> [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin -
>>>> changelog program - _cl5WriteOperationTxn - Failed to write entry with
>>>> csn (60fe8535001000030000); db error - -30993 BDB0068
>>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>> [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin -
>>>> write_changelog_and_ruv - Can't add a change for
>>>> cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
>>>> 60fe8535001000030000
>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>> [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin -
>>>> process_postop - Failed to apply update (60fe8535001000030000) error
>>>> (1).  Aborting replication session(conn=53596 op=65)
>>>>
>>>> On rotte
>>>>
>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
>>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>>> (linge:389): Consumer failed to replay change (uniqueid
>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>>>> Operations error (1). Will retry later.
>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>> [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
>>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>>> (linge:389): Consumer failed to replay change (uniqueid
>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
>>>> Operations error(1). Will retry later.
>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>> [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin -
>>>> release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
>>>> to send endReplication extended operation (Operations error)
>>>> jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
>>>> [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
>>>> - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
>>>> Replication bind with GSSAPI auth resumed
>>>>
>>>> As far as I can see the user group is correctly modified on all
>>>> replicas. But it doesn't
>>>> look healthy to me.
>>>>
>>>> Is there anything I can do to see what went wrong? Is there something
>>>> to improve
>>>> in the configuration?
>>> _______________________________________________
>>> 389-users mailing list -- 389-users@lists.fedoraproject.org
>>> To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
>>> Fedora Code of Conduct:
>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives:
>>> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
>>> Do not reply to spam on the list, report it:
>>> https://pagure.io/fedora-infrastructure
