Wednesday, July 28, 2021

[389-users] Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

On 7/28/21 4:43 PM, Kees Bakker wrote:
> So, how do I change nsslapd-db-deadlock-policy? Is this a "local"
> config? Do I need to change it on all replicas?
I think so.
>
> On rotte (CentOS7) it is in
>    cn=config,cn=ldbm database,cn=plugins,cn=config
>
> On linge and iparep4 (CentOS 8 Stream) it is in
>    cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config


In CentOS 8, compared to 7, work was done to split the code related to
the backend from the code related to the database. As a consequence, a
new entry, cn=bdb, was created that contains the database-specific
tunings. Before that, those tunings lived in
cn=config,cn=ldbm database,cn=plugins,cn=config along with the
backend-specific tunings.

deadlock-policy is database (bdb) specific, so you can tune it in the
ldbm config entry on 7 and in the bdb entry on 8.
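
To illustrate, here is a minimal Python sketch that picks the entry to
modify and prints an LDIF change for it. The two DNs and the value 6 come
from this thread; the function names and the has_bdb_entry flag are
hypothetical, and you should check on your version whether a server
restart is needed for the change to take effect.

```python
# Sketch only: build an LDIF modify for nsslapd-db-deadlock-policy.
# The DNs are the two config entries quoted in this thread; the helper
# names and the has_bdb_entry flag are hypothetical.

BDB_DN = "cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config"
LDBM_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"


def deadlock_policy_dn(has_bdb_entry: bool) -> str:
    """Return the entry that holds the deadlock policy setting."""
    # Servers with the split config (CentOS 8 in this thread) use cn=bdb;
    # older ones (CentOS 7 in this thread) keep it in the ldbm config entry.
    return BDB_DN if has_bdb_entry else LDBM_DN


def deadlock_policy_ldif(has_bdb_entry: bool, policy: int = 6) -> str:
    """Return LDIF text that replaces the deadlock policy value."""
    return "\n".join([
        f"dn: {deadlock_policy_dn(has_bdb_entry)}",
        "changetype: modify",
        "replace: nsslapd-db-deadlock-policy",
        f"nsslapd-db-deadlock-policy: {policy}",
        "",
    ])


if __name__ == "__main__":
    print(deadlock_policy_ldif(has_bdb_entry=False))  # e.g. rotte
    print(deadlock_policy_ldif(has_bdb_entry=True))   # e.g. linge, iparep4
```

The printed LDIF can then be applied on each replica with your usual
LDAP modify tooling, bound as a user allowed to write cn=config.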

regards
thierry


>
> On 28-07-2021 16:19, Thierry Bordaz wrote:
>> On 7/28/21 3:47 PM, Kees Bakker wrote:
>>> When you said:
>>> > You may confirm that with a 'grep 60fe8535001000030000
>>> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>>>
>>> On linge there is one hit with err=1, quickly followed by a hit with
>>> err=0.
>>> Is that a confirmation that replication succeeded after a retry?
>>>
>> Yes. The "err=1" was a typo: the update completed successfully
>> everywhere with err=0.
>>
>>
>>> On 28-07-2021 14:36, Thierry Bordaz wrote:
>>>> Hi Kees,
>>>>
>>>> Rotte successfully processed the problematic update
>>>> (60fe8535001000030000), updating the database and recording the update
>>>> in the changelog.
>>>>
>>>> Later, Rotte tried to replicate the update to linge, but the update
>>>> failed on linge:
>>>>
>>>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
>>>> changelog program - _cl5WriteOperationTxn - retry (49) the transaction
>>>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
>>>> DB_LOCK_DEADLOCK:
>>>> Locker killed to resolve a deadlock))
>>>>
>>>> Rotte noticed this failure:
>>>>
>>>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN -
>>>> NSMMReplicationPlugin -
>>>> repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>>> (linge:389): Consumer failed to replay change (uniqueid
>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>>>> Operations error (1). Will retry later
>>>>
>>>> As mentioned in the log, it retried later to replicate the update,
>>>> and this time it succeeded. You said the value was correct on all
>>>> replicas. You may confirm that with a 'grep 60fe8535001000030000
>>>> <rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
>>>>
>>>> The reason for the original replication failure (on linge) is possibly
>>>> related to the deadlock policy. By default, in case of a DB deadlock,
>>>> DS gives priority to the youngest transaction and aborts the other
>>>> transactions to resolve the deadlock. This default works fine, but in
>>>> the case of IPA, where updates are very often nested (because of the
>>>> many plugin calls), it is not optimal. You may try
>>>> nsslapd-db-deadlock-policy: 6 (priority to writers).
>>>>
>>>> DB_LOCK_DEADLOCK is a normal event. The server just retries. After
>>>> too many retries, the operation itself fails. Replication then simply
>>>> sends the failing operation again. At the moment your topology looks
>>>> healthy; you may try to update the deadlock policy.
>>>>
>>>> Regards
>>>> thierry
>>>>
>>>>
>>>> On 7/28/21 2:10 PM, Kees Bakker wrote:
>>>>> Hi,
>>>>>
>>>>> This is in an IPA deployment. We have three masters/replicas in a
>>>>> triangular topology, A-B, B-C, C-A.
>>>>> The systems are called: rotte, linge and iparep4.
>>>>>
>>>>> rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
>>>>> linge and iparep4 are CentOS 8 Stream, with
>>>>> 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64
>>>>>
>>>>> Yesterday I removed some members from a user group on rotte. This
>>>>> caused the following errors on linge (and on iparep4).
>>>>>
>>>>> Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
>>>>> [26/Jul/2021:11:44:37.947738548 +0200] - ERR -
>>>>> NSMMReplicationPlugin -
>>>>> changelog program - _cl5WriteOperationTxn - retry (49) the
>>>>> transaction
>>>>> (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
>>>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>>> [26/Jul/2021:11:44:38.000964611 +0200] - ERR -
>>>>> NSMMReplicationPlugin -
>>>>> changelog program - _cl5WriteOperationTxn - Failed to write entry
>>>>> with
>>>>> csn (60fe8535001000030000); db error - -30993 BDB0068
>>>>> DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>>> [26/Jul/2021:11:44:38.025996273 +0200] - ERR -
>>>>> NSMMReplicationPlugin -
>>>>> write_changelog_and_ruv - Can't add a change for
>>>>> cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
>>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
>>>>> 60fe8535001000030000
>>>>> Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
>>>>> [26/Jul/2021:11:44:38.062640602 +0200] - ERR -
>>>>> NSMMReplicationPlugin -
>>>>> process_postop - Failed to apply update (60fe8535001000030000) error
>>>>> (1).  Aborting replication session(conn=53596 op=65)
>>>>>
>>>>> On rotte
>>>>>
>>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>>> [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
>>>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>>>> (linge:389): Consumer failed to replay change (uniqueid
>>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
>>>>> Operations error (1). Will retry later.
>>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>>> [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
>>>>> - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
>>>>> (linge:389): Consumer failed to replay change (uniqueid
>>>>> 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
>>>>> Operations error(1). Will retry later.
>>>>> jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
>>>>> [26/Jul/2021:11:44:39.069825407 +0200] - ERR -
>>>>> NSMMReplicationPlugin -
>>>>> release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
>>>>> to send endReplication extended operation (Operations error)
>>>>> jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
>>>>> [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
>>>>> - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
>>>>> Replication bind with GSSAPI auth resumed
>>>>>
>>>>> As far as I can see, the user group is correctly modified on all
>>>>> replicas. But it doesn't look healthy to me.
>>>>>
>>>>> Is there anything I can do to see what went wrong? Is there something
>>>>> to improve in the configuration?
>>>> _______________________________________________
>>>> 389-users mailing list -- 389-users@lists.fedoraproject.org
>>>> To unsubscribe send an email to
>>>> 389-users-leave@lists.fedoraproject.org
>>>> Fedora Code of Conduct:
>>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>>> List Guidelines:
>>>> https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>> List Archives:
>>>> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
>>>>
>>>> Do not reply to spam on the list, report it:
>>>> https://pagure.io/fedora-infrastructure
