Wednesday, September 13, 2023

[389-users] Re: 389-ds freezes with deadlock

Hi Thierry,

> First, you may install debuginfo; it would help to get a better
> understanding of what happens.

I will try to do that the next time it breaks. Unfortunately this is a
production machine and I can't always take the time to do forensics.
Sometimes I just have to get it up and running again quickly and
restart the service completely. I have not yet found a way to trigger
this in my lab environment.

> Do you know if it recovers after that high CPU peak?

So far it has never recovered. I have seen the high CPU peak 7 or 8
times now and it is always like this:
1. CPU usage peaks on 2 threads
2. Admin from external server tells me that his system cannot do LDAP
operations anymore.
3. I try to do some ldapmodify operations, which succeed and get
replicated correctly.
4. At this point there are 2 options:
a. Both the admin from the external server and I restart our services
which temporarily fixes the issue
b. I don't restart my system, and after a few hours (during which the
CPU peak does not go away) dirsrv freezes up completely and does not
accept any connections anymore.
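Since the freeze makes post-mortem analysis impossible once the service is restarted, one option is to capture all of the diagnostics suggested in this thread in one shot first. A minimal sketch, assuming gdb and debuginfo are installed; the instance path and output file names below are placeholders, not taken from the thread:

```shell
# Grab per-thread CPU usage, a full stack trace, and the DB lock table
# from the running ns-slapd before restarting it.
PID=$(pidof ns-slapd)

# Per-thread CPU view, two batch samples
top -b -H -n 2 -p "$PID" > top-threads.txt

# Stack trace of all threads (needs debuginfo for useful symbols)
gdb -p "$PID" --batch -ex 'thread apply all bt full' > stacktrace.txt

# Berkeley DB lock table; the db path is a placeholder, adjust it
# to your instance
db_stat -N -C A -h /var/lib/dirsrv/slapd-INSTANCE/db > dblocks.txt
```

These are the same `top -H`, pstack/gdb, and `db_stat` steps Thierry and Mark suggest below, just collected together so nothing is lost before the restart.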

> Regarding the unindexed search, you may check if 'changeNumber' is
> indexed (equality). It looks related to a sync_repl search with no
> cookie or an old cookie. The search is on a different backend than
> Thread 62, so there is no conflict between the sync_repl unindexed
> search and the update on Thread 62.

The equality index is set for changeNumber. I will assume that this is a
different "problem" that has nothing to do with the high CPU load and
freezes, and will not look into it further for now.
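For anyone checking the same thing: the changeNumber equality index normally appears as a cn=config entry along these lines (the backend name "changelog" here is an assumption; it depends on which backend holds cn=changelog on your instance):

```
dn: cn=changenumber,cn=index,cn=changelog,cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: nsIndex
cn: changenumber
nsSystemIndex: false
nsIndexType: eq
```

An ldapsearch for that entry as Directory Manager (or `dsconf <instance> backend index list <backend>` on recent versions) should show nsIndexType: eq.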

Kind regards
Julian

On 12.09.23 at 14:21, Thierry Bordaz wrote:
> Hi Julian,
>
> Difficult to say. I do not recall a specific issue, but I know we fixed
> several bugs in sync_repl.
>
> First, you may install debuginfo; it would help to get a better
> understanding of what happens.
>
> The two threads are likely Thread 62 and a trickle thread (2 to 6),
> because of intensive db page updates.
> Do you know if it recovers after that high CPU peak?
> A possibility would be a large update being written back to the
> changelog. You may retrieve the problematic csn in the access log
> (during high CPU) and dump the update from the changelog with dbscan (-k).
>
> Regarding the unindexed search, you may check if 'changeNumber' is
> indexed (equality). It looks related to a sync_repl search with no
> cookie or an old cookie. The search is on a different backend than
> Thread 62, so there is no conflict between the sync_repl unindexed
> search and the update on Thread 62.
>
> best regards
> thierry
>
> On 9/12/23 13:52, Julian Kippels wrote:
>> Hi,
>>
>> there are two threads that are at 100% CPU utilisation. I did not
>> start any admin task myself; maybe it is some built-in task that is
>> doing this? Or could an unindexed search on the changelog be causing
>> this?
>>
>> I have noticed this message:
>> NOTICE - ldbm_back_search - Unindexed search: search
>> base="cn=changelog" scope=1 filter="(changeNumber>=1)" conn=35871 op=1
>>
>> There is an external server that is reading the changelog and syncing
>> some stuff based on it. I don't know why they are starting at
>> changeNumber>=1; they probably should start way higher. If this could
>> be the cause, I will kick them to stop that ;)
>>
>> I am running version 2.3.1 on Debian 12, installed from the Debian
>> repositories.
>>
>> Kind regards
>> Julian
>>
On 08.09.23 at 13:23, Thierry Bordaz wrote:
>>> Hi Julian,
>>>
>>> It looks like an update (Thread 62) is either eating CPU or is
>>> blocked while updating the changelog.
>>> When it occurs, could you run 'top -H -p <pid>' to see if some
>>> threads are eating CPU?
>>> Otherwise (no CPU consumption), you may take a pstack and dump DB
>>> lock info (db_stat -N -C A -h /var/lib/dirsrv/<inst>db)
>>>
>>> Did you run an admin task (import/export/index...) before it occurred?
>>> What version are you running?
>>>
>>> best regards
>>> Thierry
>>>
>>> On 9/8/23 09:28, Julian Kippels wrote:
>>>> Hi,
>>>>
>>>> it happened again, and this time I ran the gdb command as Mark
>>>> suggested. The stack trace is attached. Again I got this error message:
>>>>
>>>> [07/Sep/2023:15:22:43.410333038 +0200] - ERR - ldbm_back_seq -
>>>> deadlock retry BAD 1601, err=0 Unexpected dbimpl error code
>>>>
>>>> and the remote program that called also stopped working at that time.
>>>>
>>>> Thanks
>>>> Julian Kippels
>>>>
On 28.08.23 at 14:28, Thierry Bordaz wrote:
>>>>> Hi Julian,
>>>>>
>>>>> I agree with Mark's suggestion. If new connections are failing, a
>>>>> pstack plus the logged error message would be helpful.
>>>>>
>>>>> Regarding the logged error: the LDAP server relies on a database
>>>>> that, under pressure from multiple threads, may end up in a db_lock
>>>>> deadlock. In such a situation the DB selects one deadlocked thread
>>>>> and returns a DB_DEADLOCK error to it, while the other threads
>>>>> continue to proceed. This is a normal error that is caught by the
>>>>> server, which simply retries the DB access. If the same thread
>>>>> fails too many times, it stops retrying and returns a fatal error
>>>>> to the request.
>>>>>
>>>>> In your case it reports code 1601, which is a transient deadlock
>>>>> with retry. So the impacted request just retried and likely succeeded.
>>>>>
>>>>> best regards
>>>>> thierry
>>>>>
>>>>> On 8/24/23 14:46, Mark Reynolds wrote:
>>>>>> Hi Julian,
>>>>>>
>>>>>> It would be helpful to get a pstack/stacktrace so we can see where
>>>>>> DS is stuck:
>>>>>>
>>>>>> https://www.port389.org/docs/389ds/FAQ/faq.html#sts=Debugging%C2%A0Hangs
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
>>>>>>
>>>>>> On 8/24/23 4:13 AM, Julian Kippels wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using 389-ds version 2.3.1 and have encountered the same
>>>>>>> error twice in three days now. There are some MOD operations, and
>>>>>>> then I get a line like this in the errors log:
>>>>>>>
>>>>>>> [23/Aug/2023:13:27:17.971884067 +0200] - ERR - ldbm_back_seq -
>>>>>>> deadlock retry BAD 1601, err=0 Unexpected dbimpl error code
>>>>>>>
>>>>>>> After this the server keeps running, systemctl status says
>>>>>>> everything is fine, but new incoming connections are failing with
>>>>>>> timeouts.
>>>>>>>
>>>>>>> Any advice would be welcome.
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Julian Kippels
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> 389-users mailing list -- 389-users@lists.fedoraproject.org
>>>>>>> To unsubscribe send an email to
>>>>>>> 389-users-leave@lists.fedoraproject.org
>>>>>>> Fedora Code of Conduct:
>>>>>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>>>>>> List Guidelines:
>>>>>>> https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>>>>> List Archives:
>>>>>>> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
>>>>>>> Do not reply to spam, report it:
>>>>>>> https://pagure.io/fedora-infrastructure/new_issue
>>>>>>
>>>>
>>>>
>>
>>
>

--
---------------------------------------------------------
| | Julian Kippels
| | M.Sc. Informatik
| |
| | Zentrum für Informations- und Medientechnologie
| | Heinrich-Heine-Universität Düsseldorf
| | Universitätsstr. 1
| | Raum 25.41.O1.32
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-81-14920
| | mail: kippels@hhu.de
---------------------------------------------------------
