Wednesday, August 11, 2021

[389-users] Re: Lock table is out of available lock entries

Hi,
I suspect that Kees discovered a new bug ...

As the retro changelog is persistent across reboot, it is normal that their timestamp are absolute, so IMHO there is no way relative time should be used/seen in trimming computation.
(in other words cur_time is wrong. )
My suspicion is that we got a regression when the slapi_eq timers where  rewritten to use relative time instead of absolute time) (cur_time is provided by slapi_eq API and is now relative ( it should rather be initialized to time(0) )
The funny thing is that nobody have seen that retrocl trimming was broken for so long ...

Regards Pierre,

On Wed, Aug 11, 2021 at 11:50 AM Kees Bakker <keesb@ghs.com> wrote:
On 09-08-2021 20:23, Mark Reynolds wrote:
> On 8/9/21 1:47 PM, Kees Bakker wrote:
>> On 09-08-2021 19:25, Mark Reynolds wrote:
>>> On 8/9/21 1:16 PM, Kees Bakker wrote:
>>>> On 09-08-2021 18:43, Mark Reynolds wrote:
>>>>> On 8/9/21 11:20 AM, Kees Bakker wrote:
>>>>>> On 09-08-2021 16:00, Mark Reynolds wrote:
>>>>>>> On 8/9/21 8:09 AM, Kees Bakker wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> When my dirsrv was trying to compact the databases I was getting
>>>>>>>> this
>>>>>>>> error
>>>>>>>>
>>>>>>>> [07/Aug/2021:23:59:02.715984489 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting databases ...
>>>>>>>> [07/Aug/2021:23:59:02.765932397 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: userRoot
>>>>>>>> [07/Aug/2021:23:59:03.518175414 +0200] - NOTICE - bdb_compact -
>>>>>>>> compactdb: compact userRoot - 417 pages freed
>>>>>>>> [07/Aug/2021:23:59:03.576427786 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: ipaca
>>>>>>>> [07/Aug/2021:23:59:03.659941533 +0200] - NOTICE - bdb_compact -
>>>>>>>> compactdb: compact ipaca - 419 pages freed
>>>>>>>> [07/Aug/2021:23:59:03.718445310 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting DB start: changelog
>>>>>>>> [08/Aug/2021:00:00:40.807571334 +0200] - NOTICE -
>>>>>>>> NSMMReplicationPlugin - changelog program - cl5CompactDBs -
>>>>>>>> compacting
>>>>>>>> replication changelogs...
>>>>>>>> [08/Aug/2021:00:00:54.309357211 +0200] - ERR - libdb - BDB2055 Lock
>>>>>>>> table is out of available lock entries
>>>>>>>> [08/Aug/2021:00:00:54.726504736 +0200] - ERR - bdb_compact -
>>>>>>>> compactdb: failed to compact changelog; db error - 12 Cannot
>>>>>>>> allocate
>>>>>>>> memory
>>>>>>>> [08/Aug/2021:00:00:54.801571421 +0200] - ERR - libdb - BDB2055 Lock
>>>>>>>> table is out of available lock entries
>>>>>>>> [08/Aug/2021:00:00:54.876618702 +0200] - ERR -
>>>>>>>> NSMMReplicationPlugin -
>>>>>>>> changelog program - cl5CompactDBs - Failed to compact
>>>>>>>> a797bb0b-be1d11eb-88c0b677-613aa2ad; db error - 12 Cannot allocate
>>>>>>>> memory
>>>>>>>> [08/Aug/2021:00:00:57.253006449 +0200] - NOTICE - bdb_compact -
>>>>>>>> Compacting databases finished.
>>>>>>>>
>>>>>>>> There are about 402k entries in cn=changelog.
>>>>>>>>
>>>>>>>> I have a few questions
>>>>>>>> 1) is it normal to have so many entries in cn=changelog? On another
>>>>>>>> replica I have almost 3M entries Isn't this cleaned up?
>>>>>>>> 2) the number of locks is 50000 (there are two config items).
>>>>>>>> Should I
>>>>>>>> increase that number? If so, increase to what?
>>>>>>>> 3) is there maybe something else going on, causing the
>>>>>>>> exhaustion of
>>>>>>>> the locks?
>>>>>>>
>>>>>>> Ok, so by default there is no changelog trimming enabled. So the
>>>>>>> changelog will grow without bounds, which is bad.
>>>>>>
>>>>>> How much of this [1] applies? Indeed it says "By default the
>>>>>> Replication Changelog does not use any trimming by default."
>>>>>> So, how is the trimming actually enabled? I have these entries
>>>>>>
>>>>>> dn: cn=changelog5,cn=config
>>>>>> cn: changelog5
>>>>>> nsslapd-changelogdir: /var/lib/dirsrv/slapd-EXAMPLE-COM/cldb
>>>>>> nsslapd-changelogmaxage: 30d
>>>>>> objectClass: top
>>>>>> objectClass: extensibleobject
>>>>>
>>>>> So trimming is set, but it's set to 30 days (30d), that
>>>>> could/should be
>>>>> shortened to "7d".
>>>>
>>>> Somehow I doubt that this will do anything.
>>>
>>> Ahhh, I thought this was the Replication Changelog, but it is the
>>> "Retro" Changelog.  Ok so check this out:
>>>
>>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/administration_guide/managing_replication-using_the_retro_changelog_plug_in#Using_the_Retro_Changelog_Plug_in-Trimming_the_Retro_Changelog
>>>
>>>
>>> Or update config directly using ldapmodify:
>>>
>>> # ldapmodify -D ...
>>> dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
>>> changetype: modify
>>> replace: nsslapd-changelogmaxage: 7d
>>> nsslapd-changelogmaxage: 7d
>>
>> (( Typo? The replace line is without the value ))
> yes, sorry...
>>
>> Hmm, that maxage was already at 2 days
>
> Ok, perhaps you are running into a known bug where indexes created by
> plugins are not enabled.  Or might be a config issue?
>
> Try this first.  Set in the retro changelog entry:
>
> nsslapd-changelog-trim-interval: 300   <--- 5 minutes
(Sorry to be picky, but it is nsslapd-changelogtrim-interval. It's default is 300.)
>
> Restart the server to make sure it was applied.  You can enable "plugin"
> logging to see what it does.
>
> nsslapd-errorlog-level: 65536
>
> wait over 5 minutes, then grep the errors log for "DSRetroclPlugin".
> This will tell us what the plugin is doing when it goes to trim the
> database.
>
> IMPORTANT, disable plugin logging immediately after you get your data.
> Set the errorlog level to 0 to go back to default logging.
>
> If the logging shows it's not finding any candidates then it might be
> the indexing issue I mentioned above.  In that case you just need to
> reindex the retro changelog:

Before changing anything I enabled Plug-in debugging which revealed something interesting

[root@linge ~]# grep trim /var/log/dirsrv/slapd-GHS-NL/errors-plugin-debug-20210811-1108
[11/Aug/2021:11:01:49.792450548 +0200] - DEBUG - DSRetroclPlugin - cltrim: ldrc=0, first_time=1622032141, cur_time=3702600
[11/Aug/2021:11:01:49.842339467 +0200] - DEBUG - DSRetroclPlugin - retrocl_housekeeping - changelog does not need to be trimmed
[11/Aug/2021:11:06:49.953401407 +0200] - DEBUG - DSRetroclPlugin - cltrim: ldrc=0, first_time=1622032141, cur_time=3702900
[11/Aug/2021:11:06:50.012249703 +0200] - DEBUG - DSRetroclPlugin - retrocl_housekeeping - changelog does not need to be trimmed

This shows that the interval is 300 (even if nsslapd-changelogtrim-interval is not explicitly set in cn=changelog5,cn=config

Look at cur_time. That is the "uptime" of the system, not UTC.

[root@linge ~]# uptime
  11:21:42 up 42 days, 20:49,  2 users,  load average: 0.15, 0.14, 0.16

Next, assuming this code from retrocl_trim.c is making the decision to trim

             first_time = retrocl_getchangetime(SLAPI_SEQ_FIRST, &ldrc);
             slapi_log_err(SLAPI_LOG_PLUGIN, RETROCL_PLUGIN_NAME,
                           "cltrim: ldrc=%d, first_time=%ld, cur_time=%ld\n",
                           ldrc, first_time, cur_time);
             if (LDAP_SUCCESS == ldrc && first_time > (time_t)0L &&
                 first_time + ts.ts_c_max_age < cur_time) {
                 must_trim = 1;
             }

There is a mixup of absolute (first_time) and relative (cur_time) time. Which of the two is wrong?

>
> # dsctl slapd-YOUR_INSTANCE stop
>
> # dsctl slapd-YOUR_INSTANCE db2index changelog
>
> # dsctl slapd-YOUR_INSTANCE start
>
> Then wait 5 minutes and see if anything was trimmed.
>
> Mark
>
>>
>> dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
>> cn: Retro Changelog Plugin
>> nsslapd-attribute: nsuniqueid:targetUniqueId
>> nsslapd-changelogmaxage: 2d
>> nsslapd-include-suffix: cn=dns,dc=example,dc=com
>> nsslapd-plugin-depends-on-named: Class of Service
>> nsslapd-plugin-depends-on-type: database
>> nsslapd-pluginDescription: Retrocl Plugin
>> nsslapd-pluginEnabled: on
>> nsslapd-pluginId: retrocl
>> nsslapd-pluginInitfunc: retrocl_plugin_init
>> nsslapd-pluginPath: libretrocl-plugin
>> nsslapd-pluginType: object
>> nsslapd-pluginVendor: 389 Project
>> nsslapd-pluginVersion: 1.4.3.231
>> nsslapd-pluginbetxn: on
>> nsslapd-pluginprecedence: 25
>> objectClass: top
>> objectClass: nsSlapdPlugin
>> objectClass: extensibleObject
>>
>>>
>>>
>>> HTH,
>>>
>>> Mark
>>>
>>>>
>>>> The first entry in cn=changelog is more than two months old. (May 26th
>>>> is about
>>>> the time I installed this server as a FreeIPA replica.)
>>>>
>>>> dn: changenumber=1,cn=changelog
>>>> objectClass: top
>>>> objectClass: changelogentry
>>>> objectClass: extensibleObject
>>>> targetuniqueid: 1e89ebcb-e20111e8-8820f96e-fc0c60a4
>>>> changeNumber: 1
>>>> targetDn:
>>>> idnsname=_ldap._tcp,idnsname=example.com.,cn=dns,dc=example,dc=com
>>>> changeTime: 20210526122901Z
>>>> changeType: modify
>>>> changes:: YWRkOiBzUlZSZWNvc...gplbnRyeXVzbjogMTEyCi0KAA==
>>>>
>>>> Is there a way to see trimming in action? I'm afraid it is not doing
>>>> anything.
>>
>> What diagnostic can I enable to see the trimming in action?
>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I recommend setting up the changelog max age to 7 days:
>>>>>>>
>>>>>>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/administration_guide/trimming_the_replication_changelog
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Once that the trimming cleans up the changelog the database
>>>>>>> compaction
>>>>>>> should work without issue.  You can also increase the database
>>>>>>> locks to
>>>>>>> 1 million (that does require a restart of the server to take
>>>>>>> effect).
>>>>>>
>>>>>> Let's hope that 1 million is enough :-)
>>>>>
>>>>> It might not be, you can keep bumping it up if needed. There is no
>>>>> limit, or negative impact on db performance.
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>> [1]
>>>>>> https://directory.fedoraproject.org/docs/389ds/FAQ/changelog-trimming.html
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Directory Server Development Team
>>>>>
>>>> _______________________________________________
>>>> 389-users mailing list -- 389-users@lists.fedoraproject.org
>>>> To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
>>>> Fedora Code of Conduct:
>>>> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>> List Archives:
>>>> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
>>>>
>>>> Do not reply to spam on the list, report it:
>>>> https://pagure.io/fedora-infrastructure
>>>
>>> --
>>> Directory Server Development Team
>>>
>>
> --
> Directory Server Development Team
>
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure


--
--

389 Directory Server Development Team

No comments:

Post a Comment