Tuesday, April 18, 2023

[389-users] A more profound replication monitoring of 389-ds instance

Hello everyone,

I would like advice on how to approach replication monitoring in an environment with approximately 30 FreeIPA servers, all in master-master replication, running 389-ds (389-ds-base-1.4.3.28-6). I am looking for ways to reduce the number of replicas (more are planned) and need to justify this to the architecture department with evidence based on experimental observations.

The problem we are facing is that our installation has started experiencing lags in some operations, such as adding user groups, HBAC rules, and sudo rules. The heaviest by impact is the automember-rebuild operation.
The number of entities being added is not large: at most 10 groups and several sudo and HBAC rules. For automember-rebuild I cannot say for certain, because I have not yet figured out what operations it performs internally. The "lag" manifests as latency in LDAP operations, leading to timeouts, which in turn cause some services that rely on Kerberos or DNS (because FreeIPA uses the LDAP directory for everything) to go down. Our monitoring system also shows that the outage propagates through the replicas as replication progresses.

The classic approach of monitoring replication agreements through the nsds5replicaLastUpdateStatus attribute is not sufficient. We need a more dynamic approach that can show the "waves" or replication sessions throughout the environment, which can help in further tuning replication parameters.
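
To illustrate the kind of measurement I have in mind, here is a minimal Python sketch. It assumes that the maximum CSN per replica ID can be collected from every server (as far as I understand, the nsds50ruv attribute exposes this), and that a CSN is a 20-hex-digit string made of an 8-digit timestamp, a 4-digit sequence number, a 4-digit replica ID, and a 4-digit sub-sequence number. The function names and the snapshot layout are my own invention for illustration, not part of any 389-ds tool.

```python
from typing import Dict, NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(csn: str) -> CSN:
    """Split a CSN string into its fields (assumed layout: 8+4+4+4 hex digits)."""
    if len(csn) != 20:
        raise ValueError(f"unexpected CSN length: {csn!r}")
    return CSN(
        timestamp=int(csn[0:8], 16),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


def lag_matrix(snapshot: Dict[str, Dict[int, str]]) -> Dict[str, Dict[int, int]]:
    """For each host and replica ID, return how many seconds the host's max CSN
    trails the newest max CSN seen anywhere in the snapshot.

    `snapshot` maps host name -> {replica ID -> max CSN string}; this layout is
    hypothetical and would be filled by polling each server.
    """
    # Newest timestamp observed per replica ID across all hosts.
    newest: Dict[int, int] = {}
    for ruv in snapshot.values():
        for rid, csn in ruv.items():
            ts = parse_csn(csn).timestamp
            newest[rid] = max(newest.get(rid, 0), ts)

    # Lag of each host relative to that newest timestamp.
    return {
        host: {rid: newest[rid] - parse_csn(csn).timestamp for rid, csn in ruv.items()}
        for host, ruv in snapshot.items()
    }


if __name__ == "__main__":
    # Made-up CSN values for two hypothetical hosts, both tracking replica ID 1.
    sample = {
        "ipa01": {1: "643e5c10000000010000"},
        "ipa02": {1: "643e5c06000000010000"},
    }
    for host, lags in lag_matrix(sample).items():
        print(host, lags)
```

Polling such a snapshot every few seconds and plotting the per-host lag over time should, in principle, make the "waves" visible. Note that the timestamp difference only says how old the consumer's newest change is relative to the supplier's; it is not a direct measure of how long a single update took to travel.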

I am facing the following problems:

1) The only way to get full replication information currently is to turn on full debug logging in the error log. While this can be done in test environments, I cannot rely on it in production. I thought that BPF could be the answer, but I am not sure whether dirsrv has internal support (predefined probe points) for it. Have any of the developers tried using BPF to monitor features in 389-ds?

2) Regardless of built-in BPF support, I can still try to implement monitoring with it, in conjunction with debug symbols. However, another problem is that I do not know the exact algorithm of the replication process. I have read this article (https://www.port389.org/docs/389ds/design/replication_troubleshooting.html), but it does not go into enough detail for my purposes. Can you shed some light on the approach I should take here? In my mind, the first step should be very basic: attach to a set of consumer-level functions responsible for receiving replica updates, and monitor the latency, the number of incoming connections at a given point in time, and so on. If you could point me in the right direction (beyond pointing me to the repository to search the source code), I would greatly appreciate it.

3) This feature (https://directory.fedoraproject.org/docs/389ds/design/log-operation-stats.html) is not supported for my version of 389-ds, is it? Is there a way to patch my version to support it?

Thank you in advance for your help.
