The explanation below looks excellent to me. You may also have a look at https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/deployment_guide/designing_the_replication_process#doc-wrapper
Regarding the initial concern about "having regular problems with missed replications": a key point is that replication is not synchronous, so an update is not applied to all replicas immediately. An LDAP client sends an update to one replica (the original replica), which propagates the update to other replicas; those can in turn propagate it to a next replica ("hops"). So there may be a delay (replication lag) between the original update and the time the last replica receives it. The delay is usually a few seconds, but it can depend on many factors.
As you noticed, updates are identified by CSNs, which are logged in the access log. If you suspect that an update is missing, check whether the related CSN is present in the access log files of the remote replicas. Note that access logs are buffered, so a very recent update may not show up in the file right away.
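To make the CSN check above a bit more concrete, here is a minimal sketch in Python. It assumes that the RESULT line of a write operation carries a `csn=` field with a 20-hex-digit value; the helper name `find_csn` and the sample log lines are invented for illustration, so adjust the pattern to whatever your access logs actually show.

```python
import re

# Assumption: the CSN appears as "csn=<20 hex digits>" on a log line.
CSN_FIELD = re.compile(r"\bcsn=([0-9a-fA-F]{20})\b")


def find_csn(lines, csn):
    """Return the log lines whose csn= field equals the given CSN."""
    wanted = csn.lower()
    hits = []
    for line in lines:
        match = CSN_FIELD.search(line)
        if match and match.group(1).lower() == wanted:
            hits.append(line.rstrip("\n"))
    return hits


# Invented sample lines, only meant to resemble an access log.
sample_log = [
    '[15/Nov/2023:09:59:01.000000000 -0500] conn=12 op=3 MOD dn="uid=jdoe,dc=example,dc=test"',
    "[15/Nov/2023:09:59:01.010000000 -0500] conn=12 op=3 RESULT err=0 tag=103 nentries=0 csn=6554d9a5000000040000",
    "[15/Nov/2023:09:59:02.000000000 -0500] conn=12 op=4 RESULT err=0 tag=103 nentries=0 csn=6554d9a6000000040000",
]

for hit in find_csn(sample_log, "6554d9a5000000040000"):
    print(hit)
```

To check a real replica, pass an open file object for that replica's access log instead of `sample_log`. An empty result right after the update does not prove the update is missing, because of the log buffering and the replication lag mentioned above.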
I'm not sure about docs, but the basic idea, iirc, is that a vector clock (called the replica update vector, or RUV) is constructed from the sequence numbers from each node. Therefore it isn't necessary to keep track of a list of CSNs, only to compare them to determine whether another node is caught up with, or behind, the state of the sending node. Using this scheme, each node connects to each other node and, by asking that node for its current RUV, can determine which, if any, of the changes it has need to be propagated to the peer. These are sent as (almost) regular LDAP operations: add, modify, delete. The consumer server then decides how to process each operation such that consistency is preserved (all nodes converge to the same state); e.g. it might skip an update because the current state of the entry is ahead of the update. It's what nowadays would be called a CRDT scheme, but that term didn't exist when the DS was developed.
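The RUV comparison described above can be sketched roughly as follows. This is a simplified model in Python, not the server's actual code. It assumes a CSN is a 20-hex-digit string laid out as timestamp (8 digits), sequence number (4), replica ID (4) and sub-sequence number (4), and it models an RUV as a mapping from replica ID to the newest CSN seen from that replica. The names `parse_csn` and `changes_to_send` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class CSN(NamedTuple):
    # Field order matters: tuples compare left to right, so CSNs
    # order by timestamp first, then by sequence number.
    timestamp: int   # seconds since the epoch
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its assumed fields."""
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %r" % text)
    return CSN(
        int(text[0:8], 16),
        int(text[8:12], 16),
        int(text[12:16], 16),
        int(text[16:20], 16),
    )


# Simplified RUV: replica ID -> newest CSN seen from that replica.
RUV = Dict[int, CSN]


def changes_to_send(changelog: List[CSN], peer_ruv: RUV) -> List[CSN]:
    """Return the changes the peer has not seen yet, oldest first."""
    pending = [
        csn
        for csn in changelog
        if csn.replica_id not in peer_ruv or csn > peer_ruv[csn.replica_id]
    ]
    return sorted(pending)


# Invented example: two changes from replica 4 and one from replica 7.
local_changes = [
    parse_csn("6554d9a5000000040000"),
    parse_csn("6554d9a6000000040000"),
    parse_csn("6554d9a7000000070000"),
]
# The peer has seen replica 4 up to the first change, nothing from replica 7.
peer_ruv = {4: parse_csn("6554d9a5000000040000")}

for csn in changes_to_send(local_changes, peer_ruv):
    print(csn)
```

In this model the supplier never stores a per-peer list of sent CSNs. It only needs the peer's current RUV and its own changelog, which is the point made above about comparing CSNs rather than tracking them.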
On Wed, Nov 15, 2023, at 9:59 AM, William Faulk wrote:
I am running a RedHat IdM environment and am having regular problems with missed replications. I want to understand how it's supposed to work better so that I can make reasonable hypotheses to test, but I cannot seem to find any in-depth documentation for it. Every time I think I start to piece together an understanding, experimentation makes it fall apart. Can someone either point me to some documentation or help me understand how it works?
In particular, IdM implements multimaster replication, and I'm initially trying to understand how changes are replicated in that environment. What I think I understand is that changes beget CSNs, which are composed of a timestamp and a replica ID, and some sort of comparison is made between the most recent CSNs in order to determine what changes need to be sent to the remote side. Does each replica keep a list of CSNs that have been sent to each other replica? Just the replicas that it peers with? Can I see this data? (I thought it might be in the nsds5replicationagreement entries, but the nsds50ruv values there don't seem to change.) But it feels like it doesn't keep that data, because then what would be the point of comparing the CSN values? Anyway, these are the types of questions I'm looking to understand. Can anyone help, please?
389-users mailing list -- email@example.com
To unsubscribe send an email to firstname.lastname@example.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue