Wednesday, February 28, 2024

[389-users] Re: Determining max CSN of running server

> On 29 Feb 2024, at 05:20, William Faulk <d4hgcdgdmj@liamekaens.com> wrote:
>
> I'm having another replication problem where changes made on a particular server are not being replicated outward at all. Right now, I'm trying to determine what's going on during the replication process.
>
> (Caveat: I'm still running an old version of 389ds: v1.3.10. In particular, the dsconf utility does not exist.)
>
> My understanding is that when a server receives a change from a client, it wraps it up as a CSN and starts a replication session with its peers, during which it sends a message that states the greatest CSN that it originated. First off, is that a correct understanding?

Might be worth re-reading https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org/thread/UYP4PYBVVDKGKZVZTC34JVXNUVP2VAVI/

It doesn't send a single CSN. Replication compares the RUVs of the two servers and determines the range of CSNs that are missing from the consumer.

It's also not immediate. When the server accepts a change (add, mod etc), the change is associated with a CSN. But there may then be a delay before the two nodes actually communicate and exchange data.
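
As a simplified illustration of that RUV comparison (a toy model, not the actual 389-ds replication code), suppose each RUV is reduced to a mapping of replica id to the max CSN seen for that id. The function name and the sample CSNs are made up for the example.

```python
def missing_ranges(supplier_ruv, consumer_ruv):
    """Return {rid: (after_csn, up_to_csn)} for changes the consumer lacks.

    Both arguments map a replica id to the max CSN (20 hex digits) seen for it.
    after_csn is None when the consumer has nothing from that replica id.
    """
    ranges = {}
    for rid, supplier_max in supplier_ruv.items():
        consumer_max = consumer_ruv.get(rid)
        # Equal-length hex strings compare correctly once converted to integers.
        if consumer_max is None or int(consumer_max, 16) < int(supplier_max, 16):
            ranges[rid] = (consumer_max, supplier_max)
    return ranges


# Hypothetical RUVs: the consumer is behind on replica id 1 only.
supplier = {1: "65dfd1a0000500010000", 2: "65dfd100000000020000"}
consumer = {1: "65dfd1a0000200010000", 2: "65dfd100000000020000"}
print(missing_ranges(supplier, consumer))
```

Here only replica id 1 shows up in the result, because the consumer already holds everything the supplier knows about replica id 2.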

>
> If so, how can I determine what CSN a particular server is telling its replication peers during those sessions? I have a feeling that this server is, for some reason, sending an inaccurate number.

Generally you'd need replication logging (nsslapd-errorlog-level 8192). But it's very noisy and can be hard to read. What you need to see is the CSN ranges that the two servers agree to send.

Also remember CSNs are a monotonic Lamport clock. This means they only ever advance and can never step backwards, so they have some different properties to what you may expect. If they ever go backwards I think the replication handler throws a pretty nasty error.
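
To make the ordering concrete, here is a rough sketch of splitting a CSN into its parts. It assumes the usual 20-hex-digit string layout (8 digits of timestamp, then 4 each of sequence number, replica id and sub-sequence number). The class and function names are mine, not anything from the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class CSN:
    # Field order matters: dataclass ordering compares fields in this order.
    timestamp: int   # seconds since the epoch
    seqnum: int      # orders changes made within the same second
    rid: int         # replica id that originated the change
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {text!r}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        rid=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# Two made-up CSNs from replica id 1 in the same second, one sequence apart.
older = parse_csn("65dfd1a0000000010000")
newer = parse_csn("65dfd1a0000100010000")
print(older < newer)
```

Comparing the parsed values orders them by timestamp first and then by sequence number, which is what lets you say one CSN is "at or past" another.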

>
> In the cn=replica,cn=...,cn=mapping tree,cn=config tree, there are entries for each of the servers topology peers, and they contain nsds50ruv attributes that seem to be the RUVs that that server has received from those peers, right? But the nsds50ruv attribute also exists directly in the cn=replica if you explicitly ask for it. Is it possible that this is the server's own RUV?

I *think* so. It's been a while since I had to look. The nsds50ruv shows the RUV of the server, and I think the other replica entries are "what the peer's RUV was last time". But I think Thierry or Pierre would know more about that than me. Some of the replication monitoring code in newer versions does this for you, so I'd probably advise you to attempt to upgrade your environment. 1.3 is really old at this point (and I'm not sure if even RH or SUSE still support that version).

>
> Can I rely on the nsds50ruv attributes on this server's peers' cn=replica nsds50ruv attribute values to be an accurate reflection of what this server is sending as its CSN in replication sessions?
>
> Any other way to see what's going on in a replication session? (I'm even trying to decrypt a network capture, but I'm not having any luck with that yet.)
>
> In particular, I see the max CSN for this server in all of these RUVs less than CSNs recorded in the server's own log files.

The problem here is that to read the RUVs and then compare them, you need to read each RUV from each server and then check that they are advancing (not that they are equal). It's okay if the RUVs are not the same between two servers, because that can simply indicate that a server has accepted a write and not yet sent it to another node. In fact it's common in busy environments that every server has "slightly different state" because they have to continually replicate and converge.

For example, imagine some user A changes their password. Now that change has to propagate and converge between all the nodes in the topology. While that convergence is occurring, another user B could be changing their password. This can leave you with servers where:

* A and B passwords are original
* A password is changed, B original
* A password original, B changed
* A and B have been changed.

And all four of these states are valid!
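
Since the thing to check is "advancing" rather than "equal", one way to approach it is to poll the same server twice and compare the two readings. This is only a sketch under the assumption that you have already reduced each reading to a mapping of replica id to max CSN. The function name and labels are hypothetical.

```python
def ruv_progress(earlier, later):
    """Classify each replica id by how its max CSN moved between two polls."""
    report = {}
    for rid in sorted(set(earlier) | set(later)):
        before = earlier.get(rid)
        after = later.get(rid)
        if before is None or after is None:
            report[rid] = "incomplete"       # rid missing from one of the polls
        elif int(after, 16) > int(before, 16):
            report[rid] = "advanced"
        elif int(after, 16) == int(before, 16):
            report[rid] = "unchanged"        # fine if that replica took no writes
        else:
            report[rid] = "went backwards"   # should never happen
    return report


# Made-up readings taken from one server a few minutes apart.
first_poll = {1: "65dfd1a0000200010000", 2: "65dfd100000000020000"}
second_poll = {1: "65dfd1a0000200010000", 2: "65dfd2f0000000020000"}
print(ruv_progress(first_poll, second_poll))
```

An "unchanged" replica id is only suspicious if you know that replica accepted writes between the two polls, which sounds like the situation you are describing.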

If you want to assert that "some change I made at CSN X is on all servers" then you would need to read and parse the RUV from every server and ensure that all of them are at or past that CSN for that replica id.
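
A rough sketch of that check follows. It assumes you have already fetched the nsds50ruv values from each server (for example with ldapsearch) and that each element looks like "{replica <rid> <url>} <mincsn> <maxcsn>". The hostnames, CSNs and function names are invented for the example.

```python
import re

# Assumed value shape: "{replica <rid> <url>} <mincsn> <maxcsn>"
RUV_ELEMENT = re.compile(
    r"^\{replica\s+(\d+)\s+\S+\}\s+([0-9a-fA-F]{20})\s+([0-9a-fA-F]{20})$"
)


def max_csns(ruv_values):
    """Map replica id -> max CSN from one server's nsds50ruv values."""
    result = {}
    for value in ruv_values:
        match = RUV_ELEMENT.match(value.strip())
        if match:  # skips "{replicageneration} ..." and elements without CSNs
            result[int(match.group(1))] = match.group(3).lower()
    return result


def servers_missing(csn, ruvs_by_server):
    """Return the servers whose RUV has not reached csn for csn's replica id."""
    rid = int(csn[12:16], 16)  # replica id is hex digits 13-16 of the CSN
    behind = []
    for server, values in ruvs_by_server.items():
        seen = max_csns(values).get(rid)
        if seen is None or int(seen, 16) < int(csn, 16):
            behind.append(server)
    return behind


# Hypothetical data: ldap2 has not yet seen the change from replica id 1.
ruvs = {
    "ldap1.example.com": [
        "{replicageneration} 65dfd000000000010000",
        "{replica 1 ldap://ldap1.example.com:389} 65dfd000000000010000 65dfd1a0000500010000",
    ],
    "ldap2.example.com": [
        "{replicageneration} 65dfd000000000010000",
        "{replica 1 ldap://ldap1.example.com:389} 65dfd000000000010000 65dfd1a0000200010000",
    ],
}
print(servers_missing("65dfd1a0000300010000", ruvs))
```

An empty list would mean every server you polled is at or past that CSN. A non-empty list only tells you who is behind at that moment, not why.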

Either way - it's not trivial :)


--
Sincerely,

William Brown

Senior Software Engineer,
Identity and Access Management
SUSE Labs, Australia
--
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
