> Might be worth re-reading
Well, I still don't really know the details of the replication process.
I have deduced that changes originating on a replica seem to prompt that replica to start a replication process with its peers, but I don't really know what happens after that. The RUVs of the two replicas are compared, but does the initiating system send its RUV to the receiver, does it go the other way, or do both happen? Does the comparison prompt the comparing system to send the changes it thinks the other system needs, or does it cause the comparing system to request new changes from the other? Maybe none of this makes much difference, but the lack of technical detail around it makes me question everything.
> It doesn't send a single CSN, the replication compares the RUVs and determines the
> range of CSNs that are missing from the consumer.
Sure, but notionally any changes that originated on that replica would be reflected in the max CSN for that replica in the RUV used for the comparison. And at least one side sends its RUV to the other during the replication process.
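To make the "range of CSNs" idea concrete, here's a rough Python sketch of that comparison as I understand it from the description above. It is not the actual 389 DS code. It assumes each RUV has been boiled down to a hypothetical dict of replica ID to max CSN, and that CSNs use the 20-hex-digit layout of 8 digits of timestamp, 4 of sequence number, 4 of replica ID and 4 of sub-sequence number.

```python
from typing import Dict, Optional, Tuple

# Hypothetical RUV shape: replica ID -> max CSN string (20 hex digits).
Ruv = Dict[int, str]


def csn_key(csn: str) -> Tuple[int, int, int, int]:
    """Split a CSN into (timestamp, seqnum, replica ID, subseqnum) for ordering."""
    if len(csn) != 20:
        raise ValueError(f"unexpected CSN length: {csn!r}")
    return (int(csn[0:8], 16), int(csn[8:12], 16),
            int(csn[12:16], 16), int(csn[16:20], 16))


def missing_ranges(supplier: Ruv, consumer: Ruv) -> Dict[int, Tuple[Optional[str], str]]:
    """For each replica ID the consumer is behind on, return
    (consumer max CSN, exclusive; supplier max CSN, inclusive)."""
    ranges = {}
    for rid, supplier_max in supplier.items():
        consumer_max = consumer.get(rid)
        # A consumer that has never seen this replica ID needs everything.
        if consumer_max is None or csn_key(consumer_max) < csn_key(supplier_max):
            ranges[rid] = (consumer_max, supplier_max)
    return ranges
```

If the supplier's max CSN for a replica ID never advances, this kind of comparison has nothing to send for that ID, no matter how many changes the replica has actually accepted.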
> It's also not immediate. When the server accepts a change (add, mod, etc.), the
> change is associated with a CSN. But then there may be a delay before the two nodes actually
> communicate and exchange data.
Sure, but the changes originating on this replica haven't made it to the other replicas in weeks. This isn't a mere replication delay.
> Generally you'd need replication logging (errorloglevel 8192). But it's very noisy
> and can be hard to read. What you need to see is the ranges that they agree to send.
Okay. I've done that and haven't had a chance to pore through them yet.
> Also remember CSNs are a monotonic Lamport clock. This means they only ever advance
> and can never step backwards, so they have some different properties from what you may
> expect. If they ever go backwards, I think the replication handler throws a pretty nasty
> error.
I don't think it's going backwards. What I'm trying to rule out is that the replica is failing to advance its max CSN in the RUV being used to compare.
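For anyone else following along, this toy Python class shows what "monotonic Lamport clock" means in practice. It is a simplified, hypothetical generator, not 389 DS's own CSN generator: if the wall clock stalls or steps back, it bumps the sequence number instead, and when it sees a newer CSN from a peer it moves its own clock up to that point, so the CSNs it hands out never go backwards. The class and method names are made up, and the 20-hex-digit layout (timestamp, sequence number, replica ID, sub-sequence number) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyCsnGenerator:
    """Hypothetical, simplified CSN generator (not 389 DS's implementation)."""
    rid: int
    last_time: int = 0
    seq: int = 0

    def new_csn(self, wall_clock: int) -> str:
        if wall_clock > self.last_time:
            # Clock moved forward: adopt it and reset the sequence number.
            self.last_time = wall_clock
            self.seq = 0
        else:
            # Clock stalled or stepped back: keep the old time, bump the sequence.
            # (Overflow past 0xffff is not handled in this sketch.)
            self.seq += 1
        return f"{self.last_time:08x}{self.seq:04x}{self.rid:04x}{0:04x}"

    def observe(self, remote_csn: str) -> None:
        # Lamport rule: after seeing a newer CSN from a peer, move our clock up
        # to it so the next local CSN sorts after it.
        remote = (int(remote_csn[0:8], 16), int(remote_csn[8:12], 16))
        if remote > (self.last_time, self.seq):
            self.last_time, self.seq = remote
```

For example, `g = ToyCsnGenerator(rid=1)` followed by `g.new_csn(100)` and then `g.new_csn(90)` yields `00000064000000010000` and then `00000064000100010000`: the second CSN still sorts after the first even though the clock went back.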
> I *think* so. It's been a while since I had to look. The nsds50ruv shows the RUV of
> the server, and I think the other replica entries are "what the peer's RUV was last
> time".
Well, it's nice to hear that my guess at least isn't asinine. :)
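If that reading is right, the values are easy enough to pull apart in a script. The sketch below assumes nsds50ruv values shaped like `{replica <id> <url>} <min CSN> <max CSN>`, with the CSNs absent for a replica that hasn't produced any changes; I believe that's what a search for the RUV tombstone entry typically returns, but treat the exact layout as an assumption. The host names in the sample are made up.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class RuvElement(NamedTuple):
    rid: int
    url: str
    min_csn: Optional[str]
    max_csn: Optional[str]


# Assumed value layout: "{replica <rid> <url>} [<min csn> <max csn>]"
_ELEMENT = re.compile(
    r"^\{replica (\d+) (\S+)\}(?:\s+([0-9a-fA-F]{20})\s+([0-9a-fA-F]{20}))?"
)


def parse_ruv(values: List[str]) -> Dict[int, RuvElement]:
    """Parse nsds50ruv attribute values into a mapping of replica ID -> element."""
    ruv = {}
    for value in values:
        m = _ELEMENT.match(value.strip())
        if m is None:
            continue  # e.g. the {replicageneration} value
        rid = int(m.group(1))
        ruv[rid] = RuvElement(rid, m.group(2), m.group(3), m.group(4))
    return ruv


sample = [
    "{replicageneration} 51b7a8d8000000010000",
    "{replica 1 ldap://ds1.example.com:389} 51b7a8e0000000010000 5f3c1a2b000000010000",
    "{replica 2 ldap://ds2.example.com:389}",
]
parsed = parse_ruv(sample)
```

Here `parsed[1].max_csn` is `5f3c1a2b000000010000`, and replica 2 comes back with no CSNs at all.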
> The replication monitoring code in newer versions does this for you, so I'd probably
> advise you to attempt to upgrade your environment. 1.3 is really old at this point.
I've been trying to get the current environment stable enough that I feel comfortable going through the relatively lengthy upgrade process. I think I'm going to have to adjust my comfort level.
> (I'm not sure if even RH or SUSE still support that version anymore.)
Red Hat does, as it's what ships in RHEL 7.9, which is supported for another, uh, 4 months. They're working on this with me. I'm still just trying to understand the system better so that I can be productive while I'm waiting for them to come up with ideas.
> The problem here is that to read the RUVs and then compare them, you need to read
> each RUV from each server and then check if they are advancing (not that they are equal).
The problem is that changes in my environment are infrequent enough that all the replicas' RUVs _are_ equal the majority of the time. I'm not in front of that system as I write this, so my details might be wrong, but I'm asking about all of this because every RUV I see on all of the replicas is the same, and each one shows a max CSN for this one replica that's much older than the CSNs the logs reference for changes originating on that replica. The CSNs I see in the logs when a new change is made reference the current time, while the max CSN I see in the RUVs is from 4 months ago.
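Since the RUVs sit unchanged most of the time, a check along the lines of the advice above only means something if you take one snapshot, make a test change on the suspect replica, wait a bit, and take another. This sketch assumes you've already collected each server's RUV into a hypothetical dict of server name to {replica ID: max CSN}, and it just lists the servers whose max CSN for the given replica ID didn't move. It compares the CSN strings directly. That works because they are fixed-width hex with the fields in comparison order, and the sketch lowercases them first just in case.

```python
from typing import Dict, List

# Hypothetical snapshot shape: server name -> {replica ID -> max CSN string}
Snapshot = Dict[str, Dict[int, str]]


def stalled(before: Snapshot, after: Snapshot, rid: int) -> List[str]:
    """Return the servers whose max CSN for `rid` did not advance between snapshots."""
    result = []
    for server, ruv in after.items():
        old = before.get(server, {}).get(rid, "").lower()
        new = ruv.get(rid, "").lower()
        # Fixed-width hex CSNs (time, seq, rid, subseq) compare correctly as strings.
        if new <= old:
            result.append(server)
    return sorted(result)
```

In the situation described above, the suspect replica's own ID would show up as stalled on every server, including the replica itself.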
Maybe it *did* go backwards somehow and that's why it's not working. Not that that would really help me understand what actually went wrong any better than I do now.
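One cheap way to sanity-check that observation is to decode the timestamp portion of both CSNs and see how far apart they really are. This assumes the first 8 hex digits of a CSN are seconds since the Unix epoch; the helper names are just mine.

```python
from datetime import datetime, timezone


def csn_time(csn: str) -> datetime:
    """Decode the leading 8 hex digits of a CSN as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(csn[:8], 16), tz=timezone.utc)


def lag_days(ruv_max_csn: str, logged_csn: str) -> float:
    """How many days the RUV's max CSN trails a CSN seen in the logs."""
    return (csn_time(logged_csn) - csn_time(ruv_max_csn)).total_seconds() / 86400
```

A result around 120 would match the "4 months ago" above; a negative one would mean the logged CSN really is older than the RUV's max, which would point toward something going backwards.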
> If you want to assert that "Some change I made at CSN X is on all servers" then
> you would need to read and parse the ruv and ensure that all of them are at or past that
> CSN for that replica id.
Well, you'd think so. I've got that problem, too, where some CSNs just seem to get missed, but the max CSN in the RUV is well past that. But that's a different problem and not the one I'm working on now.
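For the record, the check described in the quote looks roughly like this sketch. It uses a hypothetical dict of server name to {replica ID: max CSN}, assumes the 20-hex-digit CSN layout (timestamp, sequence number, replica ID, sub-sequence number), and takes the replica ID from the CSN itself. As noted above, a max CSN at or past X doesn't prove that X itself was applied, so this can only flag servers that are definitely behind.

```python
from typing import Dict, List, Tuple


def csn_key(csn: str) -> Tuple[int, int, int, int]:
    """Split a 20-hex-digit CSN into (timestamp, seqnum, replica ID, subseqnum)."""
    return (int(csn[0:8], 16), int(csn[8:12], 16),
            int(csn[12:16], 16), int(csn[16:20], 16))


def servers_missing(ruvs: Dict[str, Dict[int, str]], csn: str) -> List[str]:
    """Return the servers whose max CSN for the CSN's replica ID is behind `csn`."""
    rid = csn_key(csn)[2]  # the replica ID is embedded in the CSN
    return sorted(
        server for server, ruv in ruvs.items()
        if rid not in ruv or csn_key(ruv[rid]) < csn_key(csn)
    )
```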
Thanks for the input.
--
William Faulk
--
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org