Thursday, February 29, 2024

[389-users] Re: Determining max CSN of running server

Thanks, Pierre and Thierry.

After quite some time poring over these debug logs, I've found some anomalies, and they seem to match the idea that the affected replica isn't updating its own RUV correctly.

The logs show a change being made and list the CSN of that change. The first anomalies are here, though they probably aren't terribly significant. The CSN includes a timestamp, and the timestamp on this CSN is 11 hours ahead of when the change was made and logged. Also, the next part of the CSN is supposed to be a sequence number that distinguishes changes made during the same second of the timestamp. In the case I was looking at, that value was 0xb231 (45617 in decimal). I'm certain that this replica didn't record another 45,000 changes in that second.
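For anyone wanting to check CSN fields like these by hand, here is a minimal Python sketch. It assumes the usual 20-hex-digit CSN layout (8 digits of Unix timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number); the names `CSN` and `parse_csn` and the sample value are made up for illustration, not taken from the logs.

```python
import string
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    """Decoded CSN fields (assumed layout: 8+4+4+4 hex digits)."""
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # counter for changes within the same second
    replica_id: int  # ID of the replica that originated the change
    subseqnum: int   # sub-sequence number

    @property
    def when(self) -> datetime:
        """Timestamp field as an aware UTC datetime."""
        return datetime.fromtimestamp(self.timestamp, tz=timezone.utc)


def parse_csn(text: str) -> CSN:
    """Split a 20-character hex CSN string into its four fields."""
    text = text.strip()
    if len(text) != 20 or any(c not in string.hexdigits for c in text):
        raise ValueError(f"not a 20-hex-digit CSN: {text!r}")
    bounds = ((0, 8), (8, 12), (12, 16), (16, 20))
    return CSN(*(int(text[a:b], 16) for a, b in bounds))


if __name__ == "__main__":
    # Hypothetical CSN; only the sequence number 0xb231 mirrors the post.
    csn = parse_csn("65e0a1b2b23100030000")
    print(csn.when.isoformat(), csn.seqnum, csn.replica_id, csn.subseqnum)
```

Comparing `csn.when` against the time the log line was written makes a skew like the 11 hours described above easy to measure.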

The log then shows the server committing the change to the changelog. It reports "processing data" for over 16,000 other CSNs, which takes about 25 seconds to complete.

It then starts a replication session with the peer and prints out the peer's (consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for itself shows the maxCSN for itself with a timestamp from almost 4 months ago. That maxCSN is slightly greater than the maxCSN for this replica in the consumer's RUV, however. (The replicagenerations are equal.)
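The supplier/consumer comparison described above can be sketched in Python as follows. This assumes RUV elements are written as `{replica <rid> <url>} <mincsn> <maxcsn>` and that CSNs order by timestamp, sequence number, replica ID, then sub-sequence number; the function names, hostnames, and CSN values are hypothetical.

```python
import re

# Assumed textual form of one RUV element: "{replica <rid> <url>} <mincsn> <maxcsn>".
RUV_ELEMENT = re.compile(
    r"^\{replica\s+(?P<rid>\d+)(?:\s+(?P<url>[^}]*))?\}"
    r"(?:\s+(?P<mincsn>[0-9a-fA-F]{20})\s+(?P<maxcsn>[0-9a-fA-F]{20}))?"
)


def max_csns(ruv_values):
    """Map replica ID -> maxCSN for every element that carries CSNs."""
    result = {}
    for value in ruv_values:
        m = RUV_ELEMENT.match(value.strip())
        if m and m.group("maxcsn"):
            result[int(m.group("rid"))] = m.group("maxcsn").lower()
    return result


def csn_key(csn):
    """Sort key: (timestamp, seqnum, replica ID, subseqnum)."""
    return tuple(int(csn[a:b], 16) for a, b in ((0, 8), (8, 12), (12, 16), (16, 20)))


def supplier_ahead(supplier_ruv, consumer_ruv):
    """Return {rid: (supplier_max, consumer_max)} where the supplier is ahead."""
    supplier, consumer = max_csns(supplier_ruv), max_csns(consumer_ruv)
    ahead = {}
    for rid, smax in supplier.items():
        cmax = consumer.get(rid)
        if cmax is None or csn_key(smax) > csn_key(cmax):
            ahead[rid] = (smax, cmax)
    return ahead


if __name__ == "__main__":
    # Made-up RUVs: the supplier's maxCSN for replica 3 is one seqnum ahead.
    supplier = [
        "{replicageneration} 5f000000000000010000",
        "{replica 3 ldap://a.example.com:389} 5f000000000000030000 65400000000100030000",
    ]
    consumer = [
        "{replicageneration} 5f000000000000010000",
        "{replica 3 ldap://a.example.com:389} 5f000000000000030000 65400000000000030000",
    ]
    print(supplier_ahead(supplier, consumer))
```

A replica ID that shows up in the result is one for which the supplier believes it has changes the consumer has not yet seen.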

It then claims to send 7 changes, all of which are skipped because "empty". It then claims that there are "No more updates to send" and releases the consumer and eventually closes the connection.

I like the idea that there's a list of pending operations that's blocking RUV updates. Is there any way for me to examine this list? That said, I do think it updated its own maxCSN in its own RUV by a few hours. The peer I'm looking at does seem to reflect the increased maxCSN for the bad replica in the RUV I can see in the "mapping tree". I've tried to reproduce this small update, but haven't been able to yet.

I also have another replica that seems to be experiencing the same problem, and I've restarted it with no improvement in symptoms. It might be different, though. It doesn't look like it discarded its changelog.

I definitely don't relish reinitializing from this bad replica, though. I'd have to perform a rolling reinitialization throughout our whole environment, and it takes ages and a lot of effort.

--
William Faulk
--
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue