Tuesday, October 15, 2024

[389-users] Re: Inconsistent Ldap connection issues

Hi William,

These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log?
Correct, those are the two different "can't contact LDAP" error issues. I have searched for various things in the logs, but I haven't read them line by line. I don't see "err=1", no fd errors, and no "Not listening for new connections - too many fds open".

We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?
It's a ramp-up of 500 users logging in and starting their searches; the initial ramp-up is 60 seconds, but the searches and login/logouts are spread over 6 minutes. I just sliced the logs to see what that first minute looked like (a pacing sketch follows the stats below):


Peak Concurrent Connections:   689
Total Operations:              18770
Total Results:                 18769
Overall Performance:           100.0%

Total Connections:             2603          (21.66/sec)  (1299.40/min)
 - LDAP Connections:           2603          (21.66/sec)  (1299.40/min)
 - LDAPI Connections:          0             (0.00/sec)  (0.00/min)
 - LDAPS Connections:          0             (0.00/sec)  (0.00/min)
 - StartTLS Extended Ops:      2571          (21.39/sec)  (1283.42/min)

Searches:                      13596         (113.12/sec)  (6787.01/min)
Modifications:                 0             (0.00/sec)  (0.00/min)
Adds:                          0             (0.00/sec)  (0.00/min)
Deletes:                       0             (0.00/sec)  (0.00/min)
Mod RDNs:                      0             (0.00/sec)  (0.00/min)
Compares:                      0             (0.00/sec)  (0.00/min)
Binds:                         2603          (21.66/sec)  (1299.40/min)
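
As an illustration of ramping vs. bursting, here is a rough Python sketch of the pacing logic only; do_session() is a placeholder, not the actual load tester:

import threading
import time

USERS = 500          # simulated users
RAMP_SECONDS = 60    # spread session starts over this window

def do_session(user_id):
    # placeholder: open a connection, StartTLS, bind, run searches, unbind
    pass

def ramped_start():
    # start one user roughly every RAMP_SECONDS / USERS seconds (~0.12 s apart)
    threads = []
    for i in range(USERS):
        t = threading.Thread(target=do_session, args=(i,))
        t.start()
        threads.append(t)
        time.sleep(RAMP_SECONDS / USERS)
    for t in threads:
        t.join()

def burst_start():
    # launch every user at once; all connection attempts hit the listen backlog together
    threads = [threading.Thread(target=do_session, args=(i,)) for i in range(USERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

ramped_start()

With the burst variant, all 500 connection attempts land on the listen backlog at roughly the same instant, which is where refusals tend to show up.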

With the settings below, the test results are in: they still get one LDAP error per test.

net.ipv4.tcp_max_syn_backlog = 8192

net.core.somaxconn = 8192

Suggestions? Should I bump these up more?
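
One thing worth noting: the kernel silently caps any listener's backlog argument at net.core.somaxconn, so the effective accept queue for ns-slapd is roughly min(nsslapd-listen-backlog-size, net.core.somaxconn). A rough Python sketch to compare the values (the 512 is an assumed current nsslapd-listen-backlog-size):

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

somaxconn = read_int("/proc/sys/net/core/somaxconn")
syn_backlog = read_int("/proc/sys/net/ipv4/tcp_max_syn_backlog")
listen_backlog = 512  # assumption: current nsslapd-listen-backlog-size

print("net.core.somaxconn           =", somaxconn)
print("net.ipv4.tcp_max_syn_backlog =", syn_backlog)
print("effective ns-slapd backlog  ~=", min(listen_backlog, somaxconn))

So raising the sysctls only helps if nsslapd-listen-backlog-size is raised along with them.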

Thanks,

Gary


[Test-Announce] Fedora 41 is in final freeze

Hi all,

Today, 2024-10-15, is an important day on the Fedora Linux 41 schedule
[1], with significant cut-offs. Today we have the Final Freeze [2]
which started at 15:00 UTC. This means that only packages that fix
accepted blockers or freeze exception bugs [3][4][5] will be marked as
'stable' and included in the Final composes. Other builds will remain
in updates-testing until the Final release is approved, at which point
the Final freeze is lifted and packages can move to the 'updates'
repository. Pending updates will be pushed before the final release as
zero-day updates.

Regards,
Samyak Jain
Fedora Release Engineering

[1] https://fedorapeople.org/groups/schedule/f-41/f-41-key-tasks.html
[2] https://fedoraproject.org/wiki/Milestone_freezes
[3] https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process
[4] https://fedoraproject.org/wiki/QA:SOP_freeze_exception_bug_process
[5] https://qa.fedoraproject.org/blockerbugs/milestone/f41/final/buglist
--
_______________________________________________
test-announce mailing list -- test-announce@lists.fedoraproject.org
To unsubscribe send an email to test-announce-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/test-announce@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue

Monday, October 14, 2024

[389-users] Re: Inconsistent Ldap connection issues

Ah yes, of course. Here is one run of their web app load test; it is 6 minutes long, and it should mostly be only the test itself. I will start looking for

We encountered 2 "Can not contact LDAP server" errors during this run.


These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log? 


After the run I bumped these up from 4096:

net.ipv4.tcp_max_syn_backlog = 6144
net.core.somaxconn = 6144

Yet we still get the LDAP errors (this one and the StartTLS request error previously mentioned).

Should I bump up nsslapd-listen-backlog-size, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn further?


We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?



-- 
Sincerely,

William Brown

Senior Software Engineer,
Identity and Access Management
SUSE Labs, Australia

[Bug 2318638] New: Missing Georgian translation for WxWidgets

https://bugzilla.redhat.com/show_bug.cgi?id=2318638

Bug ID: 2318638
Summary: Missing Georgian translation for WxWidgets
Product: Fedora Localization
Status: NEW
Component: l10n-requests
Assignee: noriko.mizumoto@gmail.com
Reporter: temuri.doghonadze@gmail.com
CC: diegobz@gmail.com, dimitris@glezos.com,
fedora@couf.be, piotrdrag@gmail.com,
trans@lists.fedoraproject.org
Target Milestone: ---
Classification: Fedora



Description of problem:

In git, there is a ka.po for 3.2.6 at this URL:
https://github.com/wxWidgets/wxWidgets/tree/v3.2.6/locale
However, I cannot see the file in the wxGTK-i18n RPM package. Is there a reason
for this? Is there anything I can do to get the file into the RPM?

Version-Release number of selected component (if applicable):
WxGTK-3.2.6

How reproducible:

install WxGTK-i18n package, check for wx*.po in
/usr/share/locale/ka/LC_MESSAGES


--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.redhat.com/show_bug.cgi?id=2318638


--
_______________________________________________
trans mailing list -- trans@lists.fedoraproject.org
To unsubscribe send an email to trans-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/trans@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue

[Bug 2318617] New: Cracklib lacking Georgian translation

https://bugzilla.redhat.com/show_bug.cgi?id=2318617

Bug ID: 2318617
Summary: Cracklib lacking Georgian translation
Product: Fedora Localization
Status: NEW
Component: l10n-requests
Assignee: noriko.mizumoto@gmail.com
Reporter: temuri.doghonadze@gmail.com
CC: diegobz@gmail.com, dimitris@glezos.com,
fedora@couf.be, piotrdrag@gmail.com,
trans@lists.fedoraproject.org
Target Milestone: ---
Classification: Fedora



Description of problem:
There is no localization file for Georgian, even in the source RPM, while git has it:
https://github.com/cracklib/cracklib/tree/v2.9.11/src/po

Version-Release number of selected component (if applicable):
cracklib 2.9.11-6


How reproducible:


Steps to Reproduce:
1. Install cracklib, check for cracklib.mo in /usr/share/locale/ka/LC_MESSAGES

Actual results:
no translation

Expected results:


Additional info:


--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.redhat.com/show_bug.cgi?id=2318617


--
_______________________________________________
trans mailing list -- trans@lists.fedoraproject.org
To unsubscribe send an email to trans-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/trans@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue

[389-users] Re: Inconsistent Ldap connection issues

JC: There has been a lot of work on connection management around this version; the "accept" thread that is responsible for polling the ldap/ldaps ports for connection attempts was introduced in a previous version. This thread will fail to accept a new connection when the number of existing connections hits a certain limit. However, when this happens you should see this message in the error log: "Not listening for new connections - too many fds open", so IMHO the error message you see is related to something else.
I have the 6-minute log from when the event occurred, but I am not seeing "Not listening" or "fds" in the logs.

JC: I failed to reproduce this issue locally with the same version and file descriptor limit (8192) and 10k connections. So I am not aware of anything else that might cause this error message.
How are you testing this? Can you share your load-testing script? I have used the "ldclt" tool, and I can see it does things, but based on the output I think it's doing adds/deletes/modifies/searches, and I can't tell how many connects and disconnects it makes. If I had a script to emulate this, I could either hand it to the developer or tweak their bonsai code myself.
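
For reference, a minimal Python sketch of the kind of script that could emulate the observed pattern (StartTLS, bind, a handful of searches, unbind, paced over a ramp), assuming the ldap3 library; the host, bind DN, password, and base DN are placeholders, and certificate checking is disabled purely for the sketch:

import ssl
import time
from ldap3 import Server, Connection, Tls
from ldap3.core.exceptions import LDAPExceptionError

HOST = "ldap.example.edu"                 # placeholder
BIND_DN = "uid=loadtest,dc=example"       # placeholder
PASSWORD = "secret"                       # placeholder
BASE = "dc=example"                       # placeholder

tls = Tls(validate=ssl.CERT_NONE)         # no cert checking, for the sketch only
server = Server(HOST, port=389, tls=tls)

errors = 0
for i in range(500):                      # 500 simulated logins
    try:
        conn = Connection(server, BIND_DN, PASSWORD, auto_bind=False)
        conn.open()
        conn.start_tls()                  # matches the StartTLS extended ops in logconv
        conn.bind()
        for _ in range(7):                # ~6-8 SRCH per connection, per the access log pattern
            conn.search(BASE, "(uid=loadtest*)", attributes=["cn"])
        conn.unbind()
    except LDAPExceptionError as exc:
        errors += 1
        print(f"connection {i}: {exc}")
    time.sleep(60 / 500)                  # spread the 500 logins over a ~60 s ramp
print(f"client-side errors: {errors}")

Driving this from a few parallel workers would approximate the web app's concurrency, and the client-side error count can then be lined up against the access log for the same window.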

Thanks so much,

Gary


On 10/9/24 09:24, James Chapman wrote:


On Wed, Oct 9, 2024 at 11:36 AM Joe Fletcher <jfletche@sscinc.com> wrote:

Just a thought: what are the open files limits set to on the system?

 

I've seen plenty of cases where increasing descriptor limits can help in these instances.

 

 

 

From: Thierry Bordaz via 389-users <389-users@lists.fedoraproject.org>
Sent: Wednesday, October 9, 2024 10:30 AM
To: Gary Waters <gwaters@caltech.edu>; General discussion list for the 389 Directory server project. <389-users@lists.fedoraproject.org>; Marc Sauton <msauton@redhat.com>; James Chapman <jachapma@redhat.com>
Cc: gwaters-web@caltech.edu; Thierry Bordaz <tbordaz@redhat.com>
Subject: [389-users] Re: Inconsistent Ldap connection issues

 


 

Hi Gary,

My understanding is that a single client opens ~10K connections in ~6 min, an average of 26 connections per second. It creates a set of established connections that spikes up to 961 parallel connections. The problem is that from time to time the client reports 'can not contact LDAP server', and while this happens on 2.4, it did not occur on 1.4.3. Am I correct?

My first guess is that Marc is correct and TCP parameter tuning is the culprit. That does not explain why it occurs only in 2.4, but maybe you were running 1.4.3 on RHEL 7 and are running 2.4 on RHEL 8.

2.4 contains significant changes in that area. I do not recall all the commits on the connection-handling side, but AFAIK what is "new" is a dedicated listener thread (polling ports 389/636/ldapi for new connections) and multiple polling threads (polling the established connections).
My understanding is that your problem impacts the listener thread, as it is responsible for accepting new connections, and that is the new problem you see in 2.4. @james did I miss something?

JC: There has been a lot of work on connection management around this version; the "accept" thread that is responsible for polling the ldap/ldaps ports for connection attempts was introduced in a previous version. This thread will fail to accept a new connection when the number of existing connections hits a certain limit. However, when this happens you should see this message in the error log: "Not listening for new connections - too many fds open", so IMHO the error message you see is related to something else.

If the culprit is not in TCP tuning, @james do you remember known bugs that would limit the listener thread?

JC: I failed to reproduce this issue locally with the same version and file descriptor limit (8192) and 10k connections. So I am not aware of anything else that might cause this error message.

best regards
thierry

On 10/8/24 7:36 PM, Gary Waters wrote:

Hi Thierry,

Ah yes, of course. Here is one run of their web app load test; it is 6 minutes long, and it should mostly be only the test itself. I will start looking for

We encountered 2 "Can not contact LDAP server" errors during this run; the logconv output is below.

----------- Access Log Output ------------

Start of Logs:    08/Oct/2024:09:53:35.810833927
End of Logs:      08/Oct/2024:09:59:52.361830449

Processed Log Time:  0 Hours, 6 Minutes, 16.550998016 Seconds

Restarts:                      1
Secure Protocol Versions:
  - TLS1.2 128-bit AES-GCM (9833 connections)

Peak Concurrent Connections:   689
Total Operations:              86412
Total Results:                 86412
Overall Performance:           100.0%

Total Connections:             9933          (26.38/sec)  (1582.73/min)
 - LDAP Connections:           9933          (26.38/sec)  (1582.73/min)
 - LDAPI Connections:          0             (0.00/sec)  (0.00/min)
 - LDAPS Connections:          0             (0.00/sec)  (0.00/min)
 - StartTLS Extended Ops:      9833          (26.11/sec)  (1566.80/min)

Searches:                      66647         (176.99/sec)  (10619.60/min)
Modifications:                 0             (0.00/sec)  (0.00/min)
Adds:                          0             (0.00/sec)  (0.00/min)
Deletes:                       0             (0.00/sec)  (0.00/min)
Mod RDNs:                      0             (0.00/sec)  (0.00/min)
Compares:                      0             (0.00/sec)  (0.00/min)
Binds:                         9932          (26.38/sec)  (1582.57/min)

Average wtime (wait time):     0.001407368
Average optime (op time):      0.003186859
Average etime (elapsed time):  0.004591048

Multi-factor Authentications:  0
Proxied Auth Operations:       0
Persistent Searches:           0
Internal Operations:           0
Entry Operations:              0
Extended Operations:           9833
Abandoned Requests:            0
Smart Referrals Received:      0

VLV Operations:                0
VLV Unindexed Searches:        0
VLV Unindexed Components:      0
SORT Operations:               0

Entire Search Base Queries:    0
Paged Searches:                0
Unindexed Searches:            0
Unindexed Components:          0
Invalid Attribute Filters:     0
FDs Taken:                     9933
FDs Returned:                  9932
Highest FD Taken:              961

Broken Pipes:                  0
Connections Reset By Peer:     0
Resource Unavailable:          0
Max BER Size Exceeded:         0

Binds:                         9932
Unbinds:                       9225
-----------------------------------
 - LDAP v2 Binds:              0
 - LDAP v3 Binds:              9932
 - AUTOBINDs(LDAPI):           0
 - SSL Client Binds:           0
 - Failed SSL Client Binds:    0
 - SASL Binds:                 0
 - Directory Manager Binds:    0
 - Anonymous Binds:            99

================

After this run I bumped these up from 4096:

net.ipv4.tcp_max_syn_backlog = 6144
net.core.somaxconn = 6144

Yet we still get the LDAP errors (this one and the StartTLS request error previously mentioned).

Should I bump up nsslapd-listen-backlog-size, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn further?

On 10/8/24 03:47, Thierry Bordaz wrote:

Hi Gary,

I meant that the access logs covered 5 hours. It would be helpful to capture/focus on the logs from the few minutes before/after the time when the problem occurred. Then check from those limited logs whether there is a pattern or unexpected events (long operation, no operation, abandon, ...).
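
A rough Python sketch of that kind of slicing (the instance path and failure time are placeholders; the fractional seconds and timezone are ignored since a minute-level window is enough):

import re
from datetime import datetime, timedelta

LOG = "/var/log/dirsrv/slapd-INSTANCE/access"   # placeholder instance name
failure = datetime.strptime("08/Oct/2024:09:56:00", "%d/%b/%Y:%H:%M:%S")  # placeholder failure time
window = timedelta(minutes=3)

stamp = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})")
with open(LOG) as src, open("access.slice", "w") as out:
    for line in src:
        m = stamp.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
            if abs(ts - failure) <= window:
                out.write(line)

logconv can then be run on access.slice instead of the full log.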

best regards
thierry

On 10/7/24 7:37 PM, Gary Waters wrote:

Hi Thierry,

OK, I'll decrease the timeout to 15 seconds then.

Reducing the size of the logs will help.

Which log, and how do I do this?

Thanks, Marc and Thierry!

-Gary

On 10/7/24 00:26, Thierry Bordaz wrote:

Hi,

Those slapd_poll errors mean that the server was unable to send the PDU back to the client. It can occur if the client sends a request and does not read the results fast enough. The timeout is high, 30 s (30000); could it be that the problem is on the client side (the app)?

I suggest that you focus on the timestamp when the application reports a failure. Then look in the access/error logs from 1-3min before and after the time of the failure. Logconv from that limited scope will be more helpful than a global one.

The pattern looks to be: the app opens a connection, switches to a secure connection (start-tls), issues 6-8 SRCH, then closes. etime/wtime/optime look fine, but as they are averages (over 1M ops) they are not very helpful. Reducing the size of the logs will help.
I found the abandoned op interesting, as it is possibly related to a performance issue.

best regards
thierry

 

On 10/4/24 11:54 PM, Gary Waters via 389-users wrote:

Hi Marc,

I have set nsslapd-listen-backlog-size to 512.

For the ioblocktimeout, I increased it because of an error I was seeing:

[30/Sep/2024:16:26:55.987681019 -0700] - ERR - slapd_poll - (743) - Timed out
[30/Sep/2024:16:34:49.646922635 -0700] - ERR - slapd_poll - (568) - Timed out

Searching around suggested that I should increase the ioblocktimeout, so I bumped it up from 20000 to 30000.

Since then, those slapd_poll timed out errors have not occurred. Should I have changed something else?
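
To double-check which values are actually live, the cn=config entry can be read back over LDAP; a small sketch assuming the ldap3 Python library and placeholder Directory Manager credentials:

from ldap3 import Server, Connection

conn = Connection(Server("localhost", port=389),
                  "cn=Directory Manager", "password",   # placeholder password
                  auto_bind=True)
conn.search("cn=config", "(objectClass=*)", search_scope="BASE",
            attributes=["nsslapd-ioblocktimeout", "nsslapd-idletimeout",
                        "nsslapd-listen-backlog-size", "nsslapd-maxdescriptors"])
print(conn.entries[0])
conn.unbind()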

What should I increase these to?

net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096

Thanks so much for your help!

-Gary

On 10/4/24 11:55, Marc Sauton wrote:

tune up nsslapd-listen-backlog-size

and verify the net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are high enough ( sysctl -a )

possibly tune down the nsslapd-ioblocktimeout value

Thanks,

M.

 

On Fri, Oct 4, 2024 at 11:06 AM gwaters-web--- via 389-users <389-users@lists.fedoraproject.org> wrote:

Hello,

We are experiencing a new issue since we upgraded 389-ds-base from
1.4~ish to 2.0.15 on RHEL 8. I couldn't figure out how to fix it, so I
switched to RHEL 9 and am now on 2.4.5-9.

The issue occurs during a performance load test of a web application. The
app logs into a website and does some things that search against LDAP,
and does some transactions. This app has been performing fine for years.
The app has changed, so it could be something there, but I am not sure
about that because of the percentage of the traffic that is successful.

The errors for the web app are "Can't contact Ldap Server" and sometimes
"Can't contact LDAP server. Start TLS request accepted.Server willing to
negotiate SSL. (0xFFFF [-1])". Out of the 128k connections below, these
errors happen maybe 5 or 6 times, so it's wildly inconsistent and random.

I did a logconv analysis of 6 hours from a day of testing; see below.
One thing that really stood out to me was the peak concurrent
connections = 22. That peak is so low that I don't know how these errors
are happening.

I don't see any errors in the access log (grepping for err=1).
I looked for cache warnings/errors in the access/errors logs, but didn't
find any. I don't see things like unavailable connections in the access logs.
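
Rather than grepping for err=1 only, a quick tally of every err= code in the access log can confirm whether anything other than err=0 is being returned at all; a small Python sketch (the log path is a placeholder):

import re
from collections import Counter

codes = Counter()
with open("/var/log/dirsrv/slapd-INSTANCE/access") as f:   # placeholder path
    for line in f:
        m = re.search(r"\berr=(\d+)", line)
        if m:
            codes[int(m.group(1))] += 1

for code, count in sorted(codes.items()):
    print(f"err={code}: {count}")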

Suggestions on what to change or look for in the logs?

Thanks,
Gary


Information:
Machine size: 16G of RAM, 4-core AMD (it's an EC2 m5.large, gp3 disk type)

kernel:
Linux  5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
packages:
389-ds-base-libs-2.4.5-9.el9_4.x86_64
389-ds-base-2.4.5-9.el9_4.x86_64

single instance of dirsrv running
dirsrv modifications from default:

nsslapd-logging-backend: dirsrv-log,syslog
nsslapd-maxdescriptors: 8192
nsslapd-listen-backlog-size: 256
nsslapd-allow-hashed-passwords: on
nsslapd-idletimeout: 30
nsslapd-ioblocktimeout: 30000
nsslapd-sizelimit: -1
nsslapd-auditlog-logging-enabled: off
nsslapd-lookthroughlimit: -1

dirsrv.systemd:
LimitNOFILE=8192

 >Total Log Lines Analyzed:  2694287
 >
 >
 >
 > ---------- Access Log Output ------------
 >
 > Start of Logs:    26/Sep/2024:10:07:32.089983378
 > End of Logs:      26/Sep/2024:15:54:29.895403688
 >
 > Processed Log Time:  5 Hours, 46 Minutes, 57.805426688 Seconds
 >
 > Restarts:                      0
 > Secure Protocol Versions:
 >   - TLS1.2 128-bit AES-GCM (123117 connections)
 >
 > Peak Concurrent Connections:   22
 > Total Operations:              1097043
 > Total Results:                 1097044
 > Overall Performance:           100.0%
 >
 > Total Connections:             128646        (6.18/sec) (370.78/min)
 >  - LDAP Connections:           128646        (6.18/sec) (370.78/min)
 >  - LDAPI Connections:          0             (0.00/sec) (0.00/min)
 >  - LDAPS Connections:          0             (0.00/sec) (0.00/min)
 >  - StartTLS Extended Ops:      123116        (5.91/sec) (354.84/min)
 >
 > Searches:                      845279        (40.60/sec) (2436.22/min)
 > Modifications:                 0             (0.00/sec) (0.00/min)
 > Adds:                          0             (0.00/sec) (0.00/min)
 > Deletes:                       0             (0.00/sec) (0.00/min)
 > Mod RDNs:                      0             (0.00/sec) (0.00/min)
 > Compares:                      0             (0.00/sec) (0.00/min)
 > Binds:                         128647        (6.18/sec) (370.78/min)
 >
 > Average wtime (wait time):     0.001560856
 > Average optime (op time):      0.003310453
 > Average etime (elapsed time):  0.004868040
 >
 > Multi-factor Authentications:  0
 > Proxied Auth Operations:       0
 > Persistent Searches:           0
 > Internal Operations:           0
 > Entry Operations:              0
 > Extended Operations:           123116
 > Abandoned Requests:            1
 > Smart Referrals Received:      0
 >
 > VLV Operations:                0
 > VLV Unindexed Searches:        0
 > VLV Unindexed Components:      0
 > SORT Operations:               0
 >
 > Entire Search Base Queries:    0
 > Paged Searches:                0
 > Unindexed Searches:            0
 > Unindexed Components:          0
 > Invalid Attribute Filters:     0
 > FDs Taken:                     128646
 > FDs Returned:                  129318
 > Highest FD Taken:              968
 >
 > Broken Pipes:                  0
 > Connections Reset By Peer:     0
 > Resource Unavailable:          0
 > Max BER Size Exceeded:         0
 >
 > Binds:                         128647
 > Unbinds:                       119206
 > -------------------------------------
 >  - LDAP v2 Binds:              0
 >  - LDAP v3 Binds:              128647
 >  - AUTOBINDs(LDAPI):           0
 >  - SSL Client Binds:           0
 >  - Failed SSL Client Binds:    0
 >  - SASL Binds:                 0
 >  - Dir

--
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue



[389-users] Re: Inconsistent Ldap connection issues

cat /proc/sys/fs/file-max
9223372036854775807

The file descriptor limit of dirsrv is 8192. Should I bump this up too? I just used what was on the 389 website.
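
To confirm what the running ns-slapd actually got (LimitNOFILE and nsslapd-maxdescriptors both feed into this), the process limits can be read directly; a small Python sketch assuming a single instance:

import subprocess

pid = subprocess.check_output(["pidof", "ns-slapd"]).split()[0].decode()
with open(f"/proc/{pid}/limits") as f:
    for line in f:
        if line.startswith("Max open files"):
            print(line.rstrip())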

Thanks,

Gary

On 10/9/24 03:06, Joe Fletcher wrote:

Just a thought: what are the open files limits set to on the system?

 

I've seen plenty of cases where increasing descriptor limits can help in these instances.

 

 

 

--
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue