Intermittent Login Failures

3 minute read

Over the past several weeks I’ve been working on an issue that I think all IT professionals dread, an intermittent issue. The specific scenario in this case was that the application was experiencing an error code 17830 at random intervals. After working on and off on the issue I was able to work with a member of the team to identify an exact timestamp of one of the issues.

Once I had this event in hand something struck me that SQL Server since at least 2012 always has an extended event running on the server that writes to both a ring buffer and file. This combination will allow CSS to perform diagnostics around specific important event classes. I’ve used these in the past when approaching a server cold where monitoring tools are unavailable or not granular enough to begin diagnosing an issue but mostly for CPU and memory pressure.

When I opened the system health event file on SQL Server 2017, I discovered that any severity 20 or higher is captured. Login failures, specifically the error code that I was dealing with 17830 is a severity 20. I now had something that was capturing at least some history for me already. What I didn’t realize was just how much of a treasure trove of data was available here.

Should you be making use of resource governor or login triggers this health event could prove quite useful in debugging if your classifier function or trigger is delaying your login as there is a specific breakdown for the time there.

If you view the file data through SSMS I experienced some of the data being masked such as the last octet of client IP addresses. By going directly to the ring buffer or file this masking can be bypassed.

This provides some very helpful details such as the error code passed to the client along with where the time was spent on the login. Here is just a sample of information you can gather from the System Health event for this error:

  • Source IP Address and Port
  • Destination IP Address and Port
  • Client Error Code(s)
  • Client Error State Information *
  • Time Spent During Login
    • Total Time
    • TLS/SSL Time
    • Network Time (Inbound/Outbound)
    • SSPI Time
    • Login Trigger Execution Time
    • Classifier Function Execution Time
    • Find Login Time
    • Session Recovery Time

The top two items in the list can prove invaluable if your network team needs to become involved. This can be used for route tracing purposes along with the timestamps to help determine if there was a specific issue at that time. Some networks will track dropped traffic and with source/destination information this can allow for better diagnostics should there be a bad route.

There are also cases where the client connection may not be able to report the error received or the logs may not have this information available. With the ability to diagnose on the SQL Server end what the client received this can aid in troubleshooting in cases where limited information is available.

For the specific case that I was working with I made heavy use of the timings. What I discovered was that I was able to correlate this information to other perfmon counters. This helped in narrowing the cause towards NIC drivers being problematic

Below is a version of the query used for diagnostics against the ring_buffer. With some effort this could be modified to use the file target, though the XML structure for the file target is quite different.

select a.* FROM
(select
x.value('(//Record/ConnectivityTraceRecord/RecordType)[1]', 'varchar(30)') AS [RecordType],
x.value('(//Record/ConnectivityTraceRecord/Spid)[1]', 'int') AS [Spid], 
x.value('(//Record/ConnectivityTraceRecord/OSError)[1]', 'int') AS [OSError],
x.value('(//Record/ConnectivityTraceRecord/SniConsumerError)[1]', 'int') AS [SniConsumerError],
x.value('(//Record/ConnectivityTraceRecord/SniProvider)[1]', 'int') AS [SniProvider],
x.value('(//Record/ConnectivityTraceRecord/State)[1]', 'int') AS [State], 
convert(datetime2(3), x.value('(//Record/ConnectivityTraceRecord/RecordTime)[1]', 'datetime')) at time zone 'UTC' at time zone 'Eastern Standard Time' AS [RecordTime],
x.value('(//Record/ConnectivityTraceRecord/RemoteHost)[1]', 'varchar(40)') AS [RemoteHost],
x.value('(//Record/ConnectivityTraceRecord/RemotePort)[1]', 'int') AS [RemotePort],
x.value('(//Record/ConnectivityTraceRecord/LocalHost)[1]', 'varchar(40)') AS [LocalHost],
x.value('(//Record/ConnectivityTraceRecord/LocalPort)[1]', 'int') AS [LocalPort],
x.value('(//Record/ConnectivityTraceRecord/TdsBufInfo/InputBufError)[1]', 'int') AS [TdsInputBufferError],
x.value('(//Record/ConnectivityTraceRecord/TdsBufInfo/OutputBufError)[1]', 'int') AS [TdsOutputBufferError],
x.value('(//Record/ConnectivityTraceRecord/TdsBufInfo/InputBufBytes)[1]', 'int') AS [TdsInputBufferBytes],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TotalTime)[1]', 'int') AS [TotalLoginTimeInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/EnqueueTime)[1]', 'int') AS [LoginTaskEnqueuedInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/NetWritesTime)[1]', 'int') AS [NetworkWritesInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/NetReadsTime)[1]', 'int') AS [NetworkReadsInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Ssl/TotalTime)[1]', 'int') AS [SslTotalProcessingInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Ssl/NetReadsTime)[1]', 'int') AS [SslNetReadsInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Ssl/NetWritesTime)[1]', 'int') AS [SslNetWritesInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Ssl/SecureCallsTime)[1]', 'int') AS [SslSecureCallsInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Ssl/EnqueueTime)[1]', 'int') AS [SslEnqueueInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Sspi/TotalTime)[1]', 'int') AS [SspiProcessingInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Sspi/NetReadsTime)[1]', 'int') AS [SspiNetReadsInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Sspi/NetWritesTime)[1]', 'int') AS [SspiNetWritesInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Sspi/SecureCallsTime)[1]', 'int') AS [SspiSecureCallsInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/Sspi/EnqueueTime)[1]', 'int') AS [SspiEnqueueInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TriggerAndResGovTime/TotalTime)[1]', 'int') AS [LoginTriggerAndResourceGovernorProcessingInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TriggerAndResGovTime/FindLogin)[1]', 'int') AS [LoginTriggerAndResourceGovernorFindLoginInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TriggerAndResGovTime/LogonTriggers)[1]', 'int') AS [LoginTriggerAndResourceGovernorLogonTriggersInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TriggerAndResGovTime/ExecClassifier)[1]', 'int') AS [LoginTriggerAndResourceGovernorExecClassifierInMilliseconds],
x.value('(//Record/ConnectivityTraceRecord/LoginTimersInMilliseconds/TriggerAndResGovTime/SessionRecover)[1]', 'int') AS [LoginTriggerAndResourceGovernorSessionRecoverInMilliseconds],
x.query('/') as EventXML
FROM (SELECT CAST (record as xml) FROM sys.dm_os_ring_buffers 
WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY') AS R(x)) a
where a.RecordType = 'LoginTimers'
order by a.recordtime

* If someone has a reference for where one can find details about the state information returned in the error code, I am very interested in this. During my diagnostics I observed a few different states returned but was unable to locate any documentation about this.