Skip to content

100% CPU spin on CancelAllRequests #1437

@raeanne-marks

Description

@raeanne-marks

Describe your environment. macOS

Steps to reproduce.
Customer hit it randomly, I have their spindump.

What is the expected behavior?
Not 100% CPU :)

What is the actual behavior?
100% CPU usage due to spin yield-loop in CancelAllRequests

Additional context.

Two problems:

  1. No lock on m_httpCallbacks — it reads m_httpCallbacks.empty() without holding m_httpCallbacksMtx, but every other access to m_httpCallbacks (in handleSendRequest and onHttpResponse) is guarded by that mutex. This is a data race (undefined behavior). On arm64 especially, the thread may never see the list become empty due to store/load ordering.

  2. The yield-loop can spin forever — CancelAllRequests() tells the HTTP client to cancel, but the callbacks are removed from m_httpCallbacks in onHttpResponse(), which is scheduled asynchronously via scheduleOnHttpResponse() → scheduleTask(). If the task dispatcher is blocked or shut down (e.g., PauseTransmission has stopped dispatching), the response handler never runs, m_httpCallbacks never empties, and this loop spins indefinitely — which is exactly what the spindump shows.

The call chain in the spindump confirms this: PauseTransmission() calls into cancelAllRequests(), but pausing transmission likely also stops or stalls the task dispatcher that would normally run onHttpResponse to drain the callback list. The yield-loop has no timeout or fallback, so it burns CPU forever.

A fix would need to either:
Use a condition variable signaled from onHttpResponse instead of polling
Add a timeout to the loop and forcibly clear callbacks after it expires
Ensure onHttpResponse can still execute even when transmission is paused (or drain synchronously)

Spindump:

CPU Time:         9.756s (42.6G cycles, 143.3G instructions, 0.30c/i, 4.37GHz avg)
Num threads:      7
Note:             2 idle work queue threads omitted

  Thread 0x3b1aee    1001 samples (1-1001)    priority 0-20 (base 20)    cpu time 9.755s (42.6G cycles, 143.3G instructions, 0.30c/i, 4.37GHz avg)
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]
<redacted>

        1001  sense::Telemetry::OneDsTelemetryClient::PauseTransmission() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/libtelemetry/1ds/OneDsTelemetryClient.cpp:348) [0x104c99f64]
          1001  Microsoft::Applications::Events::LogManagerImpl::PauseTransmission() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/LogManagerImpl.cpp:461) [0x104be8344]
            1001  std::__1::__function::__func<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2, std::__1::allocator<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2>, bool ()>::operator()() (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__functional/function.h:313) [0x104bda300]
              996   Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/http/HttpClientManager.cpp:153) [0x104c01174]
                801   swtch_pri + 8 (libsystem_kernel.dylib + 3208) [0x185a87c88] (running)
                193   swtch_pri + 8 (libsystem_kernel.dylib + 3208) [0x185a87c88]
                 *94    ??? (kernel.release.t6050 + 24180) [0xfffffe000b3dde74] (running)
                 *62    ??? (kernel.release.t6050 + 24180) [0xfffffe000b3dde74]
                   *59    ??? (kernel.release.t6050 + 2149104) [0xfffffe000b5e4af0] (running)
                   *3     ??? (kernel.release.t6050 + 2150656) [0xfffffe000b5e5100]
                     *2     ??? (kernel.release.t6050 + 2122228) [0xfffffe000b5de1f4]
                       *2     ??? (kernel.release.t6050 + 843264) [0xfffffe000b4a5e00]
                         *2     ??? (kernel.release.t6050 + 842980) [0xfffffe000b4a5ce4] (running)
                     *1     ??? (kernel.release.t6050 + 2122276) [0xfffffe000b5de224] (running)
                 *21    ??? (kernel.release.t6050 + 843532) [0xfffffe000b4a5f0c] (runnable)
                 *15    ??? (kernel.release.t6050 + 27000) [0xfffffe000b3de978] (running)
                 *1     ??? (kernel.release.t6050 + 27000) [0xfffffe000b3de978]
                   *1     ??? (kernel.release.t6050 + 843632) [0xfffffe000b4a5f70]
                     *1     ??? (kernel.release.t6050 + 2168000) [0xfffffe000b5e94c0]
                       *1     ??? (kernel.release.t6050 + 2148624) [0xfffffe000b5e4910]
                         *1     ??? (kernel.release.t6050 + 26064) [0xfffffe000b3de5d0]
                           *1     ??? (kernel.release.t6050 + 316424) [0xfffffe000b425408] (running)
                2     cthread_yield + 36 (libsystem_pthread.dylib + 15400) [0x185ac8c28] (running)
              3     cthread_yield + 44 (libsystem_pthread.dylib + 15408) [0x185ac8c30] (running)
              2     Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/http/HttpClientManager.cpp:152) [0x104c01168] (running)


  Thread 0x3b1aec    1001 samples (1-1001)    priority 20 (base 20)    last ran 1011.253s ago
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(void*), void*>>(void*) (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__thread/thread.h:209) [0x104c20c58]
        1001  Microsoft::Applications::Events::PlatformAbstraction::WorkerThread::threadFunc(void*) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/pal/WorkerThread.cpp:0) [0x104c20410]
          1001  Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/./pal/WorkerThread.hpp:55) [0x104c20924]
            1001  bool std::__1::condition_variable::wait_for[abi:ne190102]<long long, std::__1::ratio<1l, 1000l>, Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const::'lambda0'()>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l>> const&, Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const::'lambda0'()) (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__condition_variable/condition_variable.h:205) [0x104c20a54]
              1001  std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 104 (libc++.1.dylib + 26864) [0x1859e18f0]
                1001  <patched truncated backtrace>
                  1001  __psynch_cvwait + 8 (libsystem_kernel.dylib + 17676) [0x185a8b50c]
                   *1001  psynch_cvcontinue + 0 (pthread + 17948) [0xfffffe000b316abc]

  Thread 0x3b1aed    1001 samples (1-1001)    priority 20 (base 20)    last ran 22618.041s ago
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]
< redacted >
        1001  sense::Telemetry::OneDsTelemetryClient::LogEvent(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/libtelemetry/1ds/OneDsTelemetryClient.cpp:344) [0x104c99de8]
          1001  Microsoft::Applications::Events::Logger::LogEvent(Microsoft::Applications::Events::EventProperties const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/Logger.cpp:334) [0x104bedfac]
            1001  Microsoft::Applications::Events::Logger::submit(CsProtocol::Record&, Microsoft::Applications::Events::EventProperties const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/Logger.cpp:589) [0x104bf15cc]
              1001  Microsoft::Applications::Events::LogManagerImpl::sendEvent(Microsoft::Applications::Events::IncomingEventContext* const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/LogManagerImpl.cpp:726) [0x104be8f10]
                1001  std::__1::recursive_mutex::lock() + 16 (libc++.1.dylib + 29536) [0x1859e2360]
                  1001  _pthread_mutex_firstfit_lock_slow + 216 (libsystem_pthread.dylib + 6368) [0x185ac68e0]
                    1001  <patched truncated backtrace>
                      1001  __psynch_mutexwait + 8 (libsystem_kernel.dylib + 14812) [0x185a8a9dc]
                       *1001  psynch_mtxcontinue + 0 (pthread + 9696) [0xfffffe000b314a80] (blocked by turnstile waiting for telemetryd_v2 [29655] [unique pid 129217] thread 0x3b1aee)

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions