[Android] In-Proc Crash Reporter#126916
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
Adds an opt-in, in-process crash reporter for Android CoreCLR that emits a createdump-shaped JSON crash report to logcat/stderr and optionally to a *.crashreport.json file, with VM-provided managed thread/stack/exception callbacks.
Changes:
- Introduces PAL-side crash report generation (JSON writer, module/process name helpers, crash report generator).
- Wires VM callbacks for managed thread enumeration/stack walking/exception extraction into the PAL crash reporter on Android startup.
- Integrates crash report triggering into the Android crash/abort signal path and adds Android-only build plumbing.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/crashreportstackwalker.h | Declares Android-only VM registration hook for crash report stack walking. |
| src/coreclr/vm/crashreportstackwalker.cpp | Implements VM callbacks for managed stack walking, thread enumeration, and exception extraction. |
| src/coreclr/vm/ceemain.cpp | Registers VM callbacks during Android EE startup. |
| src/coreclr/vm/CMakeLists.txt | Adds the new VM crashreport stack walker source to the build. |
| src/coreclr/pal/src/thread/process.cpp | Adds Android crash report initialization, enablement state, and crash-time generation call. |
| src/coreclr/pal/src/include/pal/process.h | Exposes PROCIsCrashReportEnabled() for signal-path gating. |
| src/coreclr/pal/src/exception/signal.cpp | Avoids duplicate managed stack logging when crash reporting is enabled. |
| src/coreclr/pal/src/crashreport/moduleenumerator.h | Declares helpers to resolve process/module names via /proc. |
| src/coreclr/pal/src/crashreport/moduleenumerator.cpp | Implements /proc/self/cmdline and /proc/self/maps parsing for crash-time module/process lookup. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.h | Declares crash report generator API and VM callback hooks. |
| src/coreclr/pal/src/crashreport/inproccrashreporter.cpp | Implements JSON crash report generation, logcat/stderr output, optional file output, and callback integration. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.h | Declares fixed-buffer JSON writer intended for signal-safe usage. |
| src/coreclr/pal/src/crashreport/crashjsonwriter.cpp | Implements the fixed-buffer JSON writer. |
| src/coreclr/pal/src/CMakeLists.txt | Adds Android-only PAL crashreport sources to the build. |
| } | ||
| else if (exceptionType != NULL && exceptionType[0] != '\0') | ||
| { | ||
| char hresultBuffer[32]; |
There was a problem hiding this comment.
Do we want to condition this if a native code fault occurs in the middle of managed exception processing? It seems like the managed throwable would still be on the thread. Would we want to keep the managed throwable for context?
There was a problem hiding this comment.
Looks like this comment have ended up in wrong place ?
There was a problem hiding this comment.
It might no longer be in that spot, but I think the scenario is a native fault while a managed throwable is pinned on the thread, leading to the crash report attributing the crash to that throwable.
| // sized for path-like and identifier-like strings (report paths, process | ||
| // name, type/class names). 32 is sized for a single hex-or-decimal integer | ||
| // formatted as a C string (addresses, thread IDs, hresults). | ||
| static constexpr size_t CRASHREPORT_STRING_BUFFER_SIZE = 256; |
There was a problem hiding this comment.
Not for this PR but might be worth considering if we should have some pre-allocated scratch buffer in crash reporter to avoid stack allocating this in signal handler.
Twelve fixes from lateralusX's review pass on 2026-04-29 plus a build-fix for the cross-components configuration: - Replace GetFilename's hand-rolled scan-for-last-separator loop with strrchr. Adds a nullptr guard for path. dotnet#126916 (comment) - Rename the 'copied' local in CopyString and GetVersionString to 'toCopy'. At the binding site the value is the count about to be copied, not bytes already copied. dotnet#126916 (comment) dotnet#126916 (comment) - Null-check 'writer' / 'ctx' in JsonFrameCallback, ThreadCallback, and FrameCallback. The void* -> typed-pointer cast at a callback boundary is a contract the type system can't enforce; defending it at the callback entry mirrors the existing pattern in ChunkCallback. dotnet#126916 (comment) dotnet#126916 (comment) (FrameCallback fix found via the R3160838988 audit; not separately flagged) - Add ThreadEnumerationContext::EnumerateThreads to wrap the enumerate callback dispatch + EndEnumeration into a single call site. Forgetting EndEnumeration would silently leave the JSON malformed; wrapping removes the obligation from the caller. dotnet#126916 (comment) - Move OnThread, OnFrame (ThreadEnumerationContext) and HandleChunk (CrashReportOutputContext) to the private: section. They're only called from their sibling static thunks; the thunks remain public for the function-pointer ABI. dotnet#126916 (comment) dotnet#126916 (comment) dotnet#126916 (comment) - Replace FormatHexValue's, FormatUnsignedDecimal's, and FormatSignedDecimal's upfront 'bufferSize < MAX_*_BUFFER_SIZE' check with a tighter check tied to the actual write footprint after the digit-extraction loop. Removes the dependency on the constants being correctly aligned with the loops' writes. dotnet#126916 (comment) dotnet#126916 (comment) - Null-check ctx in FrameCallbackAdapter. Same trust-boundary principle as the JsonFrameCallback / ThreadCallback / FrameCallback fixes above, applied to the JIT-stackwalker callback. dotnet#126916 (comment) - Document the GCX_COOP safety reasoning in CrashReportGetExceptionForThread. The transition is safe because the crashing thread holds the thread-store lock taken by CrashReportSuspendThreads, and RareDisablePreemptiveGC's fast-path skips the suspend-blocking check in that condition. dotnet#126916 (comment) - Make the exception-info capture explicit at the signal-handler entry point. InProcCrashReportSignalDispatcher now captures hasException/exTypeBuf/exHresult on the crashing thread and passes them to CreateReport as parameters, instead of CreateReport calling m_getExceptionCallback as its first step. CreateReport's signature grows three parameters; the dispatcher gains a friend declaration so it can read m_getExceptionCallback without exposing it via a public accessor. The behavior is unchanged (the capture was already happening early in CreateReport, on the crashing thread); the structure makes the capture-then-write contract visible at the architecture level rather than buried in CreateReport's body. dotnet#126916 (comment) - Gate the inproccrashreport library on `NOT CLR_CROSS_COMPONENTS_BUILD` in src/coreclr/debug/CMakeLists.txt and src/coreclr/dlls/mscoree/coreclr/CMakeLists.txt. The cross-components build (used to produce cross-arch JITs) does not build PAL, so attaching `inproccrashreport` to `coreclrpal` / `coreclr_static` fails configure with "target_sources called with non-existing target". Surfaced when running the full `build.cmd clr` flow on Android x64. The pre-allocated-scratch-buffer suggestion (dotnet#126916 (comment)) is captured as an explicit "Async-signal-safety hardening" follow-up in the PR description and is not addressed here per the reviewer's "not for this PR" note.
Android does not have a watchdog that proactively kills an app process hung inside its own crash signal handler. ANR detection is scoped to main-thread input responsiveness and does not unconditionally terminate the process; debuggerd's tombstone generation timeout only runs once the crashing thread chains the signal to default disposition, which never happens if the in-proc reporter wedges (e.g. on slow I/O, log backpressure, or a deadlocked thread store). Without a self-imposed deadline, a hung reporter leaves the process in zombie limbo until the user force-stops it. Add a configurable timeout watchdog for in-proc crash report generation so a hung reporter terminates the crashing process via SIGALRM instead of blocking indefinitely. The timeout is read from DOTNET_CrashReportTimeoutSeconds when the reporter is configured and defaults to 30 seconds when the variable is not set so the watchdog kicks in even when callers do not opt in explicitly. Set the value to 0 to disable the watchdog. When the watchdog fires it re-raises the original crash signal (SIGSEGV, SIGABRT, etc.) instead of relying on SIGALRM's default termination, so the system tombstone records the real death cause rather than masking it as "Alarm clock". This matters most in the exact case the watchdog was added for: when the crash report JSON did not finish flushing, the tombstone is the only post-mortem source. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three findings from the Copilot reviewer pass on 2026-04-30: - Move PAL_SetInProcCrashReportCallback next to the other PAL_Set*Callback setters in process.cpp and give it its own doc block. Previously it was defined just below the PROCCreateCrashDumpIfEnabled comment block, which left that comment block orphaned (documenting the function that came *after* PAL_SetInProcCrashReportCallback rather than the function that immediately followed it). dotnet#126916 (comment) - Gate crashreportstackwalker.cpp's inclusion in VM_SOURCES_WKS on NOT CLR_CROSS_COMPONENTS_BUILD. The crashreport object library is already gated this way in src/coreclr/debug/CMakeLists.txt and src/coreclr/dlls/mscoree/coreclr/CMakeLists.txt; without the matching gate here, a cross-components build with FEATURE_INPROC_CRASHREPORT would compile crashreportstackwalker.cpp but fail to link InProcCrashReportInitialize. dotnet#126916 (comment) - Detect truncation in CrashReportHelpers::ExpandDumpTemplate and return 0 instead of a partial length. The previous loop exit on pos + 1 == bufferSize with *pattern != '\0' silently produced a truncated path that could collide with the dump file or write the report to an unexpected location. Returning 0 makes BuildReportPath fail closed. dotnet#126916 (comment) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 2ff349b.
- Inline EndEnumeration into EnumerateThreads (no public-API users) - Use MAX_LONGPATH-sized buffers for crash report paths (was 256, too tight; kept CRASHREPORT_STRING_BUFFER_SIZE=256 for identifiers) - Add %e/%h/%t support to the dump-template expander (matches createdump's FormatDumpName); host name cached at Initialize since gethostname is not async-signal-safe - Capture crashing thread's exception state before suspending the EE so the throwable inspection runs in the thread's natural EE-live context, outside the suspended window - Remove the redundant getExceptionCallback path now that CrashReportEnumerateThreads forwards the captured exception data through threadCallback as the single source of truth - Normalize CopyString call style for m_processName (one line in both the Android /proc/self/cmdline and minipal_getexepath branches) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Android CoreCLR does not currently have the same out-of-proc
createdumpexperience available on other platforms. That makes native crashes, aborts, and mixed managed/native failures significantly harder to diagnose in real applications.This PR adds an opt-in in-proc crash reporter for Android CoreCLR that emits a JSON-formatted crash report modeled after
createdump'sCrashReportWriter. The report is structurally compatible with createdump's output (same envelope, same per-thread / per-frame fields, every value emitted as a quoted JSON string including booleans and numerics) so the same downstream tooling can consume both reports.There are two intentional schema deviations from createdump for Android in-proc reports:
signalinstead ofExceptionType. createdump emits a Windows-shapedExceptionTypefield, which is meaningless on Unix where the crash is identified by the POSIX signal number. The in-proc reporter emits the signal number directly (per earlier reviewer feedback). Tooling that branches on createdump'sExceptionTypeshould add asignal-aware path for Android reports.pidfield. createdump exposes the dumped process's PID because it runs out-of-process; the in-proc reporter runs inside the crashing process, so the PID is implicit and is already encoded in the report filename via the%ptemplate substitution.When enabled, the runtime writes a
*.crashreport.jsonfile directly from the crash path to the location derived fromDOTNET_DbgMiniDumpName.The long-term intent is for this reporting path to be async-signal-safe. This PR starts that work by making the lower-risk / low-hanging pieces fit that model where practical, while leaving the more complex runtime-state publication and hardening work to follow-up PRs.
Enabling the crash reporter
The crash reporter is opt-in, and requires both env vars to be set.
Enable JSON crash reporting
When configured this way, the runtime will write the report to:
DOTNET_DbgMiniDumpNameis required and is expected to be a writable absolute path. Template substitution is supported via%p(PID),%d(PID alias),%e(process / executable name),%h(host name, captured at startup), and%t(UNIX timestamp), matching createdump'sFormatDumpName. IfDOTNET_DbgMiniDumpNameis unset or the file cannot be opened, no report is written. The reporter does not synthesize a fallback path: open-codedTMPDIR/Path.GetTempPathresolution from a fatal-signal context is security-sensitive and intentionally left to the host (this matches createdump's existing contract for the same setting).Current design note
This PR is intentionally the first step, not the complete hardening story.
The payload shape and reporting path are designed with async-signal-safety as the target, but today the implementation is a mix of:
Follow-up PRs will continue hardening the remaining pieces so the implementation better matches the intended strict crash-path constraints.
Example crash report shapes
https://gist.github.com/mdh1418/31b36ef77d3c057ff061b92e28b527d7
Follow-up work
ucontext_tregister extraction, validation on each target, and host-side opt-in wiring.indefinitely when DbgMiniDumpName uses a %p template. Need a host-side rotation policy (cap on file count, age-based cleanup at startup, or single-slot overwrite) appropriate for sandboxed apps.
g_get_tmp_dir/ep_rt_temp_path_gethelpers. A shared async-signal-safe helper (matching the security model of the managedPath.GetTempPath) would let the reporter optionally fall back to a tempdir.getExceptionCallbackonly forSIGSEGV/SIGBUS. A native fault while a managed throwable is pinned on the thread (e.g., during an in-flightEX_THROW) can still surface asSIGABRT/SIGILL/SIGFPEand report a stale exception. A complete fix needs a TLS "currently dispatching" flag in the VM.