Skip to content

[Android] In-Proc Crash Reporter#126916

Open
mdh1418 wants to merge 58 commits intodotnet:mainfrom
mdh1418:inproc_crashreport_minimal
Open

[Android] In-Proc Crash Reporter#126916
mdh1418 wants to merge 58 commits intodotnet:mainfrom
mdh1418:inproc_crashreport_minimal

Conversation

@mdh1418
Copy link
Copy Markdown
Member

@mdh1418 mdh1418 commented Apr 14, 2026

Android CoreCLR does not currently have the same out-of-proc createdump experience available on other platforms. That makes native crashes, aborts, and mixed managed/native failures significantly harder to diagnose in real applications.

This PR adds an opt-in in-proc crash reporter for Android CoreCLR that emits a JSON-formatted crash report modeled after createdump's CrashReportWriter. The report is structurally compatible with createdump's output (same envelope, same per-thread / per-frame fields, every value emitted as a quoted JSON string including booleans and numerics) so the same downstream tooling can consume both reports.

There are two intentional schema deviations from createdump for Android in-proc reports:

  • signal instead of ExceptionType. createdump emits a Windows-shaped ExceptionType field, which is meaningless on Unix where the crash is identified by the POSIX signal number. The in-proc reporter emits the signal number directly (per earlier reviewer feedback). Tooling that branches on createdump's ExceptionType should add a signal-aware path for Android reports.
  • No root pid field. createdump exposes the dumped process's PID because it runs out-of-process; the in-proc reporter runs inside the crashing process, so the PID is implicit and is already encoded in the report filename via the %p template substitution.

When enabled, the runtime writes a *.crashreport.json file directly from the crash path to the location derived from DOTNET_DbgMiniDumpName.

The long-term intent is for this reporting path to be async-signal-safe. This PR starts that work by making the lower-risk / low-hanging pieces fit that model where practical, while leaving the more complex runtime-state publication and hardening work to follow-up PRs.

Enabling the crash reporter

The crash reporter is opt-in, and requires both env vars to be set.

Enable JSON crash reporting

DOTNET_EnableCrashReport=1
DOTNET_DbgMiniDumpName=/data/data/<package>/files/dotnet_crash_%p

When configured this way, the runtime will write the report to:

/data/data/<package>/files/dotnet_crash_<pid>.crashreport.json

DOTNET_DbgMiniDumpName is required and is expected to be a writable absolute path. Template substitution is supported via %p (PID), %d (PID alias), %e (process / executable name), %h (host name, captured at startup), and %t (UNIX timestamp), matching createdump's FormatDumpName. If DOTNET_DbgMiniDumpName is unset or the file cannot be opened, no report is written. The reporter does not synthesize a fallback path: open-coded TMPDIR / Path.GetTempPath resolution from a fatal-signal context is security-sensitive and intentionally left to the host (this matches createdump's existing contract for the same setting).

Current design note

This PR is intentionally the first step, not the complete hardening story.

The payload shape and reporting path are designed with async-signal-safety as the target, but today the implementation is a mix of:

  • crash-time logic that is already close to signal-safe
  • best-effort runtime inspection for richer managed diagnostics

Follow-up PRs will continue hardening the remaining pieces so the implementation better matches the intended strict crash-path constraints.

Example crash report shapes

https://gist.github.com/mdh1418/31b36ef77d3c057ff061b92e28b527d7

Follow-up work

  • Cross-platform support The reporter library and the PAL callback ABI are already platform-neutral; what remains is per-platform ucontext_t register extraction, validation on each target, and host-side opt-in wiring.
  • Emit the report to logcat / stderr in addition to the file. Useful when no writable file location is available or when capturing logs from CI / device farms.
  • Crash-report file lifecycle on Android. The app's private files/ directory has no OS-level rotation (unlike desktop where tmpwatch / systemd / a CI step typically prunes), so a crash-looping app can accumulate *.crashreport.json files
    indefinitely when DbgMiniDumpName uses a %p template. Need a host-side rotation policy (cap on file count, age-based cleanup at startup, or single-slot overwrite) appropriate for sandboxed apps.
  • Cross-runtime temp-directory helper. Mono / EventPipe also have ad-hoc g_get_tmp_dir / ep_rt_temp_path_get helpers. A shared async-signal-safe helper (matching the security model of the managed Path.GetTempPath) would let the reporter optionally fall back to a tempdir.
  • Misleading-throwable suppression. The reporter currently skips getExceptionCallback only for SIGSEGV/SIGBUS. A native fault while a managed throwable is pinned on the thread (e.g., during an in-flight EX_THROW) can still surface as SIGABRT/SIGILL/SIGFPE and report a stale exception. A complete fix needs a TLS "currently dispatching" flag in the VM.
  • Stress log integration. Dump the in-process stress log into the JSON report (and/or logcat) when enabled.
  • Async-signal-safety hardening of the remaining best-effort paths. Replace the remaining VM helpers in the crash path with strictly signal-safe equivalents (frame metadata lookup, IL offset resolution, etc.).

mdh1418 and others added 9 commits April 14, 2026 10:03
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 14, 2026 22:37
@github-actions github-actions Bot added the area-PAL-coreclr only for closed issues label Apr 14, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

@mdh1418 mdh1418 changed the title Inproc crashreport minimal [Android] In-Proc Crash Reporter Apr 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in, in-process crash reporter for Android CoreCLR that emits a createdump-shaped JSON crash report to logcat/stderr and optionally to a *.crashreport.json file, with VM-provided managed thread/stack/exception callbacks.

Changes:

  • Introduces PAL-side crash report generation (JSON writer, module/process name helpers, crash report generator).
  • Wires VM callbacks for managed thread enumeration/stack walking/exception extraction into the PAL crash reporter on Android startup.
  • Integrates crash report triggering into the Android crash/abort signal path and adds Android-only build plumbing.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/coreclr/vm/crashreportstackwalker.h Declares Android-only VM registration hook for crash report stack walking.
src/coreclr/vm/crashreportstackwalker.cpp Implements VM callbacks for managed stack walking, thread enumeration, and exception extraction.
src/coreclr/vm/ceemain.cpp Registers VM callbacks during Android EE startup.
src/coreclr/vm/CMakeLists.txt Adds the new VM crashreport stack walker source to the build.
src/coreclr/pal/src/thread/process.cpp Adds Android crash report initialization, enablement state, and crash-time generation call.
src/coreclr/pal/src/include/pal/process.h Exposes PROCIsCrashReportEnabled() for signal-path gating.
src/coreclr/pal/src/exception/signal.cpp Avoids duplicate managed stack logging when crash reporting is enabled.
src/coreclr/pal/src/crashreport/moduleenumerator.h Declares helpers to resolve process/module names via /proc.
src/coreclr/pal/src/crashreport/moduleenumerator.cpp Implements /proc/self/cmdline and /proc/self/maps parsing for crash-time module/process lookup.
src/coreclr/pal/src/crashreport/inproccrashreporter.h Declares crash report generator API and VM callback hooks.
src/coreclr/pal/src/crashreport/inproccrashreporter.cpp Implements JSON crash report generation, logcat/stderr output, optional file output, and callback integration.
src/coreclr/pal/src/crashreport/crashjsonwriter.h Declares fixed-buffer JSON writer intended for signal-safe usage.
src/coreclr/pal/src/crashreport/crashjsonwriter.cpp Implements the fixed-buffer JSON writer.
src/coreclr/pal/src/CMakeLists.txt Adds Android-only PAL crashreport sources to the build.

Comment thread src/coreclr/pal/src/thread/process.cpp Outdated
Comment thread src/coreclr/pal/src/thread/process.cpp Outdated
Comment thread src/coreclr/pal/src/crashreport/moduleenumerator.cpp Outdated
Comment thread src/coreclr/vm/crashreportstackwalker.cpp Outdated
Comment thread src/coreclr/debug/crashreport/crashjsonwriter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/vm/crashreportstackwalker.cpp Outdated
}
else if (exceptionType != NULL && exceptionType[0] != '\0')
{
char hresultBuffer[32];
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to condition this if a native code fault occurs in the middle of managed exception processing? It seems like the managed throwable would still be on the thread. Would we want to keep the managed throwable for context?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this comment have ended up in wrong place ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might no longer be in that spot, but I think the scenario is a native fault while a managed throwable is pinned on the thread, leading to the crash report attributing the crash to that throwable.

Comment thread src/coreclr/debug/crashreport/signalsafejsonwriter.cpp
Comment thread src/coreclr/vm/crashreportstackwalker.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.h Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/pal/src/crashreport/moduleenumerator.cpp Outdated
Comment thread src/coreclr/pal/src/exception/signal.cpp
Comment thread src/coreclr/debug/crashreport/crashjsonwriter.h Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
// sized for path-like and identifier-like strings (report paths, process
// name, type/class names). 32 is sized for a single hex-or-decimal integer
// formatted as a C string (addresses, thread IDs, hresults).
static constexpr size_t CRASHREPORT_STRING_BUFFER_SIZE = 256;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR but might be worth considering if we should have some pre-allocated scratch buffer in crash reporter to avoid stack allocating this in signal handler.

Comment thread src/coreclr/debug/crashreport/signalsafejsonwriter.cpp
Comment thread src/coreclr/debug/crashreport/signalsafejsonwriter.cpp
Comment thread src/coreclr/vm/crashreportstackwalker.cpp
Comment thread src/coreclr/vm/crashreportstackwalker.cpp
Comment thread src/coreclr/vm/crashreportstackwalker.cpp
mdh1418 and others added 2 commits April 29, 2026 12:11
Twelve fixes from lateralusX's review pass on 2026-04-29 plus a
build-fix for the cross-components configuration:

- Replace GetFilename's hand-rolled scan-for-last-separator loop with
  strrchr. Adds a nullptr guard for path.
  dotnet#126916 (comment)

- Rename the 'copied' local in CopyString and GetVersionString to 'toCopy'.
  At the binding site the value is the count about to be copied, not bytes
  already copied.
  dotnet#126916 (comment)
  dotnet#126916 (comment)

- Null-check 'writer' / 'ctx' in JsonFrameCallback, ThreadCallback, and
  FrameCallback. The void* -> typed-pointer cast at a callback boundary
  is a contract the type system can't enforce; defending it at the
  callback entry mirrors the existing pattern in ChunkCallback.
  dotnet#126916 (comment)
  dotnet#126916 (comment)
  (FrameCallback fix found via the R3160838988 audit; not separately flagged)

- Add ThreadEnumerationContext::EnumerateThreads to wrap the enumerate
  callback dispatch + EndEnumeration into a single call site. Forgetting
  EndEnumeration would silently leave the JSON malformed; wrapping
  removes the obligation from the caller.
  dotnet#126916 (comment)

- Move OnThread, OnFrame (ThreadEnumerationContext) and HandleChunk
  (CrashReportOutputContext) to the private: section. They're only
  called from their sibling static thunks; the thunks remain public for
  the function-pointer ABI.
  dotnet#126916 (comment)
  dotnet#126916 (comment)
  dotnet#126916 (comment)

- Replace FormatHexValue's, FormatUnsignedDecimal's, and
  FormatSignedDecimal's upfront 'bufferSize < MAX_*_BUFFER_SIZE' check
  with a tighter check tied to the actual write footprint after the
  digit-extraction loop. Removes the dependency on the constants being
  correctly aligned with the loops' writes.
  dotnet#126916 (comment)
  dotnet#126916 (comment)

- Null-check ctx in FrameCallbackAdapter. Same trust-boundary principle
  as the JsonFrameCallback / ThreadCallback / FrameCallback fixes above,
  applied to the JIT-stackwalker callback.
  dotnet#126916 (comment)

- Document the GCX_COOP safety reasoning in CrashReportGetExceptionForThread.
  The transition is safe because the crashing thread holds the
  thread-store lock taken by CrashReportSuspendThreads, and
  RareDisablePreemptiveGC's fast-path skips the suspend-blocking check
  in that condition.
  dotnet#126916 (comment)

- Make the exception-info capture explicit at the signal-handler entry
  point. InProcCrashReportSignalDispatcher now captures
  hasException/exTypeBuf/exHresult on the crashing thread and passes
  them to CreateReport as parameters, instead of CreateReport calling
  m_getExceptionCallback as its first step. CreateReport's signature
  grows three parameters; the dispatcher gains a friend declaration so
  it can read m_getExceptionCallback without exposing it via a public
  accessor. The behavior is unchanged (the capture was already happening
  early in CreateReport, on the crashing thread); the structure makes
  the capture-then-write contract visible at the architecture level
  rather than buried in CreateReport's body.
  dotnet#126916 (comment)

- Gate the inproccrashreport library on `NOT CLR_CROSS_COMPONENTS_BUILD`
  in src/coreclr/debug/CMakeLists.txt and src/coreclr/dlls/mscoree/coreclr/CMakeLists.txt.
  The cross-components build (used to produce cross-arch JITs) does not
  build PAL, so attaching `inproccrashreport` to `coreclrpal` /
  `coreclr_static` fails configure with "target_sources called with
  non-existing target". Surfaced when running the full
  `build.cmd clr` flow on Android x64.

The pre-allocated-scratch-buffer suggestion
(dotnet#126916 (comment)) is
captured as an explicit "Async-signal-safety hardening" follow-up in the
PR description and is not addressed here per the reviewer's "not for this
PR" note.
Android does not have a watchdog that proactively kills an app
process hung inside its own crash signal handler. ANR detection is
scoped to main-thread input responsiveness and does not
unconditionally terminate the process; debuggerd's tombstone
generation timeout only runs once the crashing thread chains the
signal to default disposition, which never happens if the in-proc
reporter wedges (e.g. on slow I/O, log backpressure, or a deadlocked
thread store). Without a self-imposed deadline, a hung reporter
leaves the process in zombie limbo until the user force-stops it.

Add a configurable timeout watchdog for in-proc crash report
generation so a hung reporter terminates the crashing process via
SIGALRM instead of blocking indefinitely. The timeout is read from
DOTNET_CrashReportTimeoutSeconds when the reporter is configured and
defaults to 30 seconds when the variable is not set so the watchdog
kicks in even when callers do not opt in explicitly. Set the value
to 0 to disable the watchdog.

When the watchdog fires it re-raises the original crash signal
(SIGSEGV, SIGABRT, etc.) instead of relying on SIGALRM's default
termination, so the system tombstone records the real death cause
rather than masking it as "Alarm clock". This matters most in the
exact case the watchdog was added for: when the crash report JSON
did not finish flushing, the tombstone is the only post-mortem
source.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 30, 2026 00:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comment thread src/coreclr/pal/src/thread/process.cpp Outdated
Comment thread src/coreclr/vm/CMakeLists.txt Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp
mdh1418 and others added 2 commits April 29, 2026 21:35
Three findings from the Copilot reviewer pass on 2026-04-30:

- Move PAL_SetInProcCrashReportCallback next to the other PAL_Set*Callback
  setters in process.cpp and give it its own doc block. Previously it was
  defined just below the PROCCreateCrashDumpIfEnabled comment block, which
  left that comment block orphaned (documenting the function that came
  *after* PAL_SetInProcCrashReportCallback rather than the function that
  immediately followed it).
  dotnet#126916 (comment)

- Gate crashreportstackwalker.cpp's inclusion in VM_SOURCES_WKS on
  NOT CLR_CROSS_COMPONENTS_BUILD. The crashreport object library is
  already gated this way in src/coreclr/debug/CMakeLists.txt and
  src/coreclr/dlls/mscoree/coreclr/CMakeLists.txt; without the matching
  gate here, a cross-components build with FEATURE_INPROC_CRASHREPORT
  would compile crashreportstackwalker.cpp but fail to link
  InProcCrashReportInitialize.
  dotnet#126916 (comment)

- Detect truncation in CrashReportHelpers::ExpandDumpTemplate and return
  0 instead of a partial length. The previous loop exit on
  pos + 1 == bufferSize with *pattern != '\0' silently produced a
  truncated path that could collide with the dump file or write the
  report to an unexpected location. Returning 0 makes BuildReportPath
  fail closed.
  dotnet#126916 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Copilot AI review requested due to automatic review settings May 4, 2026 15:15
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.h Outdated
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/vm/crashreportstackwalker.cpp
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp
Comment thread src/coreclr/debug/crashreport/inproccrashreporter.cpp Outdated
Comment thread src/coreclr/vm/crashreportstackwalker.cpp Outdated
- Inline EndEnumeration into EnumerateThreads (no public-API users)
- Use MAX_LONGPATH-sized buffers for crash report paths (was 256, too
  tight; kept CRASHREPORT_STRING_BUFFER_SIZE=256 for identifiers)
- Add %e/%h/%t support to the dump-template expander (matches
  createdump's FormatDumpName); host name cached at Initialize since
  gethostname is not async-signal-safe
- Capture crashing thread's exception state before suspending the EE
  so the throwable inspection runs in the thread's natural EE-live
  context, outside the suspended window
- Remove the redundant getExceptionCallback path now that
  CrashReportEnumerateThreads forwards the captured exception data
  through threadCallback as the single source of truth
- Normalize CopyString call style for m_processName (one line in both
  the Android /proc/self/cmdline and minipal_getexepath branches)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants