DRAFT: Pipeline: reformat to ascii #129

Open
oshaughn wants to merge 1 commit into oshaughn:rift_O4d from oshaughnessy-junior:rift_O4d_junior_ralph

@oshaughn oshaughn commented May 9, 2026

  • RIFT_HYPERPIPELINE_FORMAT in {1,true,yes,on} (case-insensitive) flips every executable in the loop to .dat; anything else (including unset) preserves legacy behaviour.

Second pass of the hyperpipeline-format work.  The first commit
standardised the ILE -> CIP per-shard ASCII chain (lnL composites);
this commit standardises the *return* trip -- the per-iteration
intrinsic-parameter grid CIP hands back to the next ILE iteration via
util_ParameterPuffball.py, util_FetchExternalGrid.py, and the join_grids
job in dag_utils.  When the env var RIFT_HYPERPIPELINE_FORMAT is set, the
entire create_event_parameter_pipeline_BasicIteration workflow now
operates on .dat files end-to-end, mirroring the file-naming and inline
shell glue conventions already used by create_eos_posterior_pipeline.
When the env var is unset, behaviour is byte-identical to the legacy
XML-based pipeline.

Motivation
----------

The first commit covered the ILE -> consolidate -> CIP read direction.
The other half of the iterative loop -- CIP write -> puff -> next ILE
read -- still routed through XML via lalsimutils.ChooseWaveformParams_array_to_xml
and lalsimutils.xml_to_ChooseWaveformParams_array.  Every intermediate
file was either *.xml.gz (overlap-grid-N, puffball-N, fetch-N,
output-grid-*-*) or required a `convert` job to bridge XML <-> ASCII for
downstream consumers.

create_eos_posterior_pipeline already demonstrates the target shape:
.dat files throughout, util_HyperCombine.py for join/consolidate, inline
shell scripts (`head -n 1 ... > $out; cat ... | sort -u | shuf >> $out`)
gluing stages together.  This commit rationalises BasicIteration onto
the same template behind the hyperpipeline env-var flag.

Activation and design choices
-----------------------------

Same env-var convention as the first commit: RIFT_HYPERPIPELINE_FORMAT in
{1,true,yes,on} (case-insensitive) flips every executable in the loop
to .dat; anything else (including unset) preserves legacy behaviour.
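
A minimal sketch of how such a check could look, assuming the value is read from the process environment. The helper name `hyperpipeline_enabled` and the whitespace stripping are illustrative assumptions, not taken from the commit:

```python
import os

# Truthy spellings listed in the description; anything else means "legacy".
_TRUTHY = frozenset({"1", "true", "yes", "on"})


def hyperpipeline_enabled(environ=None):
    """Return True when RIFT_HYPERPIPELINE_FORMAT holds a truthy value.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise in tests.  (Hypothetical helper.)
    """
    env = os.environ if environ is None else environ
    value = env.get("RIFT_HYPERPIPELINE_FORMAT", "")
    # Case-insensitive comparison; stripping whitespace is an assumption.
    return value.strip().lower() in _TRUTHY
```

For example, `hyperpipeline_enabled({"RIFT_HYPERPIPELINE_FORMAT": "Yes"})` is true, while an empty mapping or a value of `"0"` falls through to the legacy branch.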

Per discussion at design time:

* CIP-output grids carry posterior draws with no associated likelihood;
  the mandatory lnL/sigma_lnL columns are filled with 0.  ILE's
  --sim-grid intersection with lalsimutils.valid_params drops them
  harmlessly.

* Puffball + fetch output writers mirror the input file's column header,
  making puff a format-preserving operation that auto-supports any
  optional groups (eccentricity, lambda, eos_index, distance) the
  upstream CIP emits.

* convert_job (LI posterior_samples for --test-args convergence
  testing) is *skipped* in hyperpipeline mode -- its output schema
  (mass_1, spin1x, ...) differs from the hyperpipeline column schema.
  A hyperpipeline-aware convergence-test converter is a separate
  workstream.

* Join job follows the EOS-style head-block + cat | sort | uniq | shuf
  shell pattern, not util_HyperCombine.py (which does weighted
  averaging -- wrong semantics for a grid join where every row is a
  unique posterior draw to preserve).

* Mass and distance unit conversions: the hyperpipeline file stores m1,
  m2 in solar masses and distance in Mpc (matching the legacy ILE ASCII
  writer convention), while ChooseWaveformParams stores masses in kg and
  dist in metres.  hyperpipeline_io.PARAM_DISK_TO_SI declares the
  scaling once; both writer and reader apply it.

* On-disk column names follow the LALInference / posterior-export
  convention (a1x, a1y, a1z, a2x, a2y, a2z), which differs from the
  ChooseWaveformParams attribute names (s1x, s1y, ...).  An alias map
  (COLUMN_ALIAS_DISK_TO_ATTR) bridges the two; this mapping was
  previously duplicated across
  util_FitAndEvaluate_GenericCoordinates.py and
  util_ConstructIntrinsicPosterior_GenericCoordinates.py and is now
  centralised in hyperpipeline_io.
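
The unit-scale and alias maps from the last two bullets can be pictured roughly as follows. This is a sketch under stated assumptions: the map names follow the description, but the function signatures, the multiply-to-get-SI direction, and the hard-coded constants (approximate stand-ins for what the lal module would supply) are illustrative only:

```python
# Approximate SI constants used as stand-ins; the real code would
# presumably take these from the lal module rather than hard-coding them.
MSUN_SI = 1.988409870698051e30  # kg per solar mass
MPC_SI = 3.085677581491367e22   # metres per megaparsec

# Assumed direction: on-disk value * factor = SI value.
PARAM_DISK_TO_SI = {"m1": MSUN_SI, "m2": MSUN_SI, "dist": MPC_SI}

# On-disk spin column name -> ChooseWaveformParams attribute name.
COLUMN_ALIAS_DISK_TO_ATTR = {
    "a1x": "s1x", "a1y": "s1y", "a1z": "s1z",
    "a2x": "s2x", "a2y": "s2y", "a2z": "s2z",
}
_ATTR_TO_DISK = {attr: disk for disk, attr in COLUMN_ALIAS_DISK_TO_ATTR.items()}


def disk_to_attr(name):
    """Map an on-disk column name to the attribute name (identity if no alias)."""
    return COLUMN_ALIAS_DISK_TO_ATTR.get(name, name)


def attr_to_disk(name):
    """Map an attribute name to the on-disk column name (identity if no alias)."""
    return _ATTR_TO_DISK.get(name, name)


def disk_to_si(name, value):
    """Scale an on-disk value to SI; unlisted parameters pass through unchanged."""
    return value * PARAM_DISK_TO_SI.get(name, 1.0)


def si_to_disk(name, value):
    """Inverse of disk_to_si."""
    return value / PARAM_DISK_TO_SI.get(name, 1.0)
```

Keeping both maps declarative means the writer and reader can loop over column names without per-parameter special cases; any parameter absent from either map is written and read unchanged.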

Files
-----

* RIFT/misc/hyperpipeline_io.py
  Added:
  - PARAM_DISK_TO_SI: declarative unit-scale map for m1, m2, dist.
  - _disk_to_si / _si_to_disk: scalar conversions used by the writer/
    reader.
  - COLUMN_ALIAS_DISK_TO_ATTR / disk_to_attr / attr_to_disk: bridge
    between on-disk a1x...a2z naming and ChooseWaveformParams's s1x...
    s2z attributes.
  - DEFAULT_GRID_SUFFIX / with_grid_suffix: auto-append '.dat' to a
    bare basename, mirroring lalsimutils.ChooseWaveformParams_array_to_xml's
    auto-append of '.xml.gz'.  Lets call sites pass the same basename
    to either writer.
  - write_grid_from_P_list(fname, P_list, columns, lal_module,
    lalsimutils_module, lnL_values, sigma_lnL_values): emit a
    hyperpipeline grid file from a list of ChooseWaveformParams.  Mass
    and distance values are converted from P's SI units to the on-disk
    convention; spin attributes (s1x...) are mapped to the on-disk
    aliases (a1x...).  lnL/sigma_lnL columns default to 0 when no
    values are supplied.
  - read_grid_to_P_list(fname, P_factory, lal_module, valid_params):
    inverse of the writer.  A column is "active" if its disk-name OR
    its alias-resolved attr-name appears in valid_params, so callers
    can pass lalsimutils.valid_params directly and still get spin
    columns assigned correctly.

* bin/integrate_likelihood_extrinsic_batchmode
  Wrapped the --sim-grid path.  When RIFT_HYPERPIPELINE_FORMAT is set
  or the file sniffs as hyperpipeline, dispatches to read_grid_to_P_list
  which applies the on-disk -> SI unit conversion and the column-name
  alias resolution.  The legacy genfromtxt branch is preserved verbatim,
  including its `if P.m1 < 1e15: P.m1 *= MSUN_SI` heuristic.  The
  per-P setup (radec, fref, fmin, tref, dist override) was hoisted out
  of the legacy branch into a shared loop so both branches apply it.

* bin/util_ConstructIntrinsicPosterior_GenericCoordinates.py
  Both ChooseWaveformParams_array_to_xml writer sites (the early-exit
  one inside the `--check-good-enough` path at ~line 2953, and the
  final-export one at ~line 3486) now have a guarded branch that calls
  hyperpipeline_io.write_grid_from_P_list when the env var is set.
  Column set is built from the same opts flags that determined the
  reader column set in the first commit (use_eccentricity, use_meanPerAno,
  input_tides, input_eos_index).

* bin/util_ParameterPuffball.py
  Wrapped both --inj-file (read) and --inj-file-out (write) paths.
  When the input file sniffs as hyperpipeline, P_list is loaded via
  read_grid_to_P_list and the input's column header is preserved on
  output (puff is a format pass-through: perturb rows, keep columns).

* bin/util_FetchExternalGrid.py
  Default base_pattern flips from "overlap-grid-*.xml.gz" to
  "overlap-grid-*.dat" in hyperpipeline mode.  The default `shutil.copyfile`
  branch is format-agnostic and unchanged; the n_max truncation path
  uses hyperpipeline I/O when the source file is hyperpipeline.  The
  legacy bug where the truncated `P_list_reduced` is computed but the
  full `P_list` is exported is preserved verbatim for behaviour
  compatibility.

* RIFT/misc/dag_utils.py
  write_joingrids_sub now generates a different join_grids.sh in
  hyperpipeline mode: it takes the comment block from the first shard
  as the header, then concatenates non-comment data rows from every
  shard, deduplicates with `sort -u`, and shuffles with `shuf` to
  interleave spokes.  Mirrors create_eos_posterior_pipeline's
  join_post.sh pattern exactly.  fname_out's suffix flips from
  '.xml.gz' to '.dat' accordingly.  The legacy igwn_ligolw_add path
  is preserved.

* bin/create_event_parameter_pipeline_BasicIteration
  Defined three pipeline-level variables once near the top:
      _use_hpip_pipeline = env var truthy?
      grid_suffix        = "dat" or "xml.gz"
      sim_grid_flag      = "--sim-grid" or "--sim-xml"
  Then surgically substituted at every site that handles one of the
  four target file families (overlap-grid-N, puffball-N, fetch-N,
  output-grid-*-*):
    - seed-grid copy + count (lines 537-538): for hyperpipeline,
      counts data rows via `wc` after stripping comments instead of
      lalsimutils.xml_to_ChooseWaveformParams_array.
    - command-single.sh writer (line 574): uses sim_grid_flag.
    - ile_args / ile_args_forpuff / ile_args_forfetch (lines 583-585):
      uses sim_grid_flag and grid_suffix.
    - transfer_file_names / transfer_file_names_puff /
      transfer_file_names_fetch (lines 735, 757, 772).
    - join_cip_job, puff_job, fetch_job sub creation (lines 1048,
      1064, 1082).
    - regenerate-self command (line 1633), subdag link (line 1708),
      fetch_subdag (lines 1724, 1726).
    - cip_check_work.sh monitoring (line 2080): hyperpipeline branch
      counts non-comment data rows via `grep -hE -v '^#' | grep -cv
      '^[[:space:]]*$'` instead of `igwn_ligolw_print -c mass1 |
      wc -l`.
  convert_job (line 1199, gated by --test-args) is wrapped in
  `if not _use_hpip_pipeline:`; the matching DAG-node creation site
  at line 1857 is gated identically.  EXTR_out / convertExtr /
  batchConvertExtr paths are intentionally untouched -- they handle a
  different file family (per-shard ILE extrinsic posterior export)
  outside the CIP -> ILE intrinsic-grid handoff.
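
The hyperpipeline join and row-count steps described above are shell one-liners in the commit. The Python below is only an equivalent sketch of their semantics: header from the first shard, de-duplicated data rows, shuffled order. The function names, the in-memory list-of-lines interface, and the `seed` parameter are assumptions made for illustration:

```python
import random


def _is_data_row(line):
    """True for lines that are neither '#' comments nor blank.

    Mirrors `grep -v '^#'` followed by `grep -v '^[[:space:]]*$'`.
    """
    return not line.startswith("#") and line.strip() != ""


def count_data_rows(lines):
    """Count non-comment, non-blank rows, as the monitoring script does."""
    return sum(1 for line in lines if _is_data_row(line))


def join_shards(shards, seed=None):
    """Join per-spoke shards into one composite grid.

    `shards` is a list of shards, each a list of text lines.  The comment
    lines of the first shard become the header; data rows from all shards
    are de-duplicated (like `sort -u`) and shuffled (like `shuf`).
    """
    if not shards:
        return []
    header = [ln.rstrip("\n") for ln in shards[0] if ln.startswith("#")]
    unique_rows = sorted(
        {ln.rstrip("\n") for shard in shards for ln in shard if _is_data_row(ln)}
    )
    # Shuffle so rows from different spokes are interleaved.
    random.Random(seed).shuffle(unique_rows)
    return header + unique_rows
```

No averaging or merging of rows takes place, consistent with the stated reason for not using util_HyperCombine.py: each distinct row is kept exactly once.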

Tests
-----

test/test_hyperpipeline_io.py extended to 17 tests (was 12).  New
coverage:

* Grid round-trip with mass unit conversion (kg <-> solar mass).
* Distance unit conversion (m <-> Mpc).
* Auto-suffix append (passes a basename, gets back foo.dat).
* Column alias bridge (a1x written from P.s1x, read back into P.s1x).
* lal_module=None passthrough (no unit conversion when lal not
  available).

Tests use _FakeP and _FakeLal stand-ins so they run without lalsuite/
scipy; the alias map and unit constants are exercised exactly as the
real ChooseWaveformParams / lal would expose them.

A separate end-to-end smoke test (run inline, not committed) drove a
synthetic CIP write -> puff read+perturb -> ILE read cycle and
verified mass round-trip through the full chain plus spin perturbation
visibility.  A second smoke test exercised the hyperpipeline
join_grids.sh shell script against three synthetic per-spoke shards
and confirmed the composite parses cleanly.

Not yet touched
---------------

This commit covers BasicIteration only.  Sibling pipeline drivers that
reuse the same grid-file conventions and would benefit from the same
treatment:

* bin/cepp_basic_htcondor (htcondor-only twin of BasicIteration)
* bin/util_RIFT_pseudo_pipe.py
* bin/util_RIFT_pseudo_pipe_lowlatency.py
* bin/util_RIFT_hyperpipe.py

Also unchanged by design (per scope decisions):

* convert_output_format_ile2inference and the EXTR_out -> LI
  posterior_samples path -- different file family, different schema.
* util_CleanILE.py (legacy still routes through it; the new
  util_CleanILE_hyperpipeline.py from the first commit handles hyperpipeline).
* The convergence-test convert_job that builds posterior_samples-N.dat
  from overlap-grid -- skipped in hyperpipeline mode; a hyperpipeline-
  aware test converter is a separate workstream.

Followups
---------

Once the sibling drivers above are converted, util_ParameterPuffball.py's
--inj-file help text and the related argparse defaults can drop
references to "XML file" in favour of format-agnostic phrasing.  The
EXTR_out -> LI conversion path is the last large consumer of XML in the
intrinsic-pipeline domain and is the natural target after that.