DRAFT: Pipeline: reformat to ascii #129
Open
oshaughn wants to merge 1 commit into
oshaughn (Owner) commented on May 9, 2026
Second pass of the hyperpipeline-format work. The first commit standardised the ILE -> CIP per-shard ASCII chain (lnL composites); this commit standardises the *return* trip -- the per-iteration intrinsic-parameter grid CIP hands back to the next ILE iteration via util_ParameterPuffball.py, util_FetchExternalGrid.py, and the join_grids job in dag_utils.

When the env var RIFT_HYPERPIPELINE_FORMAT is set, the entire create_event_parameter_pipeline_BasicIteration workflow now operates on .dat files end-to-end, mirroring the file-naming and inline shell glue conventions already used by create_eos_posterior_pipeline. When the env var is unset, behaviour is byte-identical to the legacy XML-based pipeline.

Motivation
----------

The first commit covered the ILE -> consolidate -> CIP read direction. The other half of the iterative loop -- CIP write -> puff -> next ILE read -- still routed through XML via lalsimutils.ChooseWaveformParams_array_to_xml and lalsimutils.xml_to_ChooseWaveformParams_array. Every intermediate file was either *.xml.gz (overlap-grid-N, puffball-N, fetch-N, output-grid-*-*) or required a `convert` job to bridge XML <-> ASCII for downstream consumers.

create_eos_posterior_pipeline already demonstrates the target shape: .dat files throughout, util_HyperCombine.py for join/consolidate, inline shell scripts (`head -n 1 ... > $out; cat ... | sort -u | shuf >> $out`) gluing stages together. This commit rationalises BasicIteration onto the same template behind the hyperpipeline env-var flag.

Activation and design choices
-----------------------------

Same env-var convention as commit oshaughn#1: RIFT_HYPERPIPELINE_FORMAT in {1,true,yes,on} (case-insensitive) flips every executable in the loop to .dat; anything else (including unset) preserves legacy behaviour.

Per discussion at design time:

* CIP-output grids carry posterior draws with no associated likelihood; the mandatory lnL/sigma_lnL columns are filled with 0.
  ILE's --sim-grid intersection with lalsimutils.valid_params drops them harmlessly.
* Puffball + fetch output writers mirror the input file's column header, making puff a format-preserving operation that auto-supports any optional groups (eccentricity, lambda, eos_index, distance) the upstream CIP emits.
* convert_job (LI posterior_samples for --test-args convergence testing) is *skipped* in hyperpipeline mode -- its output schema (mass_1, spin1x, ...) is a different format than hyperpipeline. A hyperpipeline-aware convergence-test converter is a separate workstream.
* Join job follows the EOS-style head-block + cat | sort | uniq | shuf shell pattern, not util_HyperCombine.py (which does weighted averaging -- wrong semantics for a grid join where every row is a unique posterior draw to preserve).
* Mass and distance unit conversions: the hyperpipeline file stores m1, m2 in solar masses and distance in Mpc (matching the legacy ILE ASCII writer convention), while ChooseWaveformParams stores m1 in kg and dist in metres. hyperpipeline_io.PARAM_DISK_TO_SI declares the scaling once; both writer and reader apply it.
* On-disk column names follow the LALInference / posterior-export convention (a1x, a1y, a1z, a2x, a2y, a2z), which differs from the ChooseWaveformParams attribute names (s1x, s1y, ...). An alias map (COLUMN_ALIAS_DISK_TO_ATTR) bridges the two; this map already existed scattered across util_FitAndEvaluate_GenericCoordinates.py and util_ConstructIntrinsicPosterior_GenericCoordinates.py and is now centralised in hyperpipeline_io.

Files
-----

* RIFT/misc/hyperpipeline_io.py

  Added:
  - PARAM_DISK_TO_SI: declarative unit-scale map for m1, m2, dist.
  - _disk_to_si / _si_to_disk: scalar conversions used by the writer/reader.
  - COLUMN_ALIAS_DISK_TO_ATTR / disk_to_attr / attr_to_disk: bridge between on-disk a1x...a2z naming and ChooseWaveformParams's s1x...s2z attributes.
  - DEFAULT_GRID_SUFFIX / with_grid_suffix: auto-append '.dat' to a bare basename, mirroring lalsimutils.ChooseWaveformParams_array_to_xml's auto-append of '.xml.gz'. Lets call sites pass the same basename to either writer.
  - write_grid_from_P_list(fname, P_list, columns, lal_module, lalsimutils_module, lnL_values, sigma_lnL_values): emit a hyperpipeline grid file from a list of ChooseWaveformParams. Mass and distance values are converted from P's SI units to the on-disk convention; spin attributes (s1x...) are mapped to the on-disk aliases (a1x...). lnL/sigma_lnL columns default to 0 when no values are supplied.
  - read_grid_to_P_list(fname, P_factory, lal_module, valid_params): inverse of the writer. A column is "active" if its disk-name OR its alias-resolved attr-name appears in valid_params, so callers can pass lalsimutils.valid_params directly and still get spin columns assigned correctly.

* bin/integrate_likelihood_extrinsic_batchmode

  Wrapped the --sim-grid path. When RIFT_HYPERPIPELINE_FORMAT is set or the file sniffs as hyperpipeline, dispatches to read_grid_to_P_list which applies the on-disk -> SI unit conversion and the column-name alias resolution. The legacy genfromtxt branch is preserved verbatim, including its `if P.m1 < 1e15: P.m1 *= MSUN_SI` heuristic. The per-P setup (radec, fref, fmin, tref, dist override) was hoisted out of the legacy branch into a shared loop so both branches apply it.

* bin/util_ConstructIntrinsicPosterior_GenericCoordinates.py

  Both ChooseWaveformParams_array_to_xml writer sites (the early-exit one inside the `--check-good-enough` path at ~line 2953, and the final-export one at ~line 3486) now have a guarded branch that calls hyperpipeline_io.write_grid_from_P_list when the env var is set. Column set is built from the same opts flags that determined the reader column set in commit oshaughn#1 (use_eccentricity, use_meanPerAno, input_tides, input_eos_index).

* bin/util_ParameterPuffball.py

  Wrapped both --inj-file (read) and --inj-file-out (write) paths. When the input file sniffs as hyperpipeline, P_list is loaded via read_grid_to_P_list and the input's column header is preserved on output (puff is a format pass-through: perturb rows, keep columns).

* bin/util_FetchExternalGrid.py

  Default base_pattern flips from "overlap-grid-*.xml.gz" to "overlap-grid-*.dat" in hyperpipeline mode. The default `shutil.copyfile` branch is format-agnostic and unchanged; the n_max truncation path uses hyperpipeline I/O when the source file is hyperpipeline. The legacy bug where the truncated `P_list_reduced` is computed but the full `P_list` is exported is preserved verbatim for behaviour compatibility.

* RIFT/misc/dag_utils.py

  write_joingrids_sub now generates a different join_grids.sh in hyperpipeline mode: it takes the comment block from the first shard as the header, then concatenates non-comment data rows from every shard, deduplicates with `sort -u`, and shuffles with `shuf` to interleave spokes. Mirrors create_eos_posterior_pipeline's join_post.sh pattern exactly. fname_out's suffix flips from '.xml.gz' to '.dat' accordingly. The legacy igwn_ligolw_add path is preserved.

* bin/create_event_parameter_pipeline_BasicIteration

  Defined three pipeline-level variables once near the top:

      _use_hpip_pipeline = env var truthy?
      grid_suffix        = "dat" or "xml.gz"
      sim_grid_flag      = "--sim-grid" or "--sim-xml"

  Then surgically substituted at every site that handles one of the four target file families (overlap-grid-N, puffball-N, fetch-N, output-grid-*-*):
  - seed-grid copy + count (lines 537-538): for hyperpipeline, counts data rows via `wc` after stripping comments instead of lalsimutils.xml_to_ChooseWaveformParams_array.
  - command-single.sh writer (line 574): uses sim_grid_flag.
  - ile_args / ile_args_forpuff / ile_args_forfetch (lines 583-585): uses sim_grid_flag and grid_suffix.
  - transfer_file_names / transfer_file_names_puff / transfer_file_names_fetch (lines 735, 757, 772).
  - join_cip_job, puff_job, fetch_job sub creation (lines 1048, 1064, 1082).
  - regenerate-self command (line 1633), subdag link (line 1708), fetch_subdag (lines 1724, 1726).
  - cip_check_work.sh monitoring (line 2080): hyperpipeline branch counts non-comment data rows via `grep -hE -v '^#' | grep -cv '^[[:space:]]*$'` instead of `igwn_ligolw_print -c mass1 | wc -l`.

  convert_job (line 1199, gated by --test-args) is wrapped in `if not _use_hpip_pipeline:`; the matching DAG-node creation site at line 1857 is gated identically.

  EXTR_out / convertExtr / batchConvertExtr paths are intentionally untouched -- they handle a different file family (per-shard ILE extrinsic posterior export) outside the CIP -> ILE intrinsic-grid handoff.

Tests
-----

test/test_hyperpipeline_io.py extended to 17 tests (was 12). New coverage:

* Grid round-trip with mass unit conversion (kg <-> solar mass).
* Distance unit conversion (m <-> Mpc).
* Auto-suffix append (passes a basename, gets back foo.dat).
* Column alias bridge (a1x written from P.s1x, read back into P.s1x).
* lal_module=None passthrough (no unit conversion when lal not available).

Tests use _FakeP and _FakeLal stand-ins so they run without lalsuite/scipy; the alias map and unit constants are exercised exactly as the real ChooseWaveformParams / lal would expose them.

A separate end-to-end smoke test (run inline, not committed) drove a synthetic CIP write -> puff read+perturb -> ILE read cycle and verified mass round-trip through the full chain plus spin perturbation visibility. A second smoke test exercised the hyperpipeline join_grids.sh shell script against three synthetic per-spoke shards and confirmed the composite parses cleanly.

Not yet touched
---------------

This commit covers BasicIteration only.
Sibling pipeline drivers that reuse the same grid-file conventions and would benefit from the same treatment:

* bin/cepp_basic_htcondor (htcondor-only twin of BasicIteration)
* bin/util_RIFT_pseudo_pipe.py
* bin/util_RIFT_pseudo_pipe_lowlatency.py
* bin/util_RIFT_hyperpipe.py

Also unchanged by design (per scope decisions):

* convert_output_format_ile2inference and the EXTR_out -> LI posterior_samples path -- different file family, different schema.
* util_CleanILE.py (legacy still routes through it; the new util_CleanILE_hyperpipeline.py from commit oshaughn#1 handles hyperpipeline).
* The convergence-test convert_job that builds posterior_samples-N.dat from overlap-grid -- skipped in hyperpipeline mode; a hyperpipeline-aware test converter is a separate workstream.

Followups
---------

Once the sibling drivers above are converted, util_ParameterPuffball.py's --inj-file help text and the related argparse defaults can drop references to "XML file" in favour of format-agnostic phrasing. The EXTR_out -> LI conversion path is the last large consumer of XML in the intrinsic-pipeline domain and is the natural target after that.
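
For reviewers, the activation rule from "Activation and design choices" and the three pipeline-level variables listed under bin/create_event_parameter_pipeline_BasicIteration can be sketched roughly as below. This is an illustrative sketch, not the code in the commit: the helper names `hyperpipeline_enabled` and `pipeline_file_settings` are hypothetical, and only the env-var name, the truthy set, and the suffix/flag values come from the description above.

```python
import os

# Truthy values named in the PR description; the comparison is case-insensitive.
_TRUTHY = {"1", "true", "yes", "on"}


def hyperpipeline_enabled(environ=None):
    """Return True when RIFT_HYPERPIPELINE_FORMAT holds a truthy value.

    Hypothetical helper: anything else, including an unset variable,
    selects the legacy XML behaviour.
    """
    env = os.environ if environ is None else environ
    return env.get("RIFT_HYPERPIPELINE_FORMAT", "").lower() in _TRUTHY


def pipeline_file_settings(environ=None):
    """Return (use_hpip, grid_suffix, sim_grid_flag) for the current mode."""
    use_hpip = hyperpipeline_enabled(environ)
    grid_suffix = "dat" if use_hpip else "xml.gz"
    sim_grid_flag = "--sim-grid" if use_hpip else "--sim-xml"
    return use_hpip, grid_suffix, sim_grid_flag
```

Passing a plain dict instead of `os.environ` keeps the check testable without touching the process environment.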
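
The hyperpipeline join described under RIFT/misc/dag_utils.py is a shell script in the commit; the same semantics can be sketched in Python for readers who want to reason about it without the shell. The function name `join_grid_shards` and the example column header are hypothetical, and the sketch assumes the header is the leading run of `#` lines in the first shard.

```python
import random


def join_grid_shards(shard_texts, seed=None):
    """Join per-spoke grid shards given as strings (hypothetical helper).

    Keeps the leading '#' comment block of the first shard as the header,
    then emits every distinct data row from all shards in shuffled order.
    """
    if not shard_texts:
        return ""
    # Header: leading comment lines of the first shard only.
    header = []
    for line in shard_texts[0].splitlines():
        if not line.startswith("#"):
            break
        header.append(line)
    # Data rows: non-comment, non-blank lines from every shard.
    rows = set()
    for text in shard_texts:
        for line in text.splitlines():
            if line.strip() and not line.startswith("#"):
                rows.add(line)
    # sorted(set) plays the role of `sort -u`; the shuffle plays the role of `shuf`.
    unique_rows = sorted(rows)
    random.Random(seed).shuffle(unique_rows)
    return "\n".join(header + unique_rows) + "\n"
```

As with `sort -u`, deduplication here is textual: two rows that differ only in number formatting would both be kept.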
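
The unit-scale map and column-alias bridge described for RIFT/misc/hyperpipeline_io.py can be illustrated with a small round trip. This is a sketch under stated assumptions: the names PARAM_DISK_TO_SI and COLUMN_ALIAS_DISK_TO_ATTR come from the description, but their dict layout here is a guess, the helpers `row_to_attrs` and `attrs_to_row` are hypothetical, and the constants are approximate values hard-coded so the example runs without lal.

```python
# Approximate SI constants, hard-coded so this runs without lal (assumption).
MSUN_SI = 1.988409870698051e30  # kg per solar mass
PC_SI = 3.085677581491367e16    # metres per parsec

# Assumed layout: multiply the on-disk value by the factor to get SI.
PARAM_DISK_TO_SI = {"m1": MSUN_SI, "m2": MSUN_SI, "dist": 1e6 * PC_SI}

# On-disk spin column names -> ChooseWaveformParams attribute names.
COLUMN_ALIAS_DISK_TO_ATTR = {
    "a1x": "s1x", "a1y": "s1y", "a1z": "s1z",
    "a2x": "s2x", "a2y": "s2y", "a2z": "s2z",
}


def row_to_attrs(columns, values):
    """Map one on-disk row to {attribute name: SI value} (hypothetical helper)."""
    attrs = {}
    for name, value in zip(columns, values):
        attr = COLUMN_ALIAS_DISK_TO_ATTR.get(name, name)
        attrs[attr] = value * PARAM_DISK_TO_SI.get(name, 1.0)
    return attrs


def attrs_to_row(columns, attrs):
    """Inverse mapping: SI attribute values back to on-disk units and names."""
    return [
        attrs[COLUMN_ALIAS_DISK_TO_ATTR.get(name, name)] / PARAM_DISK_TO_SI.get(name, 1.0)
        for name in columns
    ]
```

Declaring the scale factors once and applying them in both directions is what keeps the writer and reader from drifting apart, which is the stated purpose of PARAM_DISK_TO_SI.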