CRAFT: Cancer Repertoire Anomaly Finding Transformer


Code accompanying the manuscript:

*A generative reference grammar of healthy TCR repertoires reveals cancer-associated immune remodeling*, Balan et al. (2026)

CRAFT is a conditional sequence-to-sequence transformer built on the BART architecture that learns the generative grammar of T cell receptor (TCR) beta-chain recombination. Trained on 131 million productive rearrangements from 666 healthy donors, CRAFT encodes V, D, and J gene usage alongside nucleotide-level CDR3 structure to produce biologically grounded TCR embeddings via teacher-forced inference.

Quickstart

```shell
# 1. Clone the repo and install the craft package in editable mode.
git clone https://github.com/KarchinLab/CRAFT.git
cd CRAFT
pip install -e .
pip install -r requirements.txt

# 2. Point CRAFT at your project directory.
export CRAFT_PROJECT_DIR=$PWD

# 3. Download the trained model checkpoint and tokenizers from Zenodo
#    (DOI 10.5281/zenodo.19891746) and unpack them under $CRAFT_PROJECT_DIR
#    so the on-disk layout is:
#       tokenizers/cdr3_tokenizer.json
#       tokenizers/bart_custom_tokenizer/
#       models/06_model_phase3c/final_model.pth
#    See the "Download model weights" section below for the exact recipe.

# 4. Run the quickstart with a 10-row example file.
python examples/quickstart.py --input examples/example_adaptive_immunoseq.tsv
```

See examples/README.md for a tour of the three accepted input formats (Adaptive immunoSEQ, AIRR-C, post-IgBLAST).

Download model weights and tokenizers

The trained CRAFT checkpoint, the BART custom tokenizer, and the CDR3 BPE tokenizer are deposited on Zenodo:

DOI: 10.5281/zenodo.19891746

The deposition is a single zip file (craft_zenodo_upload.zip, ~591 MB compressed). It is packaged as one archive so that the tokenizers/bart_custom_tokenizer/ subdirectory survives the Zenodo upload, which would otherwise flatten it into loose files. Download and unpack it from your project root:

```shell
cd $CRAFT_PROJECT_DIR
curl -L "https://zenodo.org/records/19891746/files/craft_zenodo_upload.zip" -o craft_zenodo_upload.zip
unzip craft_zenodo_upload.zip
```

The unpacked layout is exactly what every downstream script expects by default:

```
$CRAFT_PROJECT_DIR/
├── tokenizers/
│   ├── cdr3_tokenizer.json
│   └── bart_custom_tokenizer/
│       ├── added_tokens.json
│       ├── merges.txt
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── vocab.json
└── models/
    └── 06_model_phase3c/
        └── final_model.pth
```

With this layout in place, most invocations need only --project_dir "$CRAFT_PROJECT_DIR".
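Before running anything downstream, it can help to sanity-check the unpacked layout. A minimal sketch (the paths are exactly those in the tree above; `check_layout` is a hypothetical helper, not part of the repo):

```shell
# Verify that the three artifacts every downstream script expects are
# present under a project root. Not part of the repo; just a convenience.
check_layout() {
  local root="$1" status=0
  for p in \
      tokenizers/cdr3_tokenizer.json \
      tokenizers/bart_custom_tokenizer/vocab.json \
      models/06_model_phase3c/final_model.pth; do
    if [ ! -e "$root/$p" ]; then
      echo "missing: $p"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_layout "$CRAFT_PROJECT_DIR" && echo "layout OK"
```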

Repository structure

```
CRAFT/
├── craft/                    # Installable Python package (model, datasets, training, inference)
│   ├── model.py              # GeneBARTModelComposite, GeneBARTModelSingleHead, AttentionPooling
│   ├── datasets.py           # GeneDatasetTrain, GeneDatasetInference, GeneDatasetPerplexity
│   ├── training.py           # Training loop & utilities
│   ├── embeddings.py         # Inference utilities (incl. format_tcr_dataframe)
│   ├── plotting.py           # Plotting helpers
│   ├── sctcr.py              # Single-cell TCR helpers
│   └── igblast.py            # IgBLAST junction analyzer
├── model_training/           # Training pipeline scripts
├── ocscc_analyses/           # OCSCC immunotherapy cohort analyses (Figure 2)
├── mcpas_gbm_analyses/       # McPAS antigen benchmark + GBM single-cell analyses (Figures 3, 4)
├── examples/                 # Three-format input examples + quickstart.py
├── .github/workflows/        # CI smoke test
├── pyproject.toml            # Package metadata
├── requirements.txt          # Curated minimal Python deps (20 pinned)
├── requirements-cluster-freeze.txt  # Full bert_pretrain venv freeze (271 deps)
├── r-requirements.txt        # CRAN packages used by gbm_*.R
├── CITATION.cff              # Citation metadata
├── LICENSE                   # MIT
└── README.md
```

model_training/

Scripts for training the CRAFT model, organized as a numbered pipeline. Each step has a .py script and a companion .sh SLURM launcher. All scripts take --project_dir as a required CLI flag; the 06_phase* training scripts additionally expose every hyperparameter as a CLI flag with the manuscript value as the default.

| Step | Script prefix | Description |
|------|---------------|-------------|
| 0 | `00_pre_process_data` | Preprocess raw Adaptive immunoSEQ data |
| 1 | `01_format_data` | Format data for model input |
| 2 | `02_bart_tokenizer` | Train the BART tokenizer (use `--imgt_gene_path` for the IMGT human TRB CSV) |
| 3 | `03_tokenize_genes` | Tokenize V, D, J genes using the IMGT hierarchical scheme |
| 4 | `04_cdr3_tokenizer` | Train the CDR3 nucleotide tokenizer |
| 5 | `05_tokenize_subregions` | Tokenize CDR3 subregions (V-D and D-J junctions) |
| 6 | `06_phase{1,2,3}*_train_model` | Multi-phase curriculum training (phases 1-3 with variants) |
| 8 | `08_test_decoding_greedy` | Greedy decoding evaluation |
| 11 | `11a-d_*` | Model quality metrics: CDR3 length distributions, perplexity, V/J length distributions, k-mer composition |
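Run end to end, the data-prep half of the pipeline is just the numbered scripts in order, each pointed at the same project root. A dry-run sketch (the basenames below are the prefixes from the table; confirm the actual filenames under model_training/ before running):

```shell
# Print the command for each preprocessing/tokenization step (00-05) so
# flags can be reviewed before launching via the companion .sh launchers.
# Basenames are the table's prefixes; verify against model_training/.
for step in \
    00_pre_process_data 01_format_data 02_bart_tokenizer \
    03_tokenize_genes 04_cdr3_tokenizer 05_tokenize_subregions; do
  echo "python model_training/${step}.py --project_dir \"\$CRAFT_PROJECT_DIR\""
done
```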

ocscc_analyses/

Analyses of the oral cavity squamous cell carcinoma (OCSCC) immunotherapy cohort (Luoma et al., 18 patients with neoadjuvant checkpoint blockade).

| Prefix | Description |
|--------|-------------|
| `00_*` | Clone frequency computation, amino acid indexing, sample indexing |
| `01_craft_*` | CRAFT preprocessing, tokenization, embedding extraction, and pooling |
| `02_tcrbert_*` | TCR-BERT embedding extraction |
| `03_esm_*` | ESM-1b embedding extraction |
| `04_pca_*` | PCA dimensionality reduction |
| `05_metrics_*` | Dispersion and longitudinal repertoire metrics |
| `08_sceptr_*` | SCEPTR embedding extraction |
| `09_giana_*` | GIANA embedding extraction |

Jupyter notebooks:

  • dispersion_metrics_hnscc.ipynb — Visualization of repertoire dispersion metrics (Figure 2)
  • longitudinal_metrics_hnscc.ipynb — Visualization of longitudinal repertoire remodeling (Figure 2)

mcpas_gbm_analyses/

Analyses for the McPAS-TCR antigen specificity benchmark (Figure 4) and the GBM single-cell immunotherapy cohort (Ling et al., oncolytic HSV-1 therapy; Figure 3).

Embedding pipelines (prefixed 01_ through 09_) follow the same structure as ocscc_analyses/ for CRAFT, TCR-BERT, ESM, SCEPTR, and GIANA, with _nsclc variants for cross-cohort evaluation; 01_craft_02b_tokenize_subregions provides an alternative tokenization.

IgBLAST processing (igblast_*): single-cell TCR realignment using IgBLAST with IMGT germline references, with sample-splitting scripts for each cohort (GBM, NSCLC, MIRA, TNBC, Yost).

GBM single-cell analyses (gbm_*):

| Script | Description |
|--------|-------------|
| `gbm_01*` | Baseline centroid shift for cell-type compartments |
| `gbm_02*` | Baseline shift for reactivity-stratified populations |
| `gbm_03_enrichment_tests` | Binomial tail enrichment across compartments and timepoints |
| `gbm_04-06_viz_*` | Visualization notebooks for cell-type shifts, reactivity, and enrichment (Figure 3) |
| `gbm_07_shift_labels` | Displacement-based clonotype labeling |
| `gbm_08_gex_de` | Differential gene expression (Wilcoxon rank-sum); R script using optparse |
| `gbm_09_viz_gex` | Gene expression visualization (Figure 3d) |
| `gbm_fig_*` | R scripts for final figure assembly (ridgeplots, binomial heatmaps, DE plots) |

McPAS-TCR analyses (mcpas_*):

  • mcpas_compute_metrics — Silhouette scores, kNN class-consistency, centroid clustering
  • mcpas_antigen_metrics_all_methods.ipynb — Benchmark comparison across all embedding methods (Figure 4)
  • mcpas_antigen_metrics_no_curr.ipynb — Ablation analysis

Requirements

  • Python ≥ 3.8 (manuscript was produced on Python 3.8.6; tested up to 3.10)
  • R 4.3.0 (for GBM figure generation and differential expression)
  • PyTorch (CPU is sufficient for inference; training requires CUDA)
  • IgBLAST (for single-cell TCR realignment)
  • IMGT germline reference databases

The Python dependencies are split into:

  • requirements.txt — curated minimal install for the CRAFT core (training, inference, ocscc_analyses, mcpas_gbm_analyses CRAFT/IgBLAST/GBM scripts).
  • requirements-cluster-freeze.txt — verbatim pip freeze of the JHU cluster venv used to produce the manuscript results, for exact reproducibility.

The benchmark-comparison scripts (03_esm_*, 08_sceptr_*, 09_giana_*) require their own dedicated environments because their upstream dependencies (TCRembedding, sceptr, tidytcells) conflict. Each .sh launcher tags its venv-activation line with # EDIT so reproducers can swap in the right environment per method.

R deps are listed in r-requirements.txt.

Running on a SLURM cluster

The .sh launchers under each analysis folder are SLURM batch files configured for the JHU/Rockfish cluster. To run them on a different cluster:

  1. Edit the lines tagged # EDIT: cluster-specific (-A account, --partition=, --qos=).
  2. Uncomment and update the # source /path/to/your/venv/bin/activate line in the scaffolding header.
  3. export CRAFT_PROJECT_DIR=/path/to/your/project_root (or pass --project_dir directly to each script).
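For bulk edits across many launchers, the tagged lines can be rewritten with sed. A hedged sketch (the function name and the account/partition/qos values are placeholders, and the exact `#SBATCH` spellings should be checked against the launcher you are editing):

```shell
# Rewrite the cluster-specific #SBATCH lines of a launcher in place,
# keeping a .bak copy of the original. Values are site-specific.
retarget_launcher() {
  local file="$1" account="$2" partition="$3" qos="$4"
  sed -i.bak \
      -e "s/^#SBATCH -A .*/#SBATCH -A ${account}/" \
      -e "s/^#SBATCH --partition=.*/#SBATCH --partition=${partition}/" \
      -e "s/^#SBATCH --qos=.*/#SBATCH --qos=${qos}/" \
      "$file"
}

# Usage (launcher path is illustrative):
#   retarget_launcher ocscc_analyses/some_launcher.sh my_account shared normal
```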

Data availability

  • Training data: Emerson et al. — 666 healthy donors, 131M productive TCR beta-chain rearrangements (Adaptive immunoSEQ).
  • OCSCC cohort: Luoma et al. — 18 patients with neoadjuvant checkpoint blockade (NCT02919683).
  • GBM cohort: Ling et al. — recurrent glioblastoma patients with oncolytic HSV-1 therapy (NCT03152318).
  • McPAS-TCR: 5,279 pathology-associated TCRs with antigen annotations.

Citation

If you use CRAFT in your research, please cite:

Balan A, Elhanati Y, Meza Landeros KE, et al. A generative reference
grammar of healthy TCR repertoires reveals cancer-associated immune
remodeling. (2026).

The full author list, affiliations, and machine-readable citation metadata are in CITATION.cff. The accompanying software release is archived on Zenodo at DOI 10.5281/zenodo.19891746.

The manuscript DOI and bioRxiv preprint URL will be added here once available.

License

MIT — see LICENSE.

Copyright © 2026 Karchin Lab, Johns Hopkins University.
