CRAFT: Cancer Repertoire Anomaly Finding Transformer


Code accompanying the manuscript:

*A generative reference grammar of healthy TCR repertoires reveals cancer-associated immune remodeling*, Balan et al. (2026)

CRAFT is a conditional sequence-to-sequence transformer built on the BART architecture that learns the generative grammar of T cell receptor (TCR) beta-chain recombination. Trained on 131 million productive rearrangements from 666 healthy donors, CRAFT encodes V, D, and J gene usage alongside nucleotide-level CDR3 structure to produce biologically grounded TCR embeddings via teacher-forced inference.

Quickstart

```shell
# 1. Clone the repo and install the craft package in editable mode.
git clone https://github.com/KarchinLab/CRAFT.git
cd CRAFT
pip install -e .
pip install -r requirements.txt

# 2. Point CRAFT at your project directory.
export CRAFT_PROJECT_DIR=$PWD

# 3. Download the trained model checkpoint and tokenizers from Zenodo
#    (DOI 10.5281/zenodo.19891746) and unpack them under $CRAFT_PROJECT_DIR
#    so the on-disk layout is:
#       tokenizers/cdr3_tokenizer.json
#       tokenizers/bart_custom_tokenizer/
#       models/06_model_phase3c/final_model.pth
#    See the "Download model weights" section below for the exact recipe.

# 4. Run the quickstart with a 10-row example file.
python examples/quickstart.py --input examples/example_adaptive_immunoseq.tsv
```

See examples/README.md for a tour of the three accepted input formats (Adaptive immunoSEQ, AIRR-C, post-IgBLAST).

Download model weights and tokenizers

The trained CRAFT checkpoint, the BART custom tokenizer, and the CDR3 BPE tokenizer are deposited on Zenodo:

DOI: 10.5281/zenodo.19891746

The deposition is a single zip file (craft_zenodo_upload.zip, ~591 MB compressed). It is packaged as one archive so that the tokenizers/bart_custom_tokenizer/ subdirectory survives the Zenodo upload, which would otherwise flatten it into loose files. Download and unpack it from your project root:

```shell
cd $CRAFT_PROJECT_DIR
curl -L "https://zenodo.org/records/19891746/files/craft_zenodo_upload.zip" -o craft_zenodo_upload.zip
unzip craft_zenodo_upload.zip
```

The unpacked layout is exactly what every downstream script expects by default:

```
$CRAFT_PROJECT_DIR/
├── tokenizers/
│   ├── cdr3_tokenizer.json
│   └── bart_custom_tokenizer/
│       ├── added_tokens.json
│       ├── merges.txt
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── vocab.json
└── models/
    └── 06_model_phase3c/
        └── final_model.pth
```

With this layout in place, most invocations need only --project_dir "$CRAFT_PROJECT_DIR".
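Before running anything downstream, it can help to sanity-check the unpacked layout. A minimal sketch (the paths are exactly those in the tree above; `check_layout` is a hypothetical helper, not part of the repo):

```shell
# Verify that the three artifacts every downstream script expects are
# present under a project root. Not part of the repo; just a convenience.
check_layout() {
  local root="$1" status=0
  for p in \
      tokenizers/cdr3_tokenizer.json \
      tokenizers/bart_custom_tokenizer/vocab.json \
      models/06_model_phase3c/final_model.pth; do
    if [ ! -e "$root/$p" ]; then
      echo "missing: $p"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_layout "$CRAFT_PROJECT_DIR" && echo "layout OK"
```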

Repository structure

```
CRAFT/
├── craft/                    # Installable Python package (model, datasets, training, inference)
│   ├── model.py              # GeneBARTModelComposite, GeneBARTModelSingleHead, AttentionPooling
│   ├── datasets.py           # GeneDatasetTrain, GeneDatasetInference, GeneDatasetPerplexity
│   ├── training.py           # Training loop & utilities
│   ├── embeddings.py         # Inference utilities (incl. format_tcr_dataframe)
│   ├── plotting.py           # Plotting helpers
│   ├── sctcr.py              # Single-cell TCR helpers
│   └── igblast.py            # IgBLAST junction analyzer
├── model_training/           # Training pipeline scripts
├── ocscc_analyses/           # OCSCC immunotherapy cohort analyses (Figure 2)
├── mcpas_gbm_analyses/       # McPAS antigen benchmark + GBM single-cell analyses (Figures 3, 4)
├── examples/                 # Three-format input examples + quickstart.py
├── .github/workflows/        # CI smoke test
├── pyproject.toml            # Package metadata
├── requirements.txt          # Curated minimal Python deps (20 pinned)
├── requirements-cluster-freeze.txt  # Full bert_pretrain venv freeze (271 deps)
├── r-requirements.txt        # CRAN packages used by gbm_*.R
├── CITATION.cff              # Citation metadata
├── LICENSE                   # MIT
└── README.md
```

model_training/

Scripts for training the CRAFT model, organized as a numbered pipeline. Each step has a .py script and a companion .sh SLURM launcher. All scripts take --project_dir as a required CLI flag; the 06_phase* training scripts additionally expose every hyperparameter as a CLI flag with the manuscript value as the default.

| Step | Script prefix | Description |
|------|---------------|-------------|
| 0 | `00_pre_process_data` | Preprocess raw Adaptive immunoSEQ data |
| 1 | `01_format_data` | Format data for model input |
| 2 | `02_bart_tokenizer` | Train the BART tokenizer (use `--imgt_gene_path` for the IMGT human TRB CSV) |
| 3 | `03_tokenize_genes` | Tokenize V, D, J genes using the IMGT hierarchical scheme |
| 4 | `04_cdr3_tokenizer` | Train the CDR3 nucleotide tokenizer |
| 5 | `05_tokenize_subregions` | Tokenize CDR3 subregions (V-D and D-J junctions) |
| 6 | `06_phase{1,2,3}*_train_model` | Multi-phase curriculum training (phases 1-3 with variants) |
| 8 | `08_test_decoding_greedy` | Greedy decoding evaluation |
| 11 | `11a-d_*` | Model quality metrics: CDR3 length distributions, perplexity, V/J length distributions, k-mer composition |
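Run end to end, the data-prep half of the pipeline is just the numbered scripts in order, each pointed at the same project root. A dry-run sketch (the basenames below are the prefixes from the table; confirm the actual filenames under model_training/ before running):

```shell
# Print the command for each preprocessing/tokenization step (00-05) so
# flags can be reviewed before launching via the companion .sh launchers.
# Basenames are the table's prefixes; verify against model_training/.
for step in \
    00_pre_process_data 01_format_data 02_bart_tokenizer \
    03_tokenize_genes 04_cdr3_tokenizer 05_tokenize_subregions; do
  echo "python model_training/${step}.py --project_dir \"\$CRAFT_PROJECT_DIR\""
done
```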

ocscc_analyses/

Analyses of the oral cavity squamous cell carcinoma (OCSCC) immunotherapy cohort (Luoma et al., 18 patients with neoadjuvant checkpoint blockade).

| Prefix | Description |
|--------|-------------|
| `00_*` | Clone frequency computation, amino acid indexing, sample indexing |
| `01_craft_*` | CRAFT preprocessing, tokenization, embedding extraction, and pooling |
| `02_tcrbert_*` | TCR-BERT embedding extraction |
| `03_esm_*` | ESM-1b embedding extraction |
| `04_pca_*` | PCA dimensionality reduction |
| `05_metrics_*` | Dispersion and longitudinal repertoire metrics |
| `08_sceptr_*` | SCEPTR embedding extraction |
| `09_giana_*` | GIANA embedding extraction |

Jupyter notebooks:

  • dispersion_metrics_hnscc.ipynb — Visualization of repertoire dispersion metrics (Figure 2)
  • longitudinal_metrics_hnscc.ipynb — Visualization of longitudinal repertoire remodeling (Figure 2)

mcpas_gbm_analyses/

Analyses for the McPAS-TCR antigen specificity benchmark (Figure 4) and the GBM single-cell immunotherapy cohort (Ling et al., oncolytic HSV-1 therapy; Figure 3).

Embedding pipelines (prefixed 01_ through 09_) follow the same structure as ocscc_analyses/ for CRAFT, TCR-BERT, ESM, SCEPTR, and GIANA, with _nsclc variants for cross-cohort evaluation; 01_craft_02b_tokenize_subregions provides an alternative tokenization.

IgBLAST processing (igblast_*): single-cell TCR realignment using IgBLAST with IMGT germline references, with sample-splitting scripts for each cohort (GBM, NSCLC, MIRA, TNBC, Yost).

GBM single-cell analyses (gbm_*):

| Script | Description |
|--------|-------------|
| `gbm_01*` | Baseline centroid shift for cell-type compartments |
| `gbm_02*` | Baseline shift for reactivity-stratified populations |
| `gbm_03_enrichment_tests` | Binomial tail enrichment across compartments and timepoints |
| `gbm_04-06_viz_*` | Visualization notebooks for cell-type shifts, reactivity, and enrichment (Figure 3) |
| `gbm_07_shift_labels` | Displacement-based clonotype labeling |
| `gbm_08_gex_de` | Differential gene expression (Wilcoxon rank-sum); R script using optparse |
| `gbm_09_viz_gex` | Gene expression visualization (Figure 3d) |
| `gbm_fig_*` | R scripts for final figure assembly (ridgeplots, binomial heatmaps, DE plots) |

McPAS-TCR analyses (mcpas_*):

  • mcpas_compute_metrics — Silhouette scores, kNN class-consistency, centroid clustering
  • mcpas_antigen_metrics_all_methods.ipynb — Benchmark comparison across all embedding methods (Figure 4)
  • mcpas_antigen_metrics_no_curr.ipynb — Ablation analysis

Requirements

  • Python ≥ 3.8 (manuscript was produced on Python 3.8.6; tested up to 3.10)
  • R 4.3.0 (for GBM figure generation and differential expression)
  • PyTorch (CPU is sufficient for inference; training requires CUDA)
  • IgBLAST (for single-cell TCR realignment)
  • IMGT germline reference databases

The Python dependencies are split into:

  • requirements.txt — curated minimal install for the CRAFT core (training, inference, ocscc_analyses, mcpas_gbm_analyses CRAFT/IgBLAST/GBM scripts).
  • requirements-cluster-freeze.txt — verbatim pip freeze of the JHU cluster venv used to produce the manuscript results, for exact reproducibility.

The benchmark-comparison scripts (03_esm_*, 08_sceptr_*, 09_giana_*) require their own dedicated environments because their upstream dependencies (TCRembedding, sceptr, tidytcells) conflict. Each .sh launcher tags its venv-activation line with # EDIT so reproducers can swap in the right environment per method.

R deps are listed in r-requirements.txt.

Running on a SLURM cluster

The .sh launchers under each analysis folder are SLURM batch files configured for the JHU/Rockfish cluster. To run them on a different cluster:

  1. Edit the lines tagged # EDIT: cluster-specific (-A account, --partition=, --qos=).
  2. Uncomment and update the # source /path/to/your/venv/bin/activate line in the scaffolding header.
  3. export CRAFT_PROJECT_DIR=/path/to/your/project_root (or pass --project_dir directly to each script).
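For bulk edits across many launchers, the tagged lines can be rewritten with sed. A hedged sketch (the function name and the account/partition/qos values are placeholders, and the exact `#SBATCH` spellings should be checked against the launcher you are editing):

```shell
# Rewrite the cluster-specific #SBATCH lines of a launcher in place,
# keeping a .bak copy of the original. Values are site-specific.
retarget_launcher() {
  local file="$1" account="$2" partition="$3" qos="$4"
  sed -i.bak \
      -e "s/^#SBATCH -A .*/#SBATCH -A ${account}/" \
      -e "s/^#SBATCH --partition=.*/#SBATCH --partition=${partition}/" \
      -e "s/^#SBATCH --qos=.*/#SBATCH --qos=${qos}/" \
      "$file"
}

# Usage (launcher path is illustrative):
#   retarget_launcher ocscc_analyses/some_launcher.sh my_account shared normal
```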

Data availability

  • Training data: Emerson et al. — 666 healthy donors, 131M productive TCR beta-chain rearrangements (Adaptive immunoSEQ).
  • OCSCC cohort: Luoma et al. — 18 patients with neoadjuvant checkpoint blockade (NCT02919683).
  • GBM cohort: Ling et al. — recurrent glioblastoma patients with oncolytic HSV-1 therapy (NCT03152318).
  • McPAS-TCR: 5,279 pathology-associated TCRs with antigen annotations.

Citation

If you use CRAFT in your research, please cite:

Balan A, Elhanati Y, Meza Landeros KE, et al. A generative reference
grammar of healthy TCR repertoires reveals cancer-associated immune
remodeling. (2026).

The full author list, affiliations, and machine-readable citation metadata are in CITATION.cff. The accompanying software release is archived on Zenodo at DOI 10.5281/zenodo.19891746.

The manuscript DOI and bioRxiv preprint URL will be added here once available.

License

MIT — see LICENSE.

Copyright © 2026 Karchin Lab, Johns Hopkins University.
