Code accompanying the manuscript:
*A generative reference grammar of healthy TCR repertoires reveals cancer-associated immune remodeling*, Balan et al. (2026).
CRAFT is a conditional sequence-to-sequence transformer built on the BART architecture that learns the generative grammar of T cell receptor (TCR) beta-chain recombination. Trained on 131 million productive rearrangements from 666 healthy donors, CRAFT encodes V, D, and J gene usage alongside nucleotide-level CDR3 structure to produce biologically grounded TCR embeddings via teacher-forced inference.
```bash
# 1. Clone the repo and install the craft package in editable mode.
git clone https://github.com/KarchinLab/CRAFT.git
cd CRAFT
pip install -e .
pip install -r requirements.txt

# 2. Point CRAFT at your project directory.
export CRAFT_PROJECT_DIR=$PWD

# 3. Download the trained model checkpoint and tokenizers from Zenodo
#    (DOI 10.5281/zenodo.19891746) and unpack them under $CRAFT_PROJECT_DIR
#    so the on-disk layout is:
#      tokenizers/cdr3_tokenizer.json
#      tokenizers/bart_custom_tokenizer/
#      models/06_model_phase3c/final_model.pth
#    See the "Download model weights" section below for the exact recipe.

# 4. Run the quickstart with a 10-row example file.
python examples/quickstart.py --input examples/example_adaptive_immunoseq.tsv
```

See examples/README.md for a tour of the three accepted input formats (Adaptive immunoSEQ, AIRR-C, post-IgBLAST).
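For programmatic use, the sketch below shows the general shape of the embedding workflow. The module and function names come from the package layout described later in this README, but the call signatures are assumptions; treat `examples/quickstart.py` as the authoritative entry point.

```python
# Hypothetical sketch of the embedding workflow. Module and function names
# are taken from the package layout (craft/embeddings.py, craft/datasets.py,
# craft/model.py); the call signatures below are illustrative guesses.
import os
import pandas as pd

from craft.embeddings import format_tcr_dataframe  # listed under craft/embeddings.py

project_dir = os.environ["CRAFT_PROJECT_DIR"]

# Normalize a raw Adaptive immunoSEQ export to CRAFT's input schema.
raw = pd.read_csv("examples/example_adaptive_immunoseq.tsv", sep="\t")
formatted = format_tcr_dataframe(raw)  # assumed single-DataFrame signature

# quickstart.py then tokenizes each rearrangement, restores the checkpoint at
# models/06_model_phase3c/final_model.pth, and runs teacher-forced inference
# to emit one embedding per clonotype.
```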
The trained CRAFT checkpoint, the BART custom tokenizer, and the CDR3 BPE tokenizer are deposited on Zenodo:
The deposition is a single zip file (`craft_zenodo_upload.zip`, ~591 MB compressed), packaged this way so the `tokenizers/bart_custom_tokenizer/` subdirectory survives Zenodo's upload, which would otherwise flatten loose files. Download and unpack it from your project root:
```bash
cd $CRAFT_PROJECT_DIR
curl -L "https://zenodo.org/records/19891746/files/craft_zenodo_upload.zip" -o craft_zenodo_upload.zip
unzip craft_zenodo_upload.zip
```

The unpacked layout is exactly what every downstream script expects by default:
```text
$CRAFT_PROJECT_DIR/
├── tokenizers/
│   ├── cdr3_tokenizer.json
│   └── bart_custom_tokenizer/
│       ├── added_tokens.json
│       ├── merges.txt
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── vocab.json
└── models/
    └── 06_model_phase3c/
        └── final_model.pth
```
So most invocations only need `--project_dir "$CRAFT_PROJECT_DIR"`.
```text
CRAFT/
├── craft/                           # Installable Python package (model, datasets, training, inference)
│   ├── model.py                     # GeneBARTModelComposite, GeneBARTModelSingleHead, AttentionPooling
│   ├── datasets.py                  # GeneDatasetTrain, GeneDatasetInference, GeneDatasetPerplexity
│   ├── training.py                  # Training loop & utilities
│   ├── embeddings.py                # Inference utilities (incl. format_tcr_dataframe)
│   ├── plotting.py                  # Plotting helpers
│   ├── sctcr.py                     # Single-cell TCR helpers
│   └── igblast.py                   # IgBLAST junction analyzer
├── model_training/                  # Training pipeline scripts
├── ocscc_analyses/                  # OCSCC immunotherapy cohort analyses (Figure 2)
├── mcpas_gbm_analyses/              # McPAS antigen benchmark + GBM single-cell analyses (Figures 3, 4)
├── examples/                        # Three-format input examples + quickstart.py
├── .github/workflows/               # CI smoke test
├── pyproject.toml                   # Package metadata
├── requirements.txt                 # Curated minimal Python deps (20 pinned)
├── requirements-cluster-freeze.txt  # Full bert_pretrain venv freeze (271 deps)
├── r-requirements.txt               # CRAN packages used by gbm_*.R
├── CITATION.cff                     # Citation metadata
├── LICENSE                          # MIT
└── README.md
```
Scripts for training the CRAFT model, organized as a numbered pipeline. Each step has a `.py` script and a companion `.sh` SLURM launcher. All scripts take `--project_dir` as a required CLI flag; the `06_phase*` training scripts additionally expose every hyperparameter as a CLI flag with the manuscript value as the default.
| Step | Script prefix | Description |
|---|---|---|
| 0 | `00_pre_process_data` | Preprocess raw Adaptive immunoSEQ data |
| 1 | `01_format_data` | Format data for model input |
| 2 | `02_bart_tokenizer` | Train the BART tokenizer (use `--imgt_gene_path` for the IMGT human TRB CSV) |
| 3 | `03_tokenize_genes` | Tokenize V, D, J genes using the IMGT hierarchical scheme |
| 4 | `04_cdr3_tokenizer` | Train the CDR3 nucleotide tokenizer |
| 5 | `05_tokenize_subregions` | Tokenize CDR3 subregions (V-D and D-J junctions) |
| 6 | `06_phase{1,2,3}*_train_model` | Multi-phase curriculum training (phases 1-3 with variants) |
| 8 | `08_test_decoding_greedy` | Greedy decoding evaluation |
| 11 | `11a-d_*` | Model quality metrics: CDR3 length distributions, perplexity, V/J length distributions, k-mer composition |
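For example, a phase-1 run might be launched as below. Only `--project_dir` is required; the exact script filename is an assumption (the `06_phase{1,2,3}*` prefix covers several variants), and hyperparameter flags can be omitted entirely to keep the manuscript defaults.

```bash
# Hypothetical phase-1 training invocation. The filename under
# model_training/ is illustrative; check the repo for the variant you need.
# Omitting hyperparameter flags keeps the manuscript values as defaults.
python model_training/06_phase1_train_model.py \
    --project_dir "$CRAFT_PROJECT_DIR"
```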
Analyses of the oral cavity squamous cell carcinoma (OCSCC) immunotherapy cohort (Luoma et al., 18 patients with neoadjuvant checkpoint blockade).
| Prefix | Description |
|---|---|
| `00_*` | Clone frequency computation, amino acid indexing, sample indexing |
| `01_craft_*` | CRAFT preprocessing, tokenization, embedding extraction, and pooling |
| `02_tcrbert_*` | TCR-BERT embedding extraction |
| `03_esm_*` | ESM-1b embedding extraction |
| `04_pca_*` | PCA dimensionality reduction |
| `05_metrics_*` | Dispersion and longitudinal repertoire metrics |
| `08_sceptr_*` | SCEPTR embedding extraction |
| `09_giana_*` | GIANA embedding extraction |
Jupyter notebooks:
- `dispersion_metrics_hnscc.ipynb` — Visualization of repertoire dispersion metrics (Figure 2)
- `longitudinal_metrics_hnscc.ipynb` — Visualization of longitudinal repertoire remodeling (Figure 2)
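As a conceptual illustration of the `04_pca_*` step, the sketch below reduces a matrix of pooled embeddings with scikit-learn; the input file name and component count are placeholders, not the scripts' actual settings.

```python
# Conceptual sketch of the 04_pca_* dimensionality-reduction step.
# The .npy path and n_components value are placeholders for illustration.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.load("pooled_craft_embeddings.npy")  # shape: (n_tcrs, d_model)
pca = PCA(n_components=50, random_state=0)
reduced = pca.fit_transform(embeddings)
print(f"retained variance: {pca.explained_variance_ratio_.sum():.3f}")
```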
Analyses for the McPAS-TCR antigen specificity benchmark (Figure 4) and the GBM single-cell immunotherapy cohort (Ling et al., oncolytic HSV-1 therapy; Figure 3).
Embedding pipelines (prefixed `01_` through `09_`): same structure as `ocscc_analyses/` for CRAFT, TCR-BERT, ESM, SCEPTR, and GIANA, with `_nsclc` variants for cross-cohort evaluation. `01_craft_02b_tokenize_subregions` is an alternative tokenization step.
IgBLAST processing (`igblast_*`): single-cell TCR realignment using IgBLAST with IMGT germline references, with sample-splitting scripts for each cohort (GBM, NSCLC, MIRA, TNBC, Yost).
GBM single-cell analyses (`gbm_*`):

| Script | Description |
|---|---|
| `gbm_01*` | Baseline centroid shift for cell-type compartments |
| `gbm_02*` | Baseline shift for reactivity-stratified populations |
| `gbm_03_enrichment_tests` | Binomial tail enrichment across compartments and timepoints |
| `gbm_04-06_viz_*` | Visualization notebooks for cell-type shifts, reactivity, and enrichment (Figure 3) |
| `gbm_07_shift_labels` | Displacement-based clonotype labeling |
| `gbm_08_gex_de` | Differential gene expression (Wilcoxon rank-sum) — R script using optparse |
| `gbm_09_viz_gex` | Gene expression visualization (Figure 3d) |
| `gbm_fig_*` | R scripts for final figure assembly (ridgeplots, binomial heatmaps, DE plots) |
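The core statistic behind `gbm_03_enrichment_tests` is a one-sided binomial tail probability. A minimal sketch, with counts and background rate invented purely to show the shape of the test:

```python
# Minimal sketch of a one-sided binomial tail test, the statistic applied
# per compartment and timepoint. k, n, and p0 are invented for illustration.
from scipy.stats import binomtest

k = 42      # e.g. shifted clonotypes observed in one compartment/timepoint
n = 500     # clonotypes tested in that compartment
p0 = 0.05   # rate expected under the null

result = binomtest(k, n, p0, alternative="greater")
print(f"P(X >= {k} | n={n}, p0={p0}) = {result.pvalue:.3g}")
```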
McPAS-TCR analyses (`mcpas_*`):

- `mcpas_compute_metrics` — Silhouette scores, kNN class-consistency, centroid clustering
- `mcpas_antigen_metrics_all_methods.ipynb` — Benchmark comparison across all embedding methods (Figure 4)
- `mcpas_antigen_metrics_no_curr.ipynb` — Ablation analysis
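The sketch below illustrates the first two metrics from `mcpas_compute_metrics` with scikit-learn; the embedding matrix, antigen labels, and neighborhood size are stand-ins, not values from the benchmark.

```python
# Sketch of two of the metrics in mcpas_compute_metrics: silhouette score
# over antigen labels, and kNN class consistency (the fraction of each
# point's k nearest neighbors that share its label). X and y are stand-ins.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))    # stand-in TCR embeddings
y = rng.integers(0, 5, size=200)  # stand-in antigen labels

sil = silhouette_score(X, y)

# kNN class consistency, excluding each point itself (neighbor 0).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
consistency = (y[idx[:, 1:]] == y[:, None]).mean()

print(f"silhouette={sil:.3f}  kNN consistency={consistency:.3f}")
```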
- Python ≥ 3.8 (manuscript was produced on Python 3.8.6; tested up to 3.10)
- R 4.3.0 (for GBM figure generation and differential expression)
- PyTorch (CPU is sufficient for inference; training requires CUDA)
- IgBLAST (for single-cell TCR realignment)
- IMGT germline reference databases
The Python deps are split into two files:

- `requirements.txt` — curated minimal install for the CRAFT core (training, inference, `ocscc_analyses/`, and the CRAFT/IgBLAST/GBM scripts in `mcpas_gbm_analyses/`).
- `requirements-cluster-freeze.txt` — a verbatim `pip freeze` of the JHU cluster venv used to produce the manuscript results, for exact reproducibility.
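Concretely, pick one of the two installs (both files ship at the repo root):

```bash
# Curated minimal install (20 pinned packages).
pip install -r requirements.txt

# Exact reproduction of the manuscript environment (271 pinned packages).
pip install -r requirements-cluster-freeze.txt
```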
The benchmark-comparison scripts (`03_esm_*`, `08_sceptr_*`, `09_giana_*`) require their own dedicated environments due to conflicting upstream dependencies (TCRembedding, sceptr, tidytcells). Their `.sh` launchers tag the venv-activation line with an `# EDIT` comment so reproducers can swap in the right environment per method.
R deps are listed in `r-requirements.txt`.
The `.sh` launchers under each analysis folder are SLURM batch files configured for the JHU/Rockfish cluster. To run them on a different cluster:

- Edit the lines tagged `# EDIT: cluster-specific` (`-A` account, `--partition=`, `--qos=`).
- Uncomment and update the `# source /path/to/your/venv/bin/activate` line in the scaffolding header.
- `export CRAFT_PROJECT_DIR=/path/to/your/project_root` (or pass `--project_dir` directly to each script).
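A hypothetical scaffolding header showing where those tagged lines sit; the account, partition, QOS, and paths are placeholders:

```bash
#!/bin/bash
#SBATCH -A my_account          # EDIT: cluster-specific
#SBATCH --partition=shared     # EDIT: cluster-specific
#SBATCH --qos=normal           # EDIT: cluster-specific

# source /path/to/your/venv/bin/activate   # EDIT: uncomment and point at your venv
export CRAFT_PROJECT_DIR=/path/to/your/project_root
```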
- Training data: Emerson et al. — 666 healthy donors, 131M productive TCR beta-chain rearrangements (Adaptive immunoSEQ).
- OCSCC cohort: Luoma et al. — 18 patients with neoadjuvant checkpoint blockade (NCT02919683).
- GBM cohort: Ling et al. — recurrent glioblastoma patients with oncolytic HSV-1 therapy (NCT03152318).
- McPAS-TCR: 5,279 pathology-associated TCRs with antigen annotations.
If you use CRAFT in your research, please cite:
Balan A, Elhanati Y, Meza Landeros KE, et al. A generative reference
grammar of healthy TCR repertoires reveals cancer-associated immune
remodeling. (2026).
The full author list, affiliations, and machine-readable citation metadata are in CITATION.cff. The accompanying software release is archived on Zenodo at DOI 10.5281/zenodo.19891746.
The manuscript DOI and bioRxiv preprint URL will be added here once available.
MIT — see LICENSE.
Copyright © 2026 Karchin Lab, Johns Hopkins University.