
FaceX

Full face pipeline — detect, mesh, recognize, anti-spoof — in pure WebAssembly. Trained from scratch. No cloud, no Python, no server.


Full face stack that runs entirely in the browser. Detection, 98-point landmarks, dense 3D mesh, recognition, and passive anti-spoof — all WebAssembly, zero server, ~17 MB of encrypted weights.

🎬 Live Demo → — open in a Chromium browser, press Start camera, try all modes.
📚 Docs in Wiki → — browser quickstart, training recipes, nn2 architecture, encrypted weights, comparison vs alternatives.

[Figure: FaceX pipeline]

Everything in the demo is trained by us

| Component | Status | Size | Source |
|---|---|---|---|
| Face detector | ours | 401 KB | YuNet-style FCOS, WIDER FACE |
| 98-point landmark | ours | 1.1 MB | WFLW |
| 576-point 3D mesh | ours | 5.6 MB | MediaPipe distillation |
| Recognition (4 sizes) | ours | 0.8–8.4 MB | MobileFaceNet + ArcFace on MS1M, LFW 95.62 → 99.07% (10-fold mean) |
| Anti-spoof | Apache 2.0 | 2 × 1.7 MB | MiniFASNet (MinivisionAI Silent-Face) |

Weights ship as AES-256-GCM ciphertext and are decrypted in the browser via WebCrypto. This is not DRM — a determined attacker can dump the decrypted bytes from the WASM heap. What it does buy you: friction against casual scraping, per-customer key revocation for SaaS deployments, and an audit trail at the key-issuing endpoint. See the wiki for the threat model and Express / FastAPI integration recipes.


The full surveillance stack — no Python, no FFmpeg, no GPU

FaceX is one piece of a larger pure-C stack we built for IP-camera workloads. Every component is hand-written, zero-dependency, flashable to firmware:

| Component | What it does | Size | Speed | Replaces |
|---|---|---|---|---|
| NexusDecode | H.264 + H.265 decoder, RTSP client | 184 KB | 6,300 fps, 46× FFmpeg | libav / FFmpeg |
| NexusEncode | H.265/HEVC encoder | ~250 KB | x265-medium quality, 131 fps | x265 |
| NXV codec | Surveillance-tuned video format | 121 KB | 3× smaller than H.265, instant seek, change-map | H.265 + custom container |
| nn2 | YOLOv8 + MiniFASNet inference engine | 520 KB | 8.5 ms @ 320, 1.5–2× ONNX RT | onnxruntime |
| FaceX (this repo) | Detect + landmarks + embed + spoof | 148 KB native / 17 MB WASM | 3 ms/face | dlib, FaceNet, InsightFace |

Pipeline numbers (one Intel i5 CPU):

  • Decode 30 RTSP streams + run YOLO detection on each: 0.56 ms/frame average → 70 IP cameras on one CPU core with motion-gating + Kalman tracking.
  • Tiered storage: 70 cams × 90 days = 49 TB → 3.3 TB (15× savings) with NXV + selective bitstream-only archiving.

Why it matters:

  • Flashable — entire NVR stack fits in <2 MB of binary, ARM/x86/RISC-V, no shared libraries
  • No FFmpeg — no GPL contamination, no surface for codec CVEs, no 28 MB of libav .so files
  • Embedded-ready — runs on $30 SoCs (Allwinner, Rockchip, NXP i.MX), 25 cameras on 27% CPU
  • Standalone — every piece can be used alone or combined: decoder → motion gate → detector → tracker → recognizer → archive

Where it runs

This isn't x86-only: the same code targets multiple device classes:

| Target | Status | What's used |
|---|---|---|
| Browser (any modern Chromium/Firefox/Safari) | ✅ shipping | onnxruntime-web + AES-256-GCM weight decryption (live demo) |
| Linux / macOS / Windows x86-64 | ✅ shipping | AVX2 + AVX-512 + VNNI runtime dispatch |
| Apple Silicon (M1–M4) | ✅ in PR #3 | NEON + Accelerate (AMX) + SME on M4+ + Core ML / ANE bridge |
| ARM Linux / Android (AArch64) | ✅ in PR #3 | Hand-written NEON kernels for FP32 GEMM |
| NXP i.MX 8 / 93 / 95 NPU | 🛠️ draft (#3) | Ethos-U65 / VxDelegate / XNNPACK |
| Espressif ESP32-P4 (RISC-V + PIE 128) | 🛠️ draft (#3) | ESP-IDF component + MIPI-CSI camera example |
| Firmware / bare-metal MCU | 🛠️ in progress | No libc deps in core; PReLU/GEMM/Conv kernels fit in 64 KB SRAM |

Decoder + encoder are pure C99 with x86 SIMD today; ARM/NEON backports for NexusDecode are next.

// Native C: 3 ms per face
#include "facex.h"
FaceX* fx = facex_init("facex_xs.bin", NULL);
float emb[512];
facex_embed(fx, face_112x112, emb);
float sim = facex_similarity(emb_a, emb_b);   // > 0.3 = same person

# Or run the live browser demo locally
git clone https://github.com/facex-engine/facex
cd facex/wasm && python -m http.server 8000
# open http://127.0.0.1:8000/demo_mesh.html

What can you build with this?

  • Identity verification (KYC) — "is this the same person?" from selfie + ID photo, no cloud round-trip
  • Face login — unlock apps by face, works offline, no data leaves the device
  • Access control — doors, gates, turnstiles on edge hardware without GPU
  • Proctoring — verify exam takers are who they claim to be
  • Smart cameras — recognize known faces at 300+ faces/sec on a single CPU core
  • Banking / fintech onboarding — passive liveness + face match in the browser, GDPR-friendly by construction
  • In-store kiosks — VIP/loyalty recognition at the till, runs on a $30 SoC

Why FaceID with FaceX instead of cloud APIs

You're typically choosing between AWS Rekognition / Azure Face / Google Vision / Paravision / FaceTec ZoOm. Cost comparison for a 100 K-user app doing one face-match per session per day:

| Provider | Price per 1k matches | Monthly cost (100 K MAU × 1/day) | Sends user faces to | Latency |
|---|---|---|---|---|
| AWS Rekognition CompareFaces | $1.00 | $3,000 /mo | AWS us-east | 250–500 ms |
| Azure Face API verify | $1.00–$1.50 | $3,000–$4,500 /mo | Azure region | 200–400 ms |
| Google Vision FACE_DETECTION | $1.50 | $4,500 /mo | Google datacenter | 200–400 ms |
| FaceTec ZoOm | per-seat licensed | $10 K+ /year | Their SDK, mixed | 1–3 s (active) |
| FaceX in your app | $0 | $0 | Nobody — stays in the user's browser | 20–30 ms |

The savings are nice. The bigger story is compliance: when frames never leave the device, you're outside GDPR Art. 9 (biometric) / HIPAA / Russia's 152-ФЗ / KZ's data localization rules by construction. No DPIA, no DPA renegotiations, no "where are the photos stored" audit questions.

Where it's been deployed

We've shipped this stack into IP-camera NVRs, retail kiosks, and KYC flows for fintech clients. If you're evaluating it for production, the live demo is the fastest way to see what it can do — then open an issue or email me with your use case and I'll help you scope.

How it works

Full pipeline, every step trained or written by us:

  1. Detect — own FCOS-style face detector (100K params, trained from scratch on WIDER FACE; 401 KB ONNX).
  2. Align — 98-point WFLW landmark ConvNet (1.15M params; 1.1 MB ONNX).
  3. 3D mesh — 576-point face mesh (5.6 MB ONNX), distilled from MediaPipe FaceMesh with our 98 WFLW anchors driving the warp.
  4. Recognize — MobileFaceNet + ArcFace, four size variants (nano 0.8 MB · tiny 1.8 MB · standard 3.9 MB · xs 8.4 MB), LFW 95.6 → 99.07%.
  5. Anti-spoof — MiniFASNet ensemble (V2 @ 2.7 + V1SE @ 4.0), MinivisionAI Apache 2.0. Also ported to our nn2 engine — 2× faster than ONNX Runtime on the same CPU.

Two modes:

  • Browser: onnxruntime-web + AES-256-GCM encrypted weights, full pipeline in ~25 ms/frame, no server.
  • Native: pure C, 3 ms per face, INT8 + AVX-512, beats ONNX Runtime on the same hardware.

Two years of optimization: handwritten AVX2 / AVX-512 / NEON kernels, INT8 GEMM, cache-tuned layout, weight-encryption with WebCrypto handoff to onnxruntime — every millisecond and every kilobyte fought for.


Benchmarks

Measured on Intel i5-11500 (6 cores, AVX-512 + VNNI):

Speed — recognition (our MobileFaceNet xs)


| Engine | Median | Min | vs FaceX |
|---|---|---|---|
| FaceX (native nn2) | 3.0 ms | 2.87 ms | -- |
| ONNX Runtime 1.23 | 3.9 ms | 3.18 ms | 1.30× slower |
| InsightFace (R34) | 17 ms | -- | 5.7× slower |
| FaceNet (PyTorch) | 30 ms | -- | 10× slower |
| dlib | 50+ ms | -- | 17× slower |

Speed — anti-spoof (MiniFASNet V2+V1SE ensemble)

Same model, ported to our nn2 C engine (Apache 2.0, source in nn2/):

| Engine | Single model | Ensemble | Speedup |
|---|---|---|---|
| nn2 | 0.70 ms | 1.43 ms | -- |
| ONNX Runtime 1.23 | 1.33 ms | 2.92 ms | 2.03× slower |

Byte-identical predictions to PyTorch / ONNX on the same input.

Accuracy — recognition (LFW verification)

All numbers are the mean accuracy across 10-fold cross-validation (InsightFace-style: tune the threshold on 9 training folds, evaluate on the 1 held-out fold, repeat 10 times). The ± column is the standard deviation across folds. Input must be 112×112 ArcFace-aligned via a 5-point similarity transform — running on un-aligned crops drops accuracy by ~25 points. The eval script is training/scripts/lfw_eval.py.

| Variant | Params | LFW mean | ± std | ONNX size | Speed (CPU) |
|---|---|---|---|---|---|
| nano | 0.20 M | 95.62% | 1.11% | 0.8 MB | 1.4 ms |
| tiny | 0.45 M | 96.85% | 0.87% | 1.8 MB | 2.1 ms |
| standard | 0.93 M | 98.25% | 0.68% | 3.9 MB | 2.6 ms |
| xs | 2.07 M | 99.07% | 0.40% | 8.4 MB | 3.0 ms |

Accuracy — face detection (WIDER FACE val)

Our YuNet-style FCOS detector, 100 K params, trained from scratch:

| Metric | Score |
|---|---|
| Best recall @ IoU 0.5 (all faces incl. tiny) | 27.5% |
| Recall on faces ≥ 32 px | ~85% |
| Recall on webcam-distance faces | ~95% |
| ONNX size | 401 KB |
| Latency on 320×320 input | < 1 ms (WASM) |

Footprint


| Metric | FaceX | ONNX Runtime |
|---|---|---|
| Library size | 148 KB | 28 MB |
| Total deploy | 7 MB | 157 MB |
| Dependencies | none | Python + onnxruntime |
| Cold start | ~100 ms | ~350 ms |

Quick start

C

#include "facex.h"

int main() {
    // Load engine (one-time, ~100ms)
    FaceX* fx = facex_init("facex_xs.bin", NULL);

    // Compute embedding (3ms per call)
    float face[112 * 112 * 3];  // RGB, HWC, [-1, 1]
    float embedding[512];
    facex_embed(fx, face, embedding);

    // Compare two faces (emb_a, emb_b: embeddings from two facex_embed calls)
    float sim = facex_similarity(emb_a, emb_b);
    // sim > 0.3 → same person

    facex_free(fx);
}
gcc -O3 -march=native -Iinclude -o myapp myapp.c -L. -lfacex -lm -lpthread

Go

import "github.com/facex-engine/facex/go/facex"

ff, _ := facex.New(facex.Config{
    Exe:     "./facex-cli",
    Weights: "./facex_xs.bin",
})
defer ff.Close()

embedding, _ := ff.Embed(rgbImage)
sim := facex.CosSim(embA, embB)

CLI (any language via stdin/stdout)

# Pipe mode: reads 112x112x3 float32 HWC, writes 512 float32
./facex-cli weights.bin --server < faces.raw > embeddings.raw

Browser (via onnxruntime-web + AES decryption)

<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/ort.min.js"></script>
<script>
  // Fetch encrypted weights, decrypt in WebCrypto, hand bytes to ORT.
  const buf = new Uint8Array(await (await fetch('facex_xs.enc')).arrayBuffer());
  const iv = buf.subarray(0, 12), data = buf.subarray(12);
  // KEY_BYTES: your 32-byte AES key, e.g. fetched from a licensing endpoint
  const key = await crypto.subtle.importKey('raw', KEY_BYTES,
                                            {name:'AES-GCM'}, false, ['decrypt']);
  const onnx = new Uint8Array(await crypto.subtle.decrypt({name:'AES-GCM', iv}, key, data));
  const sess = await ort.InferenceSession.create(onnx, { executionProviders: ['wasm'] });
  // Inference is 100% client-side. Frames never leave the device.
</script>

Full browser pipeline (detect + 576pt mesh + recognize + anti-spoof) is live at https://facex-engine.github.io/facex/demo/ — open it, press Start camera, try the picker.


Build

make            # builds libfacex.a + facex-cli
make example    # builds and runs example
make encrypt    # builds weight encryption tool

Requirements: GCC with AVX2 support. Nothing else.

Cross-compile for Linux (from WSL)

gcc -O3 -march=x86-64-v3 -mavx2 -mfma -DFACEX_LIB -c src/*.c
ar rcs libfacex.a *.o

API

// Initialize engine. Returns NULL on error.
// license_key: NULL for plain weights, or key string for AES-256 encrypted.
FaceX* facex_init(const char* weights_path, const char* license_key);

// Compute 512-dim face embedding from 112x112 RGB image.
// rgb_hwc: float32 array [112][112][3], values in [-1, 1].
// embedding: output buffer, 512 floats (L2-normalized).
int facex_embed(FaceX* fx, const float* rgb_hwc, float embedding[512]);

// Cosine similarity between two embeddings. Range [-1, 1].
float facex_similarity(const float emb1[512], const float emb2[512]);

// Free engine resources.
void facex_free(FaceX* fx);

// Version string.
const char* facex_version(void);

Architecture (recognition, MobileFaceNet xs)

Input: 112×112 RGB float32 in [-1, 1]
    ↓
  Stem: Conv 3×3 s=2 → 64 ch, PReLU
    ↓
  DW Stem: DW 3×3 s=1 → 64 ch, PReLU
    ↓
  Stage 1: 5× Inverted-Residual (t=2, c=64, first s=2)
    ↓
  Stage 2: 1× Inverted-Residual (t=4, c=128, s=2)
    ↓
  Stage 3: 6× Inverted-Residual (t=2, c=128, s=1)
    ↓
  Stage 4: 1× Inverted-Residual (t=4, c=128, s=2)
    ↓
  Stage 5: 2× Inverted-Residual (t=2, c=128, s=1)
    ↓
  Conv 1×1 → 512 ch, PReLU
    ↓
  GDConv DW 7×7 s=1 (linear-GDC) → 512×1×1
    ↓
  1×1 conv → 512-d embedding, BN, L2-norm
    ↓
Output: 512-dim unit embedding

Engine internals:

  • Pure C99 + SIMD intrinsics (AVX2, FMA, AVX-512, VNNI)
  • INT8 quantized GEMM with vpmaddubsw (AVX2) / vpdpbusd (VNNI)
  • FP32 packed column-panel MatMul (NR = 8 AVX2, NR = 16 AVX-512)
  • Custom thread pool with work-stealing (WaitOnAddress / futex)
  • Pre-packed weights at load time for cache-optimal access
  • BN folded into preceding Conv at export time
  • AES-256-GCM weight encryption with WebCrypto handoff in the browser, AES-256-CTR with hardware binding for native deployments
  • Fully shared op library between recognition, anti-spoof (MiniFASNet), and YOLOv8 detection (nn2)

Weight encryption

For commercial deployment with IP protection:

# Encrypt weights (binds to target machine hardware)
./facex-encrypt encrypt weights.bin weights.enc "LICENSE-KEY"

# Load encrypted weights
FaceX* fx = facex_init("weights.enc", "LICENSE-KEY");

Wrong key or different machine → load fails. Original weights never touch disk in plaintext on the target machine.


Integration paths

| Language | Method | Latency |
|---|---|---|
| C / C++ | libfacex.a + facex.h | 3 ms (native) |
| Browser | facex.wasm (48 KB) | 7 ms (WASM SIMD) |
| Go | go/facex subprocess | ~4 ms |
| Python | subprocess / ctypes | ~4 ms |
| Any | facex-cli --server stdin/stdout | ~4 ms |

Limitations

  • Native build — currently x86-64 (AVX2 / AVX-512 / VNNI). ARM NEON paths exist in nn2/src/gemm_neon.h; full ARM build script is on the roadmap, ESP32 / RISC-V PIE 128 next.
  • Browser pipeline — uses onnxruntime-web with WebCrypto-decrypted ONNX. ORT supports a WebGPU backend, but the demo doesn't use it yet; wiring it in should cut inference time by another 3–5×.
  • Anti-spoof is the only component we didn't train ourselves (MiniFASNet, Apache 2.0, MinivisionAI). Training a fully in-house anti-spoof model requires a commercial attack dataset, which we don't have.

Models

Every recognition / detection / landmark model in this repo was trained from scratch by us. Anti-spoof is the only third-party piece.

Recognition (our MobileFaceNet variants)

Standard MobileFaceNet (Chen et al. 2018) topology, width-scaled to four sizes, ArcFace head with the numerically-stable angle-addition margin, trained on MS1M-RefineV2 with bf16 autocast.

| Variant | Params | Width mult | Embedding dim | LFW |
|---|---|---|---|---|
| nano | 0.20 M | 0.36 | 256 | 95.62% |
| tiny | 0.45 M | 0.55 | 512 | 96.85% |
| standard | 0.93 M | 0.90 | 512 | 98.25% |
| xs | 2.07 M | 1.35 | 512 | 99.07% |

Face detector (ours)

YuNet-inspired, but FCOS-style anchor-free. MobileNetV2-lite backbone, 3 detection heads at strides 8 / 16 / 32, GIoU bbox loss + focal cls loss. 100 K params, 401 KB ONNX. Trained on WIDER FACE.

98-point landmarks (ours, WFLW)

MobileFaceNet-style backbone + dense head, 1.15 M params. Final NME on WFLW: 4.85% (full test set) / 5.95% (large-pose subset).

576-point 3D mesh (ours, MediaPipe distillation)

Same architecture as the 98-point model, but with Linear(256, 478*3) head — distilled from MediaPipe FaceMesh pseudo-labels with TPS-rendered supervision over our WFLW frontalised crops. Error: xy 0.54 px, z 0.51 (normalized) on held-out val. With 98 WFLW anchors driving the non-rigid warp, the rendered mesh has 576 visible points total.

Anti-spoof (MiniFASNet, Apache 2.0, MinivisionAI)

We don't train this — there's no commercial-friendly attack dataset publicly available. We port their two-model ensemble (V2 @ 2.7 + V1SE @ 4.0) into our nn2 inference engine and ship byte-identical predictions at 2× speed vs ONNX Runtime.


Repo layout

include/                — public C API (facex.h, facex_mfn.h, ...)
src/                    — recognition engine + AES weight crypto
nn2/                    — pure-C YOLO + MiniFASNet inference engine
                          (1.5–2× ONNX, Apache 2.0)
   src/                 — gemm, conv, ops, antispoof_ops, minifasnet
   include/             — public API headers
   tools/               — PyTorch → .bin converters
wasm/                   — browser demo (demo_mesh.html, encrypt tool)
   tools/encrypt_models.py — AES-256-GCM encrypt all .onnx
docs/demo/              — GitHub Pages live demo + encrypted weights
training/               — all training pipelines, datasets, exporters
   scripts/             — MobileFaceNet recognition (nano/tiny/standard/xs)
   landmark/            — 98-point WFLW
   landmark3d/          — 576-point MediaPipe distillation
   face_detect/         — own FCOS face detector trained on WIDER FACE
   antispoof/           — MiniFASNet integration
go/facex/               — Go binding (subprocess protocol)
python/facex/           — Python binding (ctypes)

FAQ

Q: Is it really faster than ONNX Runtime? A: Yes. Measured on the same CPU, same model, same input. FaceX median 3.0 ms vs ONNX Runtime median 3.9 ms. The gap comes from handwritten SIMD kernels that avoid framework overhead.

Q: What accuracy vs ArcFace-R100? A: Our xs (2 M params) hits 99.07% on LFW vs ArcFace-R100's 99.80%. That's a 0.73-point accuracy gap in exchange for a 50× smaller model and 10× faster inference.

Q: Can I use this commercially? A: Engine code is Apache 2.0. Our trained recognition, detection, landmark, and 3D-mesh weights are also Apache 2.0 — we own them. Only the anti-spoof component (MiniFASNet) is upstream Apache 2.0.

Q: Does it do face detection? A: Yes. We trained our own FCOS-style detector on WIDER FACE; it replaces YuNet in the browser demo and runs in <1 ms.

Q: Why ONNX in the browser instead of native WASM? A: We went both ways. nn2 ships a native C engine that is 1.5–2× faster than ORT. For the browser, onnxruntime-web gives us WebGPU, SIMD-WASM, and 3-line model swap without re-compiling. The encryption layer (WebCrypto → ORT byte stream) sits between the network and ORT, so the model bytes never hit the page as plaintext.


Citation

@software{facex2026,
  author  = {Atinov, Baurzhan},
  title   = {FaceX: Fast CPU Face Embedding Library},
  year    = {2026},
  url     = {https://github.com/facex-engine/facex}
}

License

Everything in this repo trained or written by us — code, recognition, landmarks, 3D mesh, face detector — is Apache License 2.0. Free for commercial use, attribution appreciated.

The only third-party component is MiniFASNet (anti-spoof), which is also Apache 2.0 from MinivisionAI Silent-Face-Anti-Spoofing.

For commercial licensing: bauratynov@gmail.com


Created by Baurzhan Atinov (Kazakhstan)
GitHub