# FaceTrackingKit

A Swift framework for real-time face tracking on iOS, built for research. Supports both ARKit (TrueDepth camera devices) and MediaPipe (any device with a front camera).

## Features
- Real-time blend shapes, face mesh vertices, gaze tracking, and light estimation
- FACS Action Units — 14 Action Unit intensities computed automatically from blend shapes
- Head pose — pitch, yaw, roll in radians (ARKit only)
- Event markers — timestamped labels for aligning face data with experimental stimuli
- Async stream API for live frame access
- Built-in session storage with CSV, JSON Lines, and HDF5 export
- Optional image and depth map capture
- MediaPipe model is downloaded and cached automatically
## Requirements
- iOS 17+
- Swift 6.0+
## Installation

Add FaceTrackingKit to your project via Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/digital-medicine/FaceTrackingKit", from: "0.1.0")
]
```

## Providers

**ARKit** — High accuracy; requires a Face ID device (iPhone/iPad with TrueDepth camera). Provides 52 blend shapes, gaze tracking, light estimation, and depth maps.

**MediaPipe** — Works on any iOS device with a front camera. Provides 51 blend shapes and 478 face landmarks. The model (~4 MB) is downloaded automatically on first use.
## Camera Permission

Add to your Info.plist:

```xml
<key>NSCameraUsageDescription</key>
<string>This app uses the camera for face tracking.</string>
```
## Quick Start

```swift
import FaceTrackingKit

// Create a tracker — pick one:
let tracker = FaceTracker(provider: .arKit())     // ARKit (Face ID devices)
let tracker = FaceTracker(provider: .mediaPipe()) // MediaPipe (any device)

// Start a session
try await tracker.start(participant: "P001")

// Mark experimental events
tracker.addEvent("stimulus_onset")

// Read frames in real time
for await frame in tracker.frames {
    if let aus = frame.actionUnits {
        print("AU12 (smile): \(aus[.au12] ?? 0)")
    }
    if let pose = frame.headPose {
        print("Head yaw: \(pose.yaw) rad")
    }
}

// Stop and get a summary
let result = try await tracker.stop()
print("Captured \(result.frameCount) frames over \(result.duration)s")
```

## Exporting Data

```swift
let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first!

let exportResult = try await tracker.export(
    session: result.sessionID,
    to: documentsURL,
    options: .init(tabularFormat: .csv, includeImages: false)
)
print("Exported to: \(exportResult.directory.path)")
```
This produces:

```
session_P001_2025-06-15/
  session.json     — session metadata
  blendshapes.csv  — one row per frame, one column per blend shape
  metadata.csv     — timestamps, gaze, light, head pose, action units
  events.json      — timestamped event markers (if any)
```

## Configuration

### ARKit

```swift
let tracker = FaceTracker(provider: .arKit(.init(
    captureBlendShapes: true,
    captureLookAtPoint: true,
    captureLightEstimation: true,
    captureDistanceToScreen: true,
    captureHeadPose: true,
    vertices: .all(precision: .float32),
    captureImages: .everyNthFrame(10),
    captureDepthMaps: .none,
    maxTrackedFaces: 1
)))
```

### MediaPipe

```swift
let tracker = FaceTracker(provider: .mediaPipe(.init(
    captureBlendShapes: true,
    vertices: .all(precision: .float32),
    captureImages: .everyNthFrame(10),
    maxTrackedFaces: 1
)))
```

To use a bundled model file instead of the automatic download:
```swift
let tracker = FaceTracker(provider: .mediaPipe(.init(
    modelPath: Bundle.main.url(forResource: "face_landmarker", withExtension: "task")!
)))
```

### Vertex capture options

```swift
.none                                              // No vertices (default)
.all(precision: .float32)                          // All vertices, full precision
.all(precision: .float16)                          // All vertices, half precision (~50% smaller)
.subset(indices: [0, 10, 20], precision: .float32) // Specific vertex indices
.stride(4, precision: .float32)                    // Every 4th vertex
```

### Image capture options

```swift
.none             // No images (default)
.everyFrame       // Every frame (~100-150 MB/min)
.everyNthFrame(5) // Every 5th frame
```

## Session Management

```swift
// List all stored sessions
let sessions = try await tracker.listSessions()
for session in sessions {
    print("\(session.participant): \(session.frameCount) frames, \(session.storageSizeBytes) bytes")
}

// Delete a session
try await tracker.deleteSession(session.sessionID)
```

## Pause and Resume

```swift
try await tracker.start(participant: "P001")

// Pause tracking (data is preserved)
try tracker.pause()

// Resume tracking (appends to the same session)
try await tracker.resume()

let result = try await tracker.stop()
```

## Error Handling

```swift
do {
    try await tracker.start(participant: "P001")
} catch FaceTrackerError.providerUnavailable(let name) {
    print("Provider \(name) is not supported on this device")
} catch FaceTrackerError.modelDownloadFailed(let error) {
    print("Could not download MediaPipe model: \(error)")
} catch FaceTrackerError.permissionDenied {
    print("Camera access was denied")
}
```

## FACS Action Units

FaceTrackingKit automatically computes 14 FACS Action Unit intensities from blend shapes (based on EMFACS Table 1, Aldenhoven et al.). Action units are available on every frame when blend shapes are enabled — no additional configuration needed.
```swift
for await frame in tracker.frames {
    if let aus = frame.actionUnits {
        print("AU6 (cheek raise): \(aus[.au6] ?? 0)")
        print("AU12 (smile): \(aus[.au12] ?? 0)")
    }
}
```

You can also compute action units from arbitrary blend shape dictionaries:

```swift
let aus = FaceTracker.actionUnits(from: blendShapes)
```

| AU | Name | Blend Shapes |
|---|---|---|
| AU1 | Inner Brow Raise | browInnerUp |
| AU2 | Outer Brow Raise | mean(browOuterUpLeft, browOuterUpRight) |
| AU4 | Brow Lowerer | mean(browDownLeft, browDownRight) |
| AU5 | Upper Lid Raise | mean(eyeWideLeft, eyeWideRight) |
| AU6 | Cheek Raise | mean(cheekSquintLeft, cheekSquintRight) |
| AU7 | Lid Tightener | mean(eyeSquintLeft, eyeSquintRight) |
| AU9 | Nose Wrinkler | mean(noseSneerLeft, noseSneerRight) |
| AU12 | Lip Corner Puller | mean(mouthSmileLeft, mouthSmileRight) |
| AU14 | Dimpler | mean(mouthDimpleLeft, mouthDimpleRight) |
| AU15 | Lip Corner Depressor | mean(mouthFrownLeft, mouthFrownRight) |
| AU16 | Lower Lip Depressor | mean(mouthLowerDownLeft, mouthLowerDownRight) |
| AU20 | Lip Stretcher | mean(mouthStretchLeft, mouthStretchRight) |
| AU23 | Lip Tightener | mouthPucker |
| AU26 | Jaw Drop | jawOpen |
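The table above is simple enough to reproduce outside the framework. The sketch below is a hypothetical stand-in for `FaceTracker.actionUnits(from:)`, assuming blend shapes arrive as a `[String: Float]` dictionary keyed by ARKit-style names (the framework may instead key them by an enum):

```swift
import Foundation

// Sketch: compute the 14 AU intensities per the mapping table above.
// Keys are ARKit-style blend shape names; missing shapes default to 0.
func actionUnits(from blendShapes: [String: Float]) -> [String: Float] {
    func value(_ name: String) -> Float { blendShapes[name] ?? 0 }
    func mean(_ a: String, _ b: String) -> Float { (value(a) + value(b)) / 2 }

    return [
        "au1":  value("browInnerUp"),
        "au2":  mean("browOuterUpLeft", "browOuterUpRight"),
        "au4":  mean("browDownLeft", "browDownRight"),
        "au5":  mean("eyeWideLeft", "eyeWideRight"),
        "au6":  mean("cheekSquintLeft", "cheekSquintRight"),
        "au7":  mean("eyeSquintLeft", "eyeSquintRight"),
        "au9":  mean("noseSneerLeft", "noseSneerRight"),
        "au12": mean("mouthSmileLeft", "mouthSmileRight"),
        "au14": mean("mouthDimpleLeft", "mouthDimpleRight"),
        "au15": mean("mouthFrownLeft", "mouthFrownRight"),
        "au16": mean("mouthLowerDownLeft", "mouthLowerDownRight"),
        "au20": mean("mouthStretchLeft", "mouthStretchRight"),
        "au23": value("mouthPucker"),
        "au26": value("jawOpen")
    ]
}

let aus = actionUnits(from: ["mouthSmileLeft": 0.8, "mouthSmileRight": 0.6])
print(aus["au12"]!) // ≈ 0.7
```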
## Head Pose

Head orientation (pitch, yaw, roll) is extracted from the ARKit face anchor transform. Enabled by default; ARKit only.
```swift
for await frame in tracker.frames {
    if let pose = frame.headPose {
        print("Pitch: \(pose.pitch), Yaw: \(pose.yaw), Roll: \(pose.roll)")
    }
}
```

Disable with `captureHeadPose: false` in `ARKitConfiguration`. Not available for MediaPipe.
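For intuition about what the extraction involves: pitch, yaw, and roll are Tait-Bryan angles recovered from the rotation part of the anchor's 4×4 transform. The sketch below assumes an `R = Rz(roll) · Ry(yaw) · Rx(pitch)` composition and plain nested arrays; the convention FaceTrackingKit actually uses is not specified here and may differ:

```swift
import Foundation

// Sketch: recover Tait-Bryan angles (pitch about x, yaw about y, roll about z)
// from a 3x3 rotation matrix, assuming R = Rz(roll) * Ry(yaw) * Rx(pitch).
func eulerAngles(from r: [[Double]]) -> (pitch: Double, yaw: Double, roll: Double) {
    let yaw   = asin(max(-1, min(1, -r[2][0]))) // clamp against rounding
    let pitch = atan2(r[2][1], r[2][2])
    let roll  = atan2(r[1][0], r[0][0])
    return (pitch, yaw, roll)
}

// Build a rotation from known angles to verify the round trip.
func rotation(pitch a: Double, yaw b: Double, roll g: Double) -> [[Double]] {
    let (ca, sa, cb, sb, cg, sg) = (cos(a), sin(a), cos(b), sin(b), cos(g), sin(g))
    // R = Rz(g) * Ry(b) * Rx(a)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb,     cb * sa,                cb * ca]
    ]
}

let pose = eulerAngles(from: rotation(pitch: 0.1, yaw: 0.2, roll: 0.3))
print(pose) // ≈ (0.1, 0.2, 0.3)
```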
## Event Markers

Record timestamped event markers to align face data with experimental stimuli (e.g., stimulus onset, trial boundaries). Events are written to `events.json` in the export directory.
```swift
try await tracker.start(participant: "P001")

tracker.addEvent("baseline_start")
// ... present stimulus ...
tracker.addEvent("stimulus_onset")
// ... wait ...
tracker.addEvent("stimulus_offset")

let result = try await tracker.stop()
```

Events are no-ops when no session is active. The timestamps use the same epoch-seconds clock as `Frame.timestamp`.
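Because events and frames share one clock, aligning them in post-hoc analysis reduces to a nearest-timestamp lookup. This helper is not part of the framework (the function name is hypothetical); a minimal sketch:

```swift
import Foundation

// Sketch: find the frame whose timestamp is closest to an event, via
// binary search over the sorted frame timestamps (epoch seconds).
func nearestFrameIndex(to eventTime: Double, in timestamps: [Double]) -> Int {
    precondition(!timestamps.isEmpty)
    var lo = 0, hi = timestamps.count - 1
    while lo < hi {
        let mid = (lo + hi) / 2
        if timestamps[mid] < eventTime { lo = mid + 1 } else { hi = mid }
    }
    // lo is the first frame at or after the event; compare with the one before.
    if lo > 0, eventTime - timestamps[lo - 1] < timestamps[lo] - eventTime {
        return lo - 1
    }
    return lo
}

let frameTimes = [0.0, 0.033, 0.066, 0.100].map { 1718451234.5 + $0 }
print(nearestFrameIndex(to: 1718451234.57, in: frameTimes)) // 2
```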
## Export Formats

### CSV

`blendshapes.csv` — one row per frame with columns for each blend shape:

```
timestamp,frameIndex,browDownLeft,browDownRight,...,noseSneerRight
1718451234.5,0,0.012,0.009,...,0.003
1718451234.53,1,0.014,0.011,...,0.002
```

`metadata.csv` — includes head pose and action units alongside other metadata:

```
timestamp,frameIndex,isFaceTracked,...,headPose.pitch,headPose.yaw,headPose.roll,au1,au2,...,au26
```
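For a quick sanity check of a CSV export in Swift, Foundation alone is enough, since the numeric layout has no quoted fields. A sketch using a reduced, illustrative column set:

```swift
import Foundation

// Sketch: parse a blendshapes.csv-style export into per-frame dictionaries.
// Assumes no quoted fields, which holds for this numeric layout.
func parseCSV(_ text: String) -> [[String: Double]] {
    let lines = text.split(separator: "\n").map(String.init)
    guard let header = lines.first else { return [] }
    let columns = header.split(separator: ",").map(String.init)
    return lines.dropFirst().map { line in
        let values = line.split(separator: ",").map { Double(String($0)) ?? .nan }
        return Dictionary(uniqueKeysWithValues: zip(columns, values))
    }
}

// Reduced sample in the same shape as the export above.
let sample = """
timestamp,frameIndex,browDownLeft,browDownRight
1718451234.5,0,0.012,0.009
1718451234.53,1,0.014,0.011
"""
let frames = parseCSV(sample)
print(frames.count, frames[1]["browDownLeft"]!) // 2 0.014
```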
### JSON Lines

With JSON Lines export, each frame is written as one JSON object per line:

```
{"timestamp":1718451234.5,"frameIndex":0,"browDownLeft":0.012,...}
{"timestamp":1718451234.53,"frameIndex":1,"browDownLeft":0.014,...}
```

### HDF5

Export to a single HDF5 file for Python/scientific analysis workflows. Readable by h5py, MATLAB, and R. No external dependencies required.
```swift
let exportResult = try await tracker.export(
    session: result.sessionID,
    to: documentsURL,
    options: .init(tabularFormat: .hdf5)
)
// exportResult.hdf5File → session.h5
```

The HDF5 file contains these datasets:
| Dataset | Shape | Type | Description |
|---|---|---|---|
| `/blendshapes` | N × 52 | Float32 | Blend shape values per frame |
| `/metadata` | N × M | Float32 | Head pose, AUs, gaze, light, emotions |
| `/timestamps` | N | Float64 | Frame timestamps (epoch seconds) |
| `/frame_indices` | N | Int32 | Sequential frame numbers |
| `/events` | K | String | Event markers (if recorded) |
Each dataset has a `columns` attribute with comma-separated column names.

Reading in Python:
```python
import h5py

with h5py.File("session.h5", "r") as f:
    timestamps = f["timestamps"][:]
    blendshapes = f["blendshapes"][:]
    columns = f["blendshapes"].attrs["columns"].decode().split(",")
```

When HDF5 is selected, CSV/JSONL files are not written.
## Provider Comparison

| Feature | ARKit | MediaPipe |
|---|---|---|
| Device requirement | TrueDepth camera (Face ID) | Any front camera |
| Blend shapes | 52 (incl. tongueOut) | 51 |
| Action units | Yes (14 AUs, from blend shapes) | Yes (14 AUs, from blend shapes) |
| Head pose | Yes (pitch, yaw, roll) | No |
| Vertices | ~1220 (camera space, meters) | 478 (normalized 0-1) |
| Gaze tracking | Yes | No |
| Light estimation | Yes | No |
| Distance to screen | Yes | No |
| Depth maps | Yes | No |
| Event markers | Yes | Yes |
| Model download | None needed | Auto (~4 MB, cached) |
## License

See LICENSE for details.