FaceTrackingKit

A Swift framework for real-time face tracking on iOS, built for research. Supports both ARKit (TrueDepth camera devices) and MediaPipe (any device with a front camera).

Features

  • Real-time blend shapes, face mesh vertices, gaze tracking, and light estimation
  • FACS Action Units — 14 Action Unit intensities computed automatically from blend shapes
  • Head pose — pitch, yaw, roll in radians (ARKit only)
  • Event markers — timestamped labels for aligning face data with experimental stimuli
  • Async stream API for live frame access
  • Built-in session storage with CSV, JSON Lines, and HDF5 export
  • Optional image and depth map capture
  • MediaPipe model is downloaded and cached automatically

Requirements

  • iOS 17+
  • Swift 6.0+

Installation

Add FaceTrackingKit to your project via Swift Package Manager:

dependencies: [
    .package(url: "https://github.com/digital-medicine/FaceTrackingKit", from: "0.1.0")
]
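If your app is itself a Swift package, also list the product in your target's dependencies (a sketch assuming the product name matches the package name):

```swift
.target(
    name: "YourApp",
    dependencies: [
        .product(name: "FaceTrackingKit", package: "FaceTrackingKit")
    ]
)
```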

Quick Start

1. Choose a provider

ARKit — High accuracy, requires Face ID device (iPhone/iPad with TrueDepth camera). Provides 52 blend shapes, gaze tracking, light estimation, and depth maps.

MediaPipe — Works on any iOS device with a front camera. Provides 51 blend shapes and 478 face landmarks. The model (~4 MB) is downloaded automatically on first use.

2. Add camera permission

Add to your Info.plist:

NSCameraUsageDescription
This app uses the camera for face tracking.
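In Info.plist source form, this is a standard key/string pair:

```xml
<key>NSCameraUsageDescription</key>
<string>This app uses the camera for face tracking.</string>
```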

3. Track faces

import FaceTrackingKit

// Create a tracker — pick one:
let tracker = FaceTracker(provider: .arKit())       // ARKit (Face ID devices)
let tracker = FaceTracker(provider: .mediaPipe())    // MediaPipe (any device)

// Start a session
try await tracker.start(participant: "P001")

// Mark experimental events
tracker.addEvent("stimulus_onset")

// Read frames in real time
for await frame in tracker.frames {
    if let aus = frame.actionUnits {
        print("AU12 (smile): \(aus[.au12] ?? 0)")
    }
    if let pose = frame.headPose {
        print("Head yaw: \(pose.yaw) rad")
    }
}

// Stop and get a summary
let result = try await tracker.stop()
print("Captured \(result.frameCount) frames over \(result.duration)s")

4. Export session data

let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first!

let exportResult = try await tracker.export(
    session: result.sessionID,
    to: documentsURL,
    options: .init(tabularFormat: .csv, includeImages: false)
)

print("Exported to: \(exportResult.directory.path)")
// Produces:
//   session_P001_2025-06-15/
//     session.json          — session metadata
//     blendshapes.csv       — one row per frame, one column per blend shape
//     metadata.csv          — timestamps, gaze, light, head pose, action units
//     events.json           — timestamped event markers (if any)

Configuration

ARKit

let tracker = FaceTracker(provider: .arKit(.init(
    captureBlendShapes: true,
    captureLookAtPoint: true,
    captureLightEstimation: true,
    captureDistanceToScreen: true,
    captureHeadPose: true,
    vertices: .all(precision: .float32),
    captureImages: .everyNthFrame(10),
    captureDepthMaps: .none,
    maxTrackedFaces: 1
)))

MediaPipe

let tracker = FaceTracker(provider: .mediaPipe(.init(
    captureBlendShapes: true,
    vertices: .all(precision: .float32),
    captureImages: .everyNthFrame(10),
    maxTrackedFaces: 1
)))

To use a bundled model file instead of the automatic download:

let tracker = FaceTracker(provider: .mediaPipe(.init(
    modelPath: Bundle.main.url(forResource: "face_landmarker", withExtension: "task")!
)))

Vertex capture modes

.none                                       // No vertices (default)
.all(precision: .float32)                   // All vertices, full precision
.all(precision: .float16)                   // All vertices, half precision (~50% smaller)
.subset(indices: [0, 10, 20], precision: .float32)  // Specific vertex indices
.stride(4, precision: .float32)             // Every 4th vertex

Image capture modes

.none               // No images (default)
.everyFrame         // Every frame (~100-150 MB/min)
.everyNthFrame(5)   // Every 5th frame
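The vertex and image options compose; for example, a storage-conscious sketch for long recordings (assuming any options left out of the initializer keep their defaults):

```swift
// Half-precision, strided vertices plus sparse image capture
// to keep the stored session size down.
let tracker = FaceTracker(provider: .arKit(.init(
    vertices: .stride(4, precision: .float16),
    captureImages: .everyNthFrame(30),
    captureDepthMaps: .none
)))
```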

Session Management

// List all stored sessions
let sessions = try await tracker.listSessions()
for session in sessions {
    print("\(session.participant): \(session.frameCount) frames, \(session.storageSizeBytes) bytes")
}

// Delete a session
try await tracker.deleteSession(session.sessionID)

Pause and Resume

try await tracker.start(participant: "P001")

// Pause tracking (data is preserved)
try tracker.pause()

// Resume tracking (appends to the same session)
try await tracker.resume()

let result = try await tracker.stop()

Error Handling

do {
    try await tracker.start(participant: "P001")
} catch FaceTrackerError.providerUnavailable(let name) {
    print("Provider \(name) is not supported on this device")
} catch FaceTrackerError.modelDownloadFailed(let error) {
    print("Could not download MediaPipe model: \(error)")
} catch FaceTrackerError.permissionDenied {
    print("Camera access was denied")
} catch {
    print("Unexpected error: \(error)")
}

Action Units

FaceTrackingKit automatically computes 14 FACS Action Unit intensities from blend shapes (based on EMFACS Table 1, Aldenhoven et al.). Action units are available on every frame when blend shapes are enabled — no additional configuration needed.

for await frame in tracker.frames {
    if let aus = frame.actionUnits {
        print("AU6 (cheek raise): \(aus[.au6] ?? 0)")
        print("AU12 (smile): \(aus[.au12] ?? 0)")
    }
}

You can also compute action units from arbitrary blend shape dictionaries:

let aus = FaceTracker.actionUnits(from: blendShapes)

| AU | Name | Blend shapes |
|------|----------------------|----------------------------------------------|
| AU1 | Inner Brow Raise | browInnerUp |
| AU2 | Outer Brow Raise | mean(browOuterUpLeft, browOuterUpRight) |
| AU4 | Brow Lowerer | mean(browDownLeft, browDownRight) |
| AU5 | Upper Lid Raise | mean(eyeWideLeft, eyeWideRight) |
| AU6 | Cheek Raise | mean(cheekSquintLeft, cheekSquintRight) |
| AU7 | Lid Tightener | mean(eyeSquintLeft, eyeSquintRight) |
| AU9 | Nose Wrinkler | mean(noseSneerLeft, noseSneerRight) |
| AU12 | Lip Corner Puller | mean(mouthSmileLeft, mouthSmileRight) |
| AU14 | Dimpler | mean(mouthDimpleLeft, mouthDimpleRight) |
| AU15 | Lip Corner Depressor | mean(mouthFrownLeft, mouthFrownRight) |
| AU16 | Lower Lip Depressor | mean(mouthLowerDownLeft, mouthLowerDownRight) |
| AU20 | Lip Stretcher | mean(mouthStretchLeft, mouthStretchRight) |
| AU23 | Lip Tightener | mouthPucker |
| AU26 | Jaw Drop | jawOpen |
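As an illustration of the mapping (a standalone sketch, not the library's implementation), the mean-based rows reduce to a simple average over a blend-shape dictionary:

```swift
// Average the listed blend shapes per the table; missing keys count as 0.
func mean(_ blendShapes: [String: Float], _ keys: String...) -> Float {
    keys.map { blendShapes[$0] ?? 0 }.reduce(0, +) / Float(keys.count)
}

let blendShapes: [String: Float] = [
    "browOuterUpLeft": 0.2, "browOuterUpRight": 0.4,
    "mouthSmileLeft": 0.8, "mouthSmileRight": 0.6,
]
let au2  = mean(blendShapes, "browOuterUpLeft", "browOuterUpRight")  // AU2 ≈ 0.3
let au12 = mean(blendShapes, "mouthSmileLeft", "mouthSmileRight")    // AU12 ≈ 0.7
```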

Head Pose

Head orientation (pitch, yaw, roll) is extracted from the ARKit face anchor transform. It is enabled by default and available only with the ARKit provider.

for await frame in tracker.frames {
    if let pose = frame.headPose {
        print("Pitch: \(pose.pitch), Yaw: \(pose.yaw), Roll: \(pose.roll)")
    }
}

Disable with captureHeadPose: false in ARKitConfiguration. Not available for MediaPipe.

Event Markers

Record timestamped event markers to align face data with experimental stimuli (e.g., stimulus onset, trial boundaries). Events are written to events.json in the export directory.

try await tracker.start(participant: "P001")

tracker.addEvent("baseline_start")
// ... present stimulus ...
tracker.addEvent("stimulus_onset")
// ... wait ...
tracker.addEvent("stimulus_offset")

let result = try await tracker.stop()

Events are no-ops when no session is active. The timestamps use the same epoch-seconds clock as Frame.timestamp.

Export Formats

CSV (default)

blendshapes.csv — one row per frame with columns for each blend shape:

timestamp,frameIndex,browDownLeft,browDownRight,...,noseSneerRight
1718451234.5,0,0.012,0.009,...,0.003
1718451234.53,1,0.014,0.011,...,0.002

metadata.csv — includes head pose and action units alongside other metadata:

timestamp,frameIndex,isFaceTracked,...,headPose.pitch,headPose.yaw,headPose.roll,au1,au2,...,au26

JSON Lines

{"timestamp":1718451234.5,"frameIndex":0,"browDownLeft":0.012,...}
{"timestamp":1718451234.53,"frameIndex":1,"browDownLeft":0.014,...}

HDF5

Export to a single HDF5 file for Python/scientific analysis workflows. Readable by h5py, MATLAB, and R. No external dependencies required.

let exportResult = try await tracker.export(
    session: result.sessionID,
    to: documentsURL,
    options: .init(tabularFormat: .hdf5)
)
// exportResult.hdf5File → session.h5

The HDF5 file contains these datasets:

| Dataset | Shape | Type | Description |
|----------------|--------|---------|----------------------------------------|
| /blendshapes | N × 52 | Float32 | Blend shape values per frame |
| /metadata | N × M | Float32 | Head pose, AUs, gaze, light, emotions |
| /timestamps | N | Float64 | Frame timestamps (epoch seconds) |
| /frame_indices | N | Int32 | Sequential frame numbers |
| /events | K | String | Event markers (if recorded) |

Each dataset has a columns attribute with comma-separated column names.

Reading in Python:

import h5py

with h5py.File("session.h5", "r") as f:
    timestamps = f["timestamps"][:]
    blendshapes = f["blendshapes"][:]
    columns = f["blendshapes"].attrs["columns"].decode().split(",")

When HDF5 is selected, CSV/JSONL files are not written.

Provider Comparison

| Feature | ARKit | MediaPipe |
|--------------------|--------------------------------|--------------------------------|
| Device requirement | TrueDepth camera (Face ID) | Any front camera |
| Blend shapes | 52 (incl. tongueOut) | 51 |
| Action units | Yes (14 AUs, from blend shapes) | Yes (14 AUs, from blend shapes) |
| Head pose | Yes (pitch, yaw, roll) | No |
| Vertices | ~1220 (camera space, meters) | 478 (normalized 0-1) |
| Gaze tracking | Yes | No |
| Light estimation | Yes | No |
| Distance to screen | Yes | No |
| Depth maps | Yes | No |
| Event markers | Yes | Yes |
| Model download | None needed | Auto (~4 MB, cached) |
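One way to choose between the two columns at runtime is to check for TrueDepth support (a sketch: `ARFaceTrackingConfiguration.isSupported` is standard ARKit; the rest follows the examples above):

```swift
import ARKit
import FaceTrackingKit

// Prefer ARKit on TrueDepth (Face ID) devices; fall back to MediaPipe elsewhere.
let tracker = ARFaceTrackingConfiguration.isSupported
    ? FaceTracker(provider: .arKit())
    : FaceTracker(provider: .mediaPipe())
```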

License

See LICENSE for details.
