AMS — Audiobook Mastering Suite (ASR + Forced Alignment + DSP Pipeline)
A multi-stage audiobook pipeline that aligns manuscript text to recorded audio, producing word-level timing artifacts and a foundation for automation in mastering and QC.
Executive Summary
AMS is my “magnum opus” tooling project: a CLI-driven pipeline for audiobook production that turns book text and chapter audio into structured artifacts (ASR transcripts, anchor maps, hydrated transcripts, and forced-alignment timings). Those artifacts are what later mastering and quality-control automation builds on.
The problem space
Audiobook work is uniquely painful because you need both:
- Audio correctness (levels, noise floor, spacing, consistency)
- Text correctness (the spoken audio matches the manuscript, including pacing and boundaries)
Most workflows treat these as separate. AMS tries to unify them by making “text ↔ audio alignment” a first-class artifact.
Design goals
- Repeatable pipeline: deterministic artifacts, stable outputs
- Extensible architecture: CLI now, UI/daemon later
- Language/tool flexibility: use the best tool for each job
  - ASR/alignment where ML helps
  - DSP where low-level performance matters
  - .NET orchestration for workflow, DI, composition, and host flexibility
Pipeline Overview
AMS runs in stages: ASR → anchor discovery → transcript indexing → hydration → forced alignment → timing merge, outputting JSON artifacts at each step so you can debug or swap stages.
Stages and artifacts
- ASR stage
  - Produces initial word/token timings from audio
  - Output example: chapter.asr.json
- Alignment stage
  - Identifies reliable anchor points between manuscript and ASR output
  - Builds transcript index and hydrated transcript mapping
  - Output examples: chapter.align.anchors.json, chapter.align.tx.json, chapter.align.hydrate.json
- Forced alignment stage (MFA)
  - Refines timings with phoneme/word boundary precision
  - Produces TextGrid + analysis artifacts
- Timing merge
  - Merges MFA timing back into the hydrated transcript artifacts
  - Produces “final timing truth” suitable for QC automation
Why artifact-first matters
Artifact-first pipelines are debuggable:
- If ASR is wrong, you can see it
- If anchors drift, you can inspect the anchor file
- If MFA fails due to OOV words, you can inspect the dictionary generation output
This is the difference between a “magic black box” and a professional pipeline.
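As a rough illustration of the artifact-first idea (sketched in Python for brevity rather than the project's C#), each stage can write a JSON file and reuse it on the next run. The `run_stage` and `run_pipeline` helpers and the placeholder stage bodies are hypothetical and not part of AMS; only the artifact file names come from the stage list above.

```python
import json
from pathlib import Path
from typing import Callable


def run_stage(out_path: Path, compute: Callable[[], dict], force: bool = False) -> dict:
    """Run one stage, or reuse its JSON artifact if it already exists."""
    if out_path.exists() and not force:
        return json.loads(out_path.read_text(encoding="utf-8"))
    result = compute()
    # sort_keys plus a fixed indent keeps the file stable across runs,
    # so artifacts can be diffed between pipeline versions.
    out_path.write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")
    return result


def run_pipeline(chapter_dir: Path, stem: str = "chapter") -> dict:
    # Placeholder stage bodies: a real stage would call the ASR engine
    # and the anchor selector instead of returning canned data.
    asr = run_stage(
        chapter_dir / f"{stem}.asr.json",
        lambda: {"tokens": [{"w": "hello", "start": 0.0, "end": 0.4}]},
    )
    anchors = run_stage(
        chapter_dir / f"{stem}.align.anchors.json",
        lambda: {"anchors": [], "asr_token_count": len(asr["tokens"])},
    )
    return anchors
```

Because every stage reads its input from the previous artifact, a single stage can be rerun with `force=True` or replaced entirely without touching the others.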
Architecture
AMS is structured around use-case commands (single responsibility entry points) and hostable orchestration services, so the exact same core can power CLI, web UI, or a future daemon.
Solution Structure
The solution's nine projects separate concerns across hosts and core; the tree lists seven of them:
Ams.sln
├── Ams.Core # Domain library - all pipeline logic
├── Ams.Cli # CLI host with Cocona commands
├── Ams.Web # Blazor Server SSR host
├── Ams.Web.Client # Blazor WebAssembly client
├── Ams.Web.Api # Web API endpoints
├── Ams.UI.Avalonia # Cross-platform desktop UI
└── Ams.Tests # Test project
Runtime vs Services Architecture
The architecture separates Runtime (contained execution units) from Services (orchestration):
- Runtime = Self-contained contexts that own state and artifacts: Workspace, BookContext, ChapterContext, AudioBufferContext
- Services = Orchestration layer that coordinates runtime + external tools: AsrService, FFmpegService, AlignmentService
This separation enables different hosts (CLI, Web, Desktop) to share all business logic.
IWorkspace Interface
The key abstraction that enables multi-host architecture:
public interface IWorkspace
{
    /// <summary>
    /// Root directory for this workspace (typically the book folder).
    /// </summary>
    string RootPath { get; }
    /// <summary>
    /// The long-lived book context for this workspace.
    /// </summary>
    BookContext Book { get; }
    /// <summary>
    /// Convenience accessor for the chapter manager.
    /// </summary>
    ChapterManager Chapters => Book.Chapters;
    /// <summary>
    /// Opens (or creates) a chapter context with the supplied options.
    /// </summary>
    ChapterContextHandle OpenChapter(ChapterOpenOptions options);
}
Each host implements this interface: CliWorkspace, BlazorWorkspace, AvaloniaWorkspace. The same Ams.Core powers all of them.
Manager → Context → Documents Pattern
Runtime follows a consistent hierarchy:
- Manager - Owns multiple contexts with cursor-based navigation
- Context - Single execution context with lifecycle (book, chapter, audio buffer)
- Documents - Multiple document slots with lazy loading and dirty tracking
public sealed class ChapterDocuments
{
    private readonly DocumentSlot<TranscriptIndex> _transcript;
    private readonly DocumentSlot<HydratedTranscript> _hydratedTranscript;
    private readonly DocumentSlot<AnchorDocument> _anchors;
    private readonly DocumentSlot<AsrResponse> _asr;
    private readonly DocumentSlot<PauseAdjustmentsDocument> _pauseAdjustments;
    private readonly DocumentSlot<TextGridDocument> _textGrid;
    // Each slot provides lazy loading, dirty tracking, and persistence
    internal bool IsDirty =>
        _transcript.IsDirty ||
        _hydratedTranscript.IsDirty ||
        _anchors.IsDirty /* ... */;
}
This enables efficient memory management: ChapterManager uses LRU eviction to bound memory when processing books with many chapters.
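To make "lazy loading, dirty tracking, and persistence" concrete, here is a minimal sketch of what a single slot could look like. It is written in Python for brevity, and the class shape, method names, and JSON storage are assumptions for illustration, not the actual `DocumentSlot<T>` implementation.

```python
import json
from pathlib import Path
from typing import Any, Optional


class DocumentSlot:
    """Hypothetical slot: loads JSON on first access and tracks unsaved edits."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._value: Optional[Any] = None
        self._loaded = False
        self.is_dirty = False

    def get(self) -> Optional[Any]:
        if not self._loaded:  # lazy: touch the disk only on first read
            if self._path.exists():
                self._value = json.loads(self._path.read_text(encoding="utf-8"))
            self._loaded = True
        return self._value

    def set(self, value: Any) -> None:
        self._value = value
        self._loaded = True
        self.is_dirty = True  # remember that memory and disk now differ

    def save(self) -> None:
        if self.is_dirty:  # skip the write when nothing changed
            self._path.write_text(json.dumps(self._value), encoding="utf-8")
            self.is_dirty = False
```

A chapter whose slots are all clean can be dropped from memory without writing anything, which is what makes an eviction policy cheap to apply.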
Alignment: Anchors, Indexing, Hydration
The alignment strategy is built around finding “high-confidence” anchor points and then expanding into a full manuscript-to-audio map.
Anchor Discovery with LIS
Anchors are stable points where text and ASR strongly agree. The algorithm uses n-gram matching with Longest Increasing Subsequence (LIS) to ensure monotonicity:
public static IReadOnlyList<Anchor> SelectAnchors(
    IReadOnlyList<string> bookTokens,
    IReadOnlyList<int> bookSentenceIndex,
    IReadOnlyList<string> asrTokens,
    AnchorPolicy policy)
{
    var n = policy.NGram;
    // `desired` is the target anchor count (its computation is elided in this excerpt)
    // Find unique n-gram matches (book position, ASR position)
    var anchors = Collect(bookTokens, bookSentenceIndex, asrTokens, n,
        okBook: list => list.Count == 1, // Unique in book
        okAsr: list => list.Count == 1, // Unique in ASR
        policy);
    // Density control: retry with shorter n-grams if too few anchors were found
    if (anchors.Count < desired && n > 2)
    {
        var subPolicy = policy with { NGram = n - 1 };
        anchors = SelectAnchors(bookTokens, bookSentenceIndex, asrTokens, subPolicy);
    }
    // Monotonicity: LIS ensures anchors don't "cross"
    anchors.Sort((x, y) => x.Bp.CompareTo(y.Bp));
    var lisPairs = LisByAp(anchors.Select(a => (a.Bp, a.Ap)).ToList());
    return lisPairs.Select(p => new Anchor(p.bp, p.ap)).ToList();
}
This prevents alignment drift across long chapters by establishing “checkpoint” positions that both streams agree on.
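The two core steps, unique n-gram matching and the LIS filter, can be sketched in a few lines. This is a Python illustration under simplifying assumptions (no sentence index, no policy object, strict uniqueness only); the function names are hypothetical and do not mirror `Collect` or `LisByAp`.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_ngram_matches(book, asr, n):
    """Return sorted (book_pos, asr_pos) for n-grams that occur exactly once in each stream."""
    def index(tokens):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        return positions

    book_idx, asr_idx = index(book), index(asr)
    return sorted(
        (bp[0], asr_idx[gram][0])
        for gram, bp in book_idx.items()
        if len(bp) == 1 and len(asr_idx.get(gram, ())) == 1
    )


def lis_by_asr_position(pairs):
    """Longest strictly increasing run of ASR positions; pairs must be sorted by book position."""
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for i, (_, ap) in enumerate(pairs):
        k = bisect_left(tails, ap)  # patience-sorting step, O(log n) per pair
        if k == len(tails):
            tails.append(ap)
            tail_idx.append(i)
        else:
            tails[k] = ap
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    # Walk the predecessor links back from the end of the longest chain.
    out, i = [], (tail_idx[-1] if tail_idx else None)
    while i is not None:
        out.append(pairs[i])
        i = prev[i]
    return out[::-1]
```

For example, `lis_by_asr_position([(0, 5), (1, 1), (2, 2), (3, 0), (4, 3)])` keeps `[(1, 1), (2, 2), (4, 3)]` and drops the two pairs that would force the alignment to jump backwards in the audio.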
Windowed Dynamic Programming Aligner
Between anchors, a windowed DP aligner fills gaps with phoneme-aware substitution costs:
public static List<AlignResult> AlignWindows(
    IReadOnlyList<string> bookNorm,
    IReadOnlyList<string> asrNorm,
    IReadOnlyList<(int bLo, int bHi, int aLo, int aHi)> windows,
    IReadOnlyDictionary<string, string> equiv,
    ISet<string> fillers)
{
    foreach (var (bLo, bHi, aLo, aHi) in windows)
    {
        // n, m: number of book / ASR tokens in this window (setup elided in this excerpt)
        var dp = new double[n + 1, m + 1];
        // Fill DP with phoneme-aware costs
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
            {
                var sub = dp[i - 1, j - 1] + SubCost(bookNorm[bLo + i - 1], asrNorm[aLo + j - 1], equiv);
                var del = dp[i - 1, j] + DelCost(bookNorm[bLo + i - 1]);
                var ins = dp[i, j - 1] + InsCost(asrNorm[aLo + j - 1], fillers);
                // Select minimum cost path...
            }
    }
    // ... backtrace and result collection elided
}
The equiv dictionary handles phonetic equivalences (e.g., “gonna” ↔ “going to”), while the fillers set gives a lower insertion cost to speech disfluencies.
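For readers who want to see the whole recurrence including the backtrace, here is a self-contained sketch of aligning one window. It is Python rather than C#, it handles only single-token equivalences, and the cost values (1.0 for substitution and deletion, 0.3 for inserting a filler) are illustrative assumptions, not the weights AMS uses.

```python
def align_window(book, asr, equiv=None, fillers=frozenset()):
    """Weighted edit-distance alignment of one window; returns (cost, ops)."""
    equiv = equiv or {}

    def sub_cost(b, a):
        # Zero cost for identical tokens or tokens declared equivalent.
        return 0.0 if b == a or equiv.get(a) == b or equiv.get(b) == a else 1.0

    def ins_cost(a):
        # Extra ASR tokens are cheap when they are known fillers.
        return 0.3 if a in fillers else 1.0

    n, m = len(book), len(asr)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + 1.0
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(asr[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(book[i - 1], asr[j - 1]),
                dp[i - 1][j] + 1.0,
                dp[i][j - 1] + ins_cost(asr[j - 1]),
            )

    # Backtrace from the bottom-right corner to recover the edit script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub_cost(book[i - 1], asr[j - 1]):
            kind = "match" if sub_cost(book[i - 1], asr[j - 1]) == 0.0 else "sub"
            ops.append((kind, i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1.0:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return dp[n][m], ops[::-1]
```

Restricting the DP to the gap between two anchors keeps each table small, so the quadratic cost applies per window rather than to the whole chapter.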
Hydration
Hydration takes the index and produces a rich timed transcript suitable for:
- segmenting chapters by paragraph/scene
- identifying pacing irregularities
- building UI overlays (future)
MFA Integration
For production-grade timing, the pipeline integrates Montreal Forced Aligner:
- Dictionary generation - Auto-generate pronunciations for OOV words
- Acoustic model - Pre-trained English model handles narrator variation
- TextGrid output - Phoneme- and word-level boundaries (at the aligner's frame resolution rather than individual samples)
- Timing merge - MFA timings merged back into hydrated transcript artifacts
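The timing-merge step can be pictured as overlaying aligner intervals onto the hydrated words in reading order. The sketch below is Python, and the data shapes (dicts with `text`, `start`, `end`; intervals as `(label, start, end)` tuples), the three-interval lookahead, and the `source` field are all assumptions made for illustration; the real merge logic in AMS may differ.

```python
def merge_timings(hydrated_words, mfa_intervals, lookahead=3):
    """Overlay (label, start, end) aligner intervals onto hydrated words, in order."""
    merged, k = [], 0
    for word in hydrated_words:
        entry = dict(word, source="asr")  # keep ASR timing unless a match is found
        # Look a few intervals ahead so entries that are not manuscript
        # words (for example empty silence labels) are skipped.
        for look in range(k, min(k + lookahead, len(mfa_intervals))):
            label, start, end = mfa_intervals[look]
            if label.lower() == word["text"].lower():
                entry.update(start=start, end=end, source="mfa")
                k = look + 1
                break
        merged.append(entry)
    return merged
```

Recording where each timing came from means a QC pass can flag words that fell back to the coarser ASR timing.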
DSP Strategy (Zig + Plugin Chain Concepts)
AMS treats audio processing as its own domain: orchestration in .NET, DSP in a low-level performant layer, with a long-term plan for chain-based processing and plugin introspection.
Zig DSP Module Pattern
DSP is performance-sensitive and benefits from predictable low-level control. Zig modules export through C ABI for .NET interop:
// Real-time safe DSP with parameter smoothing
pub const Processor = struct {
    sample_rate: f32,
    target_gain: f32,
    current_gain: f32,
    smooth_coeff: f32,
    pub fn init(sample_rate: f32) Processor {
        return .{
            .sample_rate = sample_rate,
            .target_gain = 1.0,
            .current_gain = 1.0,
            .smooth_coeff = 1.0 - @exp(-1.0 / (sample_rate * 0.01)),
        };
    }
    pub fn process(self: *Processor, buffer: []f32) void {
        for (buffer) |*sample| {
            // Real-time safe: no allocations, no locks
            self.current_gain += (self.target_gain - self.current_gain) * self.smooth_coeff;
            sample.* *= self.current_gain;
        }
    }
};
// C ABI export for PInvoke
export fn dsp_process(ctx: *Processor, buf: [*]f32, len: usize) void {
    ctx.process(buf[0..len]);
}
This pattern enables .NET orchestration with Zig performance for audio processing.
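The smoothing coefficient in `init` is a one-pole filter with a 10 ms time constant: after 10 ms the gain has covered about 63% of the distance to its target. A small Python model of the same arithmetic makes that easy to check; the function names are illustrative and not part of the Zig module.

```python
import math


def smooth_coeff(sample_rate: float, time_constant_s: float = 0.01) -> float:
    """Per-sample coefficient of a one-pole smoother with the given time constant."""
    return 1.0 - math.exp(-1.0 / (sample_rate * time_constant_s))


def apply_gain(buffer, target_gain, current_gain, coeff):
    """Mirror of the Zig process loop: ramp the gain, then scale each sample."""
    out = []
    for sample in buffer:
        current_gain += (target_gain - current_gain) * coeff
        out.append(sample * current_gain)
    return out, current_gain
```

At 48 kHz, running 480 samples (10 ms) from a gain of 1.0 toward 0.0 leaves the gain at roughly `exp(-1)`, about 0.368, which is why an abrupt parameter change does not produce an audible click.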
FFmpeg Fluent Filter API
For complex audio chains, a fluent C# API wraps libavfilter:
var asrReady = FfFilterGraph
    .FromBuffer(rawAudio)
    .HighPass(70) // Remove rumble
    .FftDenoise(12) // Reduce noise floor
    .LoudNorm(-18, 7, -2) // EBU R128 normalization
    .Custom("pan=mono|c0=0.5*c0+0.5*c1") // Downmix
    .ToBuffer();
Filters compose naturally and execute as a single FFmpeg process, with no intermediate files.
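One way to picture how a fluent chain can collapse into a single FFmpeg filter graph is a builder that accumulates filter specs and joins them with commas. The Python class below is a hypothetical stand-in for `FfFilterGraph`; the filter names (`highpass`, `afftdn`, `loudnorm`, `pan`) are real FFmpeg filters, but the mapping of the three `LoudNorm` arguments to `I`, `LRA`, and `TP` is an assumption.

```python
class FilterChain:
    """Hypothetical builder that renders a comma-separated FFmpeg filter string."""

    def __init__(self):
        self._filters = []

    def _add(self, spec):
        self._filters.append(spec)
        return self  # returning self is what makes the calls chainable

    def high_pass(self, hz):
        return self._add(f"highpass=f={hz}")

    def fft_denoise(self, noise_reduction_db):
        return self._add(f"afftdn=nr={noise_reduction_db}")

    def loud_norm(self, integrated, loudness_range, true_peak):
        # Assumed argument order: integrated loudness, loudness range, true peak.
        return self._add(f"loudnorm=I={integrated}:LRA={loudness_range}:TP={true_peak}")

    def custom(self, spec):
        return self._add(spec)

    def build(self):
        return ",".join(self._filters)


chain = (
    FilterChain()
    .high_pass(70)
    .fft_denoise(12)
    .loud_norm(-18, 7, -2)
    .custom("pan=mono|c0=0.5*c0+0.5*c1")
    .build()
)
```

The resulting string could be passed to the `ffmpeg` command line through `-af`, so the whole chain runs in one pass over the audio.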
Prosody-Aware Pause Dynamics
The pipeline classifies pauses by structural context (sentence, paragraph, section):
public PauseAnalysisReport AnalyzeChapter(
    TranscriptIndex transcript,
    BookIndex bookIndex,
    PausePolicy policy)
{
    var sentenceToParagraph = BuildSentenceParagraphMap(bookIndex);
    var headingParagraphIds = BuildHeadingParagraphSet(bookIndex);
    var spans = BuildInterSentenceSpans(transcript, sentenceToParagraph, headingParagraphIds);
    // Group by pause class and compute statistics
    var classStats = new Dictionary<PauseClass, PauseClassSummary>();
    foreach (var group in spans.GroupBy(span => span.Class))
    {
        classStats[group.Key] = PauseClassSummary.FromDurations(
            group.Select(x => x.DurationSec));
    }
    // ... report construction and return elided
}
This enables intelligent pause normalization: paragraph breaks should breathe more than sentence breaks.
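A compact way to see the classification and grouping together is the Python sketch below. The class names (`sentence`, `paragraph`, `section`), the rule that any gap touching a heading paragraph counts as a section break, and the input shape are assumptions for illustration rather than the actual `PauseClass` definitions.

```python
from collections import defaultdict
from statistics import mean, median


def classify_pause(prev_paragraph, next_paragraph, heading_paragraphs):
    """Assumed rule: heading-adjacent gaps are section breaks."""
    if prev_paragraph in heading_paragraphs or next_paragraph in heading_paragraphs:
        return "section"
    return "paragraph" if prev_paragraph != next_paragraph else "sentence"


def summarize_pauses(sentences, heading_paragraphs=frozenset()):
    """sentences: dicts with 'paragraph', 'start', 'end', in reading order."""
    durations = defaultdict(list)
    for prev, nxt in zip(sentences, sentences[1:]):
        cls = classify_pause(prev["paragraph"], nxt["paragraph"], heading_paragraphs)
        durations[cls].append(round(nxt["start"] - prev["end"], 3))
    return {
        cls: {"count": len(d), "mean": mean(d), "median": median(d)}
        for cls, d in durations.items()
    }
```

With per-class statistics in hand, a normalization pass can pull outliers toward the median of their own class instead of applying one global pause length.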
Plugin chain / node graph direction
The goal is a future node-based UI where:
- each stage is a node (ASR, alignment, noise profile, limiter, etc.)
- edges represent data (audio stream, transcript artifact, timing map)
- every node output is an artifact you can diff and reason about
AMS is structured now so this UI can exist later without rewriting the core.
What Makes This a Portfolio-Grade Project
AMS demonstrates “systems thinking”: multi-stage pipelines, strong separation of concerns, artifact-first debugging, and a realistic plan to evolve from CLI to UI without rewriting the engine.
Engineering themes I’m deliberately showcasing
- Hostable application core (CLI today, other hosts later)
- Explicit, inspectable artifacts at each pipeline stage
- Separation of concerns (alignment vs orchestration vs DSP)
- Tooling pragmatism (use the right tool for the job)
- Scalability path (concurrency, batching, deterministic artifacts)
What I would add next
- Golden test corpus with known-good chapters for regression testing
- An “operator UI” that visualizes:
  - anchor confidence
  - drift zones
  - timing discrepancies
- Automated QC rules:
  - long silences
  - inconsistent room tone segments
  - mouth noise classification hooks
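As an example of how small a timing-based QC rule can be once word timings exist, here is a Python sketch of a long-silence check. The function name, the input shape, and the 2-second threshold are hypothetical; none of this exists in AMS yet.

```python
def find_long_silences(words, max_gap_s=2.0):
    """Flag gaps between consecutive words that exceed max_gap_s seconds."""
    findings = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap > max_gap_s:
            findings.append({
                "after": prev["text"],
                "before": nxt["text"],
                "at": prev["end"],
                "gap_s": round(gap, 3),
            })
    return findings
```

A production rule would likely vary the threshold by pause class, since a gap that is too long between sentences may be normal at a section break.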
Technologies Used
- .NET (net9 target in this branch)
- CLI orchestration + DI
- ASR integration (pipeline stage)
- Montreal Forced Aligner (forced alignment stage)
- Zig for DSP building blocks and future chain execution
- JSON artifact pipeline design