
AMS — Audiobook Mastering Suite (ASR + Forced Alignment + DSP Pipeline)

A multi-stage audiobook pipeline that aligns manuscript text to recorded audio, producing word-level timing artifacts and a foundation for automation in mastering and QC.

audio · dotnet · zig · asr · forced-alignment · dsp · tooling · cli

Executive Summary

AMS is my “magnum opus” tooling project: a CLI-driven pipeline for audiobook production that turns (book text + chapter audio) into structured artifacts like ASR transcripts, anchor maps, hydrated transcripts, and forced-alignment timings—building a foundation for automation in mastering and quality control.

The problem space

Audiobook work is uniquely painful because you need both:

  • Audio correctness (levels, noise floor, spacing, consistency)
  • Text correctness (the spoken audio matches the manuscript, including pacing and boundaries)

Most workflows treat these as separate. AMS tries to unify them by making “text ↔ audio alignment” a first-class artifact.

Design goals

  • Repeatable pipeline: deterministic artifacts, stable outputs
  • Extensible architecture: CLI now, UI/daemon later
  • Language/tool flexibility: use the best tool for each job
    • ASR/alignment where ML helps
    • DSP where low-level performance matters
    • .NET orchestration for workflow, DI, composition, and host flexibility

Pipeline Overview

AMS runs in stages: ASR → anchor discovery → transcript indexing → hydration → forced alignment → timing merge, outputting JSON artifacts at each step so you can debug or swap stages.

Stages and artifacts

  1. ASR stage

    • Produces initial word/token timings from audio
    • Output example: chapter.asr.json
  2. Alignment stage

    • Identifies reliable anchor points between manuscript and ASR output
    • Builds transcript index and hydrated transcript mapping
    • Output examples:
      • chapter.align.anchors.json
      • chapter.align.tx.json
      • chapter.align.hydrate.json
  3. Forced alignment stage (MFA)

    • Refines timings with phoneme/word boundary precision
    • Produces TextGrid + analysis artifacts
  4. Timing merge

    • Merges MFA timing back into the hydrated transcript artifacts
    • Produces “final timing truth” suitable for QC automation
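The stage/artifact contract above can be sketched as a tiny resumable runner. This is illustrative only (PipelineSketch and the stage tuples are not the actual AMS API): each stage writes exactly one JSON artifact, and a stage whose artifact already exists is skipped, which is what makes the pipeline restartable and debuggable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of an artifact-first stage runner: each stage produces
// one artifact on disk; completed stages are skipped on re-run, and every
// intermediate output remains inspectable.
public static class PipelineSketch
{
    public static IReadOnlyList<string> Run(
        string chapterDir,
        IReadOnlyList<(string ArtifactName, Action<string> Produce)> stages)
    {
        var produced = new List<string>();
        foreach (var (artifactName, produce) in stages)
        {
            var path = Path.Combine(chapterDir, artifactName);
            if (!File.Exists(path))      // artifact-first: skip completed stages
                produce(path);
            produced.Add(path);          // every stage output stays on disk
        }
        return produced;
    }
}
```

A run over a chapter would chain `chapter.asr.json` → `chapter.align.anchors.json` → `chapter.align.tx.json` and so on, so a failed MFA pass never forces re-running ASR.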

Why artifact-first matters

Artifact-first pipelines are debuggable:

  • If ASR is wrong, you can see it
  • If anchors drift, you can inspect the anchor file
  • If MFA fails due to OOV words, you can inspect the dictionary generation output

This is the difference between a “magic black box” and a professional pipeline.


Architecture

AMS is structured around use-case commands (single responsibility entry points) and hostable orchestration services, so the exact same core can power CLI, web UI, or a future daemon.

Solution Structure

The solution separates concerns across hosts and a shared core:

Ams.sln
├── Ams.Core           # Domain library - all pipeline logic
├── Ams.Cli            # CLI host with Cocona commands
├── Ams.Web            # Blazor Server SSR host
├── Ams.Web.Client     # Blazor WebAssembly client
├── Ams.Web.Api        # Web API endpoints
├── Ams.UI.Avalonia    # Cross-platform desktop UI
└── Ams.Tests          # Test project

Runtime vs Services Architecture

The architecture separates Runtime (contained execution units) from Services (orchestration):

  • Runtime = Self-contained contexts that own state and artifacts
    • Workspace, BookContext, ChapterContext, AudioBufferContext
  • Services = Orchestration layer that coordinates runtime + external tools
    • AsrService, FFmpegService, AlignmentService

This separation enables different hosts (CLI, Web, Desktop) to share all business logic.

IWorkspace Interface

The key abstraction that enables multi-host architecture:

public interface IWorkspace
{
    /// <summary>
    /// Root directory for this workspace (typically the book folder).
    /// </summary>
    string RootPath { get; }

    /// <summary>
    /// The long-lived book context for this workspace.
    /// </summary>
    BookContext Book { get; }

    /// <summary>
    /// Convenience accessor for the chapter manager.
    /// </summary>
    ChapterManager Chapters => Book.Chapters;

    /// <summary>
    /// Opens (or creates) a chapter context with the supplied options.
    /// </summary>
    ChapterContextHandle OpenChapter(ChapterOpenOptions options);
}

Each host implements this interface: CliWorkspace, BlazorWorkspace, AvaloniaWorkspace. The same Ams.Core powers all of them.

Manager → Context → Documents Pattern

Runtime follows a consistent hierarchy:

  1. Manager - Owns multiple contexts with cursor-based navigation
  2. Context - Single execution context with lifecycle (book, chapter, audio buffer)
  3. Documents - Multiple document slots with lazy loading and dirty tracking

public sealed class ChapterDocuments
{
    private readonly DocumentSlot<TranscriptIndex> _transcript;
    private readonly DocumentSlot<HydratedTranscript> _hydratedTranscript;
    private readonly DocumentSlot<AnchorDocument> _anchors;
    private readonly DocumentSlot<AsrResponse> _asr;
    private readonly DocumentSlot<PauseAdjustmentsDocument> _pauseAdjustments;
    private readonly DocumentSlot<TextGridDocument> _textGrid;

    // Each slot provides lazy loading, dirty tracking, and persistence
    internal bool IsDirty =>
        _transcript.IsDirty ||
        _hydratedTranscript.IsDirty ||
        _anchors.IsDirty /* ... */;
}

This enables efficient memory management—ChapterManager uses LRU eviction to bound memory when processing books with many chapters.
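The eviction idea can be sketched with a small self-contained LRU structure (names here are illustrative, not the actual ChapterManager internals): touching a chapter refreshes its recency, and adding one past capacity evicts the least recently used chapter, whose dirty documents the real pipeline would flush first.

```csharp
using System.Collections.Generic;

// Illustrative LRU cache in the spirit of ChapterManager's eviction: at most
// `capacity` chapter contexts stay resident. Front of the list = most recent.
public sealed class ChapterLruSketch
{
    private readonly int _capacity;
    private readonly LinkedList<string> _order = new();
    private readonly Dictionary<string, LinkedListNode<string>> _nodes = new();

    public ChapterLruSketch(int capacity) => _capacity = capacity;

    // Returns the evicted chapter id, or null if nothing was evicted.
    public string? Touch(string chapterId)
    {
        if (_nodes.TryGetValue(chapterId, out var node))
        {
            _order.Remove(node);            // refresh recency
            _order.AddFirst(node);
            return null;
        }

        _nodes[chapterId] = _order.AddFirst(chapterId);
        if (_nodes.Count <= _capacity) return null;

        var victim = _order.Last!.Value;    // least recently used chapter
        _order.RemoveLast();
        _nodes.Remove(victim);
        return victim;                      // caller flushes/disposes this context
    }
}
```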


Alignment: Anchors, Indexing, Hydration

The alignment strategy is built around finding “high-confidence” anchor points and then expanding into a full manuscript-to-audio map.

Anchor Discovery with LIS

Anchors are stable points where text and ASR strongly agree. The algorithm uses n-gram matching with Longest Increasing Subsequence (LIS) to ensure monotonicity:

public static IReadOnlyList<Anchor> SelectAnchors(
    IReadOnlyList<string> bookTokens,
    IReadOnlyList<int> bookSentenceIndex,
    IReadOnlyList<string> asrTokens,
    AnchorPolicy policy)
{
    var n = policy.NGram;

    // Find unique n-gram matches (book position, ASR position)
    var anchors = Collect(bookTokens, bookSentenceIndex, asrTokens, n,
        okBook: list => list.Count == 1,  // n-gram occurs exactly once in book
        okAsr: list => list.Count == 1,   // n-gram occurs exactly once in ASR
        policy);

    // Density control: relax uniqueness by shortening the n-gram if too few anchors
    var desired = policy.MinAnchors;      // assumed policy knob for anchor density
    if (anchors.Count < desired && n > 2)
    {
        var subPolicy = policy with { NGram = n - 1 };
        return SelectAnchors(bookTokens, bookSentenceIndex, asrTokens, subPolicy);
    }

    // Monotonicity: LIS ensures anchors don't "cross"
    anchors.Sort((x, y) => x.Bp.CompareTo(y.Bp));
    var lisPairs = LisByAp(anchors.Select(a => (a.Bp, a.Ap)).ToList());
    return lisPairs.Select(p => new Anchor(p.bp, p.ap)).ToList();
}

This prevents alignment drift across long chapters by establishing “checkpoint” positions that both streams agree on.
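The LisByAp step referenced above is a standard longest-increasing-subsequence computation; here is a self-contained sketch (my own patience-sorting variant, not the AMS source): given pairs already sorted by book position, it keeps the longest subsequence whose ASR positions strictly increase.

```csharp
using System;
using System.Collections.Generic;

// Sketch of LIS over (bookPos, asrPos) pairs sorted by bookPos: keep the
// longest chain with strictly increasing asrPos. O(k log k) via binary search.
public static class LisSketch
{
    public static List<(int bp, int ap)> LisByAp(IReadOnlyList<(int bp, int ap)> pairs)
    {
        var tails = new List<int>();     // tails[len-1] = index of smallest-ap tail for LIS of that length
        var prev = new int[pairs.Count]; // back-pointers for reconstruction

        for (int i = 0; i < pairs.Count; i++)
        {
            int lo = 0, hi = tails.Count;
            while (lo < hi)              // first tail with ap >= pairs[i].ap (strict increase)
            {
                int mid = (lo + hi) / 2;
                if (pairs[tails[mid]].ap < pairs[i].ap) lo = mid + 1; else hi = mid;
            }
            prev[i] = lo > 0 ? tails[lo - 1] : -1;
            if (lo == tails.Count) tails.Add(i); else tails[lo] = i;
        }

        var result = new List<(int bp, int ap)>();
        for (int i = tails.Count > 0 ? tails[^1] : -1; i >= 0; i = prev[i])
            result.Add(pairs[i]);
        result.Reverse();
        return result;
    }
}
```

Anchors dropped by LIS are exactly the ones that would make the alignment run backwards in time, so discarding them is safe by construction.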

Windowed Dynamic Programming Aligner

Between anchors, a windowed DP aligner fills gaps with phoneme-aware substitution costs:

public static List<AlignResult> AlignWindows(
    IReadOnlyList<string> bookNorm,
    IReadOnlyList<string> asrNorm,
    IReadOnlyList<(int bLo, int bHi, int aLo, int aHi)> windows,
    IReadOnlyDictionary<string, string> equiv,
    ISet<string> fillers)
{
    var results = new List<AlignResult>();
    foreach (var (bLo, bHi, aLo, aHi) in windows)
    {
        int n = bHi - bLo, m = aHi - aLo;   // window sizes (book side, ASR side)
        var dp = new double[n + 1, m + 1];

        // Borders: pure deletions down the first column, pure insertions along the first row
        for (int i = 1; i <= n; i++) dp[i, 0] = dp[i - 1, 0] + DelCost(bookNorm[bLo + i - 1]);
        for (int j = 1; j <= m; j++) dp[0, j] = dp[0, j - 1] + InsCost(asrNorm[aLo + j - 1], fillers);

        // Fill DP with phoneme-aware costs
        for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
        {
            var sub = dp[i - 1, j - 1] + SubCost(bookNorm[bLo + i - 1], asrNorm[aLo + j - 1], equiv);
            var del = dp[i - 1, j] + DelCost(bookNorm[bLo + i - 1]);
            var ins = dp[i, j - 1] + InsCost(asrNorm[aLo + j - 1], fillers);
            dp[i, j] = Math.Min(sub, Math.Min(del, ins));
        }

        // Backtrace over dp to recover the min-cost pairing and add this window's AlignResult (elided)
    }
    return results;
}

The equiv dictionary handles phonetic equivalences (e.g., “gonna” ↔ “going to”), while fillers gives lower insertion cost to speech disfluencies.
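The cost functions themselves might look like the following sketch. These are my own illustrative implementations (names and weights are assumptions, not the AMS source): equivalent pairs substitute for free, near-matches are cheap, and known fillers are cheap to insert so disfluencies don't derail the path.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical DP cost functions in the spirit described above.
public static class CostSketch
{
    public static double SubCost(string book, string asr,
        IReadOnlyDictionary<string, string> equiv)
    {
        if (book == asr) return 0.0;
        if (equiv.TryGetValue(book, out var mapped) && mapped == asr) return 0.0;
        // Crude similarity proxy: shared prefix length vs total length
        int shared = 0, max = Math.Max(book.Length, asr.Length);
        while (shared < Math.Min(book.Length, asr.Length) && book[shared] == asr[shared]) shared++;
        return 1.0 - 0.5 * shared / max;   // in (0.5, 1.0] for non-equivalent words
    }

    public static double InsCost(string asr, ISet<string> fillers)
        => fillers.Contains(asr) ? 0.2 : 1.0;  // fillers are cheap to skip over

    public static double DelCost(string book) => 1.0; // dropped manuscript word
}
```

A real implementation would compare phoneme sequences rather than prefixes, but the shape is the same: the DP prefers matching through "um"s and "gonna"s instead of breaking the alignment.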

Hydration

Hydration takes the index and produces a rich timed transcript suitable for:

  • segmenting chapters by paragraph/scene
  • identifying pacing irregularities
  • building UI overlays (future)

MFA Integration

For production-grade timing, the pipeline integrates Montreal Forced Aligner:

  1. Dictionary generation - Auto-generate pronunciations for OOV words
  2. Acoustic model - Pre-trained English model handles narrator variation
  3. TextGrid output - Phoneme/word boundaries at sample-level precision
  4. Timing merge - MFA timings merged back into hydrated transcript artifacts
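Invoking the aligner amounts to building one CLI call per chapter batch. The sketch below assumes MFA 2.x's `mfa align` argument order (corpus dir, dictionary, acoustic model, output dir); wrapping it behind a service keeps process-spawning out of the hosts.

```csharp
using System.Diagnostics;

// Sketch of shelling out to the MFA CLI. In AMS this would live behind an
// orchestration service, not be called directly by a host.
public static class MfaSketch
{
    public static ProcessStartInfo BuildAlignCommand(
        string corpusDir, string dictionaryPath, string acousticModel, string outputDir)
    {
        var psi = new ProcessStartInfo("mfa")
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        };
        psi.ArgumentList.Add("align");
        psi.ArgumentList.Add(corpusDir);       // wav + transcript pairs
        psi.ArgumentList.Add(dictionaryPath);  // includes auto-generated OOV entries
        psi.ArgumentList.Add(acousticModel);   // e.g. a pre-trained English model
        psi.ArgumentList.Add(outputDir);       // TextGrids land here
        return psi;
    }
}
```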

DSP Strategy (Zig + Plugin Chain Concepts)

AMS treats audio processing as its own domain: orchestration in .NET, DSP in a low-level performant layer, with a long-term plan for chain-based processing and plugin introspection.

Zig DSP Module Pattern

DSP is performance-sensitive and benefits from predictable low-level control. Zig modules export through C ABI for .NET interop:

// Real-time safe DSP with parameter smoothing
pub const Processor = struct {
    sample_rate: f32,
    target_gain: f32,
    current_gain: f32,
    smooth_coeff: f32,

    pub fn init(sample_rate: f32) Processor {
        return .{
            .sample_rate = sample_rate,
            .target_gain = 1.0,
            .current_gain = 1.0,
            .smooth_coeff = 1.0 - @exp(-1.0 / (sample_rate * 0.01)),
        };
    }

    pub fn process(self: *Processor, buffer: []f32) void {
        for (buffer) |*sample| {
            // Real-time safe: no allocations, no locks
            self.current_gain += (self.target_gain - self.current_gain) * self.smooth_coeff;
            sample.* *= self.current_gain;
        }
    }
};

// C ABI export for PInvoke
export fn dsp_process(ctx: *Processor, buf: [*]f32, len: usize) void {
    ctx.process(buf[0..len]);
}

This pattern enables .NET orchestration with Zig performance for audio processing.
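The managed side of that boundary is a thin P/Invoke binding. The library name ("amsdsp") is an assumption for illustration; the signature mirrors the Zig export above: (*Processor, [*]f32, usize).

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of the .NET binding for the Zig C-ABI export. Library name is a
// placeholder; dsp_process matches the export signature shown above.
public static class DspNative
{
    [DllImport("amsdsp", EntryPoint = "dsp_process")]
    private static extern void dsp_process(IntPtr processor, float[] buffer, nuint length);

    public static void Process(IntPtr processor, float[] buffer)
        // float[] pins during the call; no marshaling copies on the hot path
        => dsp_process(processor, buffer, (nuint)buffer.Length);
}
```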

FFmpeg Fluent Filter API

For complex audio chains, a fluent C# API wraps libavfilter:

var asrReady = FfFilterGraph
    .FromBuffer(rawAudio)
    .HighPass(70)                    // Remove rumble
    .FftDenoise(12)                  // Reduce noise floor
    .LoudNorm(-18, 7, -2)            // EBU R128 normalization
    .Custom("pan=mono|c0=0.5*c0+0.5*c1")  // Downmix
    .ToBuffer();

Filters compose naturally and execute as a single FFmpeg process—no intermediate files.
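Under the hood a chain like the one above plausibly compiles to a single libavfilter graph string. The FFmpeg filters and options here are real (`highpass=f`, `afftdn=nr`, `loudnorm=I:LRA:TP`, `pan`); how AMS assembles them is my assumption, sketched as a minimal builder.

```csharp
using System.Collections.Generic;

// Illustrative mapping from the fluent chain to one comma-separated
// libavfilter graph string (one FFmpeg invocation, no temp files).
public sealed class FilterGraphSketch
{
    private readonly List<string> _filters = new();

    public FilterGraphSketch HighPass(int hz)     { _filters.Add($"highpass=f={hz}"); return this; }
    public FilterGraphSketch FftDenoise(int nrDb) { _filters.Add($"afftdn=nr={nrDb}"); return this; }
    public FilterGraphSketch LoudNorm(int i, int lra, int tp)
                                                  { _filters.Add($"loudnorm=I={i}:LRA={lra}:TP={tp}"); return this; }
    public FilterGraphSketch Custom(string raw)   { _filters.Add(raw); return this; }

    public string Build() => string.Join(",", _filters);
}
```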

Prosody-Aware Pause Dynamics

The pipeline classifies pauses by structural context (sentence, paragraph, section):

public PauseAnalysisReport AnalyzeChapter(
    TranscriptIndex transcript,
    BookIndex bookIndex,
    PausePolicy policy)
{
    var sentenceToParagraph = BuildSentenceParagraphMap(bookIndex);
    var headingParagraphIds = BuildHeadingParagraphSet(bookIndex);

    var spans = BuildInterSentenceSpans(transcript, sentenceToParagraph, headingParagraphIds);

    // Group by pause class and compute statistics
    var classStats = new Dictionary<PauseClass, PauseClassSummary>();
    foreach (var group in spans.GroupBy(span => span.Class))
    {
        classStats[group.Key] = PauseClassSummary.FromDurations(
            group.Select(x => x.DurationSec));
    }

    return new PauseAnalysisReport(spans, classStats);  // report shape assumed
}

This enables intelligent pause normalization—paragraph breaks should breathe more than sentence breaks.
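The per-class statistics could be as simple as count plus robust percentiles; the record below is my own sketch (field names assumed), since medians and p95s make better normalization targets than means when a few dramatic pauses skew the distribution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative per-class pause summary: nearest-rank percentiles over the
// observed durations, adequate for QC thresholds and normalization targets.
public sealed record PauseSummarySketch(int Count, double MedianSec, double P95Sec)
{
    public static PauseSummarySketch FromDurations(IEnumerable<double> durationsSec)
    {
        var sorted = durationsSec.OrderBy(d => d).ToArray();
        if (sorted.Length == 0) return new PauseSummarySketch(0, 0, 0);
        return new PauseSummarySketch(
            sorted.Length,
            Percentile(sorted, 0.50),
            Percentile(sorted, 0.95));
    }

    private static double Percentile(double[] sorted, double p)
    {
        // Nearest-rank percentile on a sorted array
        var idx = (int)Math.Ceiling(p * sorted.Length) - 1;
        return sorted[Math.Clamp(idx, 0, sorted.Length - 1)];
    }
}
```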

Plugin chain / node graph direction

The goal is a future node-based UI where:

  • each stage is a node (ASR, alignment, noise profile, limiter, etc.)
  • edges represent data (audio stream, transcript artifact, timing map)
  • every node output is an artifact you can diff and reason about

AMS is structured now so this UI can exist later without rewriting the core.


What Makes This a Portfolio-Grade Project

AMS demonstrates “systems thinking”: multi-stage pipelines, strong separation of concerns, artifact-first debugging, and a realistic plan to evolve from CLI to UI without rewriting the engine.

Engineering themes I’m deliberately showcasing

  • Hostable application core (CLI today, other hosts later)
  • Explicit, inspectable artifacts at each pipeline stage
  • Separation of concerns (alignment vs orchestration vs DSP)
  • Tooling pragmatism (use the right tool for the job)
  • Scalability path (concurrency, batching, deterministic artifacts)

What I would add next

  • Golden test corpus with known-good chapters for regression testing
  • An “operator UI” that visualizes:
    • anchor confidence
    • drift zones
    • timing discrepancies
  • Automated QC rules:
    • long silences
    • inconsistent room tone segments
    • mouth noise classification hooks

Technologies Used

  • .NET (net9 target in this branch)
  • CLI orchestration + DI
  • ASR integration (pipeline stage)
  • Montreal Forced Aligner (forced alignment stage)
  • Zig for DSP building blocks and future chain execution
  • JSON artifact pipeline design