
AMS — Audiobook Mastering Suite (ASR + Forced Alignment + DSP Pipeline)

A multi-stage audiobook pipeline that aligns manuscript text to recorded audio, producing word-level timing artifacts and a foundation for automation in mastering and QC.

audio · dotnet · zig · asr · forced-alignment · dsp · tooling · cli

Executive Summary

AMS is my “magnum opus” tooling project: a CLI-driven pipeline for audiobook production that turns (book text + chapter audio) into structured artifacts like ASR transcripts, anchor maps, hydrated transcripts, and forced-alignment timings—building a foundation for automation in mastering and quality control.

The problem space

Audiobook work is uniquely painful because you need both:

  • Audio correctness (levels, noise floor, spacing, consistency)
  • Text correctness (the spoken audio matches the manuscript, including pacing and boundaries)

Most workflows treat these as separate. AMS tries to unify them by making “text ↔ audio alignment” a first-class artifact.

Design goals

  • Repeatable pipeline: deterministic artifacts, stable outputs
  • Extensible architecture: CLI now, UI/daemon later
  • Language/tool flexibility: use the best tool for each job
    • ASR/alignment where ML helps
    • DSP where low-level performance matters
    • .NET orchestration for workflow, DI, composition, and host flexibility

Pipeline Overview

AMS runs in stages: ASR → anchor discovery → transcript indexing → hydration → forced alignment → timing merge, outputting JSON artifacts at each step so you can debug or swap stages.

Stages and artifacts

  1. ASR stage

    • Produces initial word/token timings from audio
    • Output example: chapter.asr.json
  2. Alignment stage

    • Identifies reliable anchor points between manuscript and ASR output
    • Builds transcript index and hydrated transcript mapping
    • Output examples:
      • chapter.align.anchors.json
      • chapter.align.tx.json
      • chapter.align.hydrate.json
  3. Forced alignment stage (MFA)

    • Refines timings with phoneme/word boundary precision
    • Produces TextGrid + analysis artifacts
  4. Timing merge

    • Merges MFA timing back into the hydrated transcript artifacts
    • Produces “final timing truth” suitable for QC automation
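The stage/artifact contract above can be sketched as a tiny resumable runner. This is illustrative only (PipelineSketch and the stage tuples are not the actual AMS API): each stage writes exactly one JSON artifact, and a stage whose artifact already exists is skipped, which is what makes the pipeline restartable and debuggable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of an artifact-first stage runner: each stage produces
// one artifact on disk; completed stages are skipped on re-run, and every
// intermediate output remains inspectable.
public static class PipelineSketch
{
    public static IReadOnlyList<string> Run(
        string chapterDir,
        IReadOnlyList<(string ArtifactName, Action<string> Produce)> stages)
    {
        var produced = new List<string>();
        foreach (var (artifactName, produce) in stages)
        {
            var path = Path.Combine(chapterDir, artifactName);
            if (!File.Exists(path))      // artifact-first: skip completed stages
                produce(path);
            produced.Add(path);          // every stage output stays on disk
        }
        return produced;
    }
}
```

A run over a chapter would chain `chapter.asr.json` → `chapter.align.anchors.json` → `chapter.align.tx.json` and so on, so a failed MFA pass never forces re-running ASR.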

Why artifact-first matters

Artifact-first pipelines are debuggable:

  • If ASR is wrong, you can see it
  • If anchors drift, you can inspect the anchor file
  • If MFA fails due to OOV words, you can inspect the dictionary generation output

This is the difference between a “magic black box” and a professional pipeline.


Architecture

AMS is structured around use-case commands (single responsibility entry points) and hostable orchestration services, so the exact same core can power CLI, web UI, or a future daemon.

Solution Structure

The solution separates concerns across hosts and a shared core:

Ams.sln
├── Ams.Core           # Domain library - all pipeline logic
├── Ams.Cli            # CLI host with Cocona commands
├── Ams.Web            # Blazor Server SSR host
├── Ams.Web.Client     # Blazor WebAssembly client
├── Ams.Web.Api        # Web API endpoints
├── Ams.UI.Avalonia    # Cross-platform desktop UI
└── Ams.Tests          # Test project

Runtime vs Services Architecture

The architecture separates Runtime (contained execution units) from Services (orchestration):

  • Runtime = Self-contained contexts that own state and artifacts
    • Workspace, BookContext, ChapterContext, AudioBufferContext
  • Services = Orchestration layer that coordinates runtime + external tools
    • AsrService, FFmpegService, AlignmentService

This separation enables different hosts (CLI, Web, Desktop) to share all business logic.

IWorkspace Interface

The key abstraction that enables multi-host architecture:

public interface IWorkspace
{
    /// <summary>
    /// Root directory for this workspace (typically the book folder).
    /// </summary>
    string RootPath { get; }

    /// <summary>
    /// The long-lived book context for this workspace.
    /// </summary>
    BookContext Book { get; }

    /// <summary>
    /// Convenience accessor for the chapter manager.
    /// </summary>
    ChapterManager Chapters => Book.Chapters;

    /// <summary>
    /// Opens (or creates) a chapter context with the supplied options.
    /// </summary>
    ChapterContextHandle OpenChapter(ChapterOpenOptions options);
}

Each host implements this interface: CliWorkspace, BlazorWorkspace, AvaloniaWorkspace. The same Ams.Core powers all of them.

Manager → Context → Documents Pattern

Runtime follows a consistent hierarchy:

  1. Manager - Owns multiple contexts with cursor-based navigation
  2. Context - Single execution context with lifecycle (book, chapter, audio buffer)
  3. Documents - Multiple document slots with lazy loading and dirty tracking

public sealed class ChapterDocuments
{
    private readonly DocumentSlot<TranscriptIndex> _transcript;
    private readonly DocumentSlot<HydratedTranscript> _hydratedTranscript;
    private readonly DocumentSlot<AnchorDocument> _anchors;
    private readonly DocumentSlot<AsrResponse> _asr;
    private readonly DocumentSlot<PauseAdjustmentsDocument> _pauseAdjustments;
    private readonly DocumentSlot<TextGridDocument> _textGrid;

    // Each slot provides lazy loading, dirty tracking, and persistence
    internal bool IsDirty =>
        _transcript.IsDirty ||
        _hydratedTranscript.IsDirty ||
        _anchors.IsDirty /* ... */;
}

This enables efficient memory management—ChapterManager uses LRU eviction to bound memory when processing books with many chapters.
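The eviction idea can be sketched with a small self-contained LRU structure (names here are illustrative, not the actual ChapterManager internals): touching a chapter refreshes its recency, and adding one past capacity evicts the least recently used chapter, whose dirty documents the real pipeline would flush first.

```csharp
using System.Collections.Generic;

// Illustrative LRU cache in the spirit of ChapterManager's eviction: at most
// `capacity` chapter contexts stay resident. Front of the list = most recent.
public sealed class ChapterLruSketch
{
    private readonly int _capacity;
    private readonly LinkedList<string> _order = new();
    private readonly Dictionary<string, LinkedListNode<string>> _nodes = new();

    public ChapterLruSketch(int capacity) => _capacity = capacity;

    // Returns the evicted chapter id, or null if nothing was evicted.
    public string? Touch(string chapterId)
    {
        if (_nodes.TryGetValue(chapterId, out var node))
        {
            _order.Remove(node);            // refresh recency
            _order.AddFirst(node);
            return null;
        }

        _nodes[chapterId] = _order.AddFirst(chapterId);
        if (_nodes.Count <= _capacity) return null;

        var victim = _order.Last!.Value;    // least recently used chapter
        _order.RemoveLast();
        _nodes.Remove(victim);
        return victim;                      // caller flushes/disposes this context
    }
}
```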


Alignment: Anchors, Indexing, Hydration

The alignment strategy is built around finding “high-confidence” anchor points and then expanding into a full manuscript-to-audio map.

Anchor Discovery with LIS

Anchors are stable points where text and ASR strongly agree. The algorithm uses n-gram matching with Longest Increasing Subsequence (LIS) to ensure monotonicity:

public static IReadOnlyList<Anchor> SelectAnchors(
    IReadOnlyList<string> bookTokens,
    IReadOnlyList<int> bookSentenceIndex,
    IReadOnlyList<string> asrTokens,
    AnchorPolicy policy)
{
    var n = policy.NGram;

    // Find unique n-gram matches (book position, ASR position)
    var anchors = Collect(bookTokens, bookSentenceIndex, asrTokens, n,
        okBook: list => list.Count == 1,  // n-gram occurs exactly once in book
        okAsr: list => list.Count == 1,   // n-gram occurs exactly once in ASR
        policy);

    // Density control: relax uniqueness by shortening the n-gram if too few anchors
    var desired = policy.MinAnchors;      // assumed policy knob for anchor density
    if (anchors.Count < desired && n > 2)
    {
        var subPolicy = policy with { NGram = n - 1 };
        return SelectAnchors(bookTokens, bookSentenceIndex, asrTokens, subPolicy);
    }

    // Monotonicity: LIS ensures anchors don't "cross"
    anchors.Sort((x, y) => x.Bp.CompareTo(y.Bp));
    var lisPairs = LisByAp(anchors.Select(a => (a.Bp, a.Ap)).ToList());
    return lisPairs.Select(p => new Anchor(p.bp, p.ap)).ToList();
}

This prevents alignment drift across long chapters by establishing “checkpoint” positions that both streams agree on.
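The LisByAp step referenced above is a standard longest-increasing-subsequence computation; here is a self-contained sketch (my own patience-sorting variant, not the AMS source): given pairs already sorted by book position, it keeps the longest subsequence whose ASR positions strictly increase.

```csharp
using System;
using System.Collections.Generic;

// Sketch of LIS over (bookPos, asrPos) pairs sorted by bookPos: keep the
// longest chain with strictly increasing asrPos. O(k log k) via binary search.
public static class LisSketch
{
    public static List<(int bp, int ap)> LisByAp(IReadOnlyList<(int bp, int ap)> pairs)
    {
        var tails = new List<int>();     // tails[len-1] = index of smallest-ap tail for LIS of that length
        var prev = new int[pairs.Count]; // back-pointers for reconstruction

        for (int i = 0; i < pairs.Count; i++)
        {
            int lo = 0, hi = tails.Count;
            while (lo < hi)              // first tail with ap >= pairs[i].ap (strict increase)
            {
                int mid = (lo + hi) / 2;
                if (pairs[tails[mid]].ap < pairs[i].ap) lo = mid + 1; else hi = mid;
            }
            prev[i] = lo > 0 ? tails[lo - 1] : -1;
            if (lo == tails.Count) tails.Add(i); else tails[lo] = i;
        }

        var result = new List<(int bp, int ap)>();
        for (int i = tails.Count > 0 ? tails[^1] : -1; i >= 0; i = prev[i])
            result.Add(pairs[i]);
        result.Reverse();
        return result;
    }
}
```

Anchors dropped by LIS are exactly the ones that would make the alignment run backwards in time, so discarding them is safe by construction.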

Windowed Dynamic Programming Aligner

Between anchors, a windowed DP aligner fills gaps with phoneme-aware substitution costs:

public static List<AlignResult> AlignWindows(
    IReadOnlyList<string> bookNorm,
    IReadOnlyList<string> asrNorm,
    IReadOnlyList<(int bLo, int bHi, int aLo, int aHi)> windows,
    IReadOnlyDictionary<string, string> equiv,
    ISet<string> fillers)
{
    var results = new List<AlignResult>();
    foreach (var (bLo, bHi, aLo, aHi) in windows)
    {
        int n = bHi - bLo, m = aHi - aLo;   // window sizes (book side, ASR side)
        var dp = new double[n + 1, m + 1];

        // Borders: pure deletions down the first column, pure insertions along the first row
        for (int i = 1; i <= n; i++) dp[i, 0] = dp[i - 1, 0] + DelCost(bookNorm[bLo + i - 1]);
        for (int j = 1; j <= m; j++) dp[0, j] = dp[0, j - 1] + InsCost(asrNorm[aLo + j - 1], fillers);

        // Fill DP with phoneme-aware costs
        for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
        {
            var sub = dp[i - 1, j - 1] + SubCost(bookNorm[bLo + i - 1], asrNorm[aLo + j - 1], equiv);
            var del = dp[i - 1, j] + DelCost(bookNorm[bLo + i - 1]);
            var ins = dp[i, j - 1] + InsCost(asrNorm[aLo + j - 1], fillers);
            dp[i, j] = Math.Min(sub, Math.Min(del, ins));
        }

        // Backtrace over dp to recover the min-cost pairing and add this window's AlignResult (elided)
    }
    return results;
}

The equiv dictionary handles phonetic equivalences (e.g., “gonna” ↔ “going to”), while fillers gives lower insertion cost to speech disfluencies.
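The cost functions themselves might look like the following sketch. These are my own illustrative implementations (names and weights are assumptions, not the AMS source): equivalent pairs substitute for free, near-matches are cheap, and known fillers are cheap to insert so disfluencies don't derail the path.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical DP cost functions in the spirit described above.
public static class CostSketch
{
    public static double SubCost(string book, string asr,
        IReadOnlyDictionary<string, string> equiv)
    {
        if (book == asr) return 0.0;
        if (equiv.TryGetValue(book, out var mapped) && mapped == asr) return 0.0;
        // Crude similarity proxy: shared prefix length vs total length
        int shared = 0, max = Math.Max(book.Length, asr.Length);
        while (shared < Math.Min(book.Length, asr.Length) && book[shared] == asr[shared]) shared++;
        return 1.0 - 0.5 * shared / max;   // in (0.5, 1.0] for non-equivalent words
    }

    public static double InsCost(string asr, ISet<string> fillers)
        => fillers.Contains(asr) ? 0.2 : 1.0;  // fillers are cheap to skip over

    public static double DelCost(string book) => 1.0; // dropped manuscript word
}
```

A real implementation would compare phoneme sequences rather than prefixes, but the shape is the same: the DP prefers matching through "um"s and "gonna"s instead of breaking the alignment.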

Hydration

Hydration takes the index and produces a rich timed transcript suitable for:

  • segmenting chapters by paragraph/scene
  • identifying pacing irregularities
  • building UI overlays (future)

MFA Integration

For production-grade timing, the pipeline integrates Montreal Forced Aligner:

  1. Dictionary generation - Auto-generate pronunciations for OOV words
  2. Acoustic model - Pre-trained English model handles narrator variation
  3. TextGrid output - Phoneme/word boundaries at sample-level precision
  4. Timing merge - MFA timings merged back into hydrated transcript artifacts
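Invoking the aligner amounts to building one CLI call per chapter batch. The sketch below assumes MFA 2.x's `mfa align` argument order (corpus dir, dictionary, acoustic model, output dir); wrapping it behind a service keeps process-spawning out of the hosts.

```csharp
using System.Diagnostics;

// Sketch of shelling out to the MFA CLI. In AMS this would live behind an
// orchestration service, not be called directly by a host.
public static class MfaSketch
{
    public static ProcessStartInfo BuildAlignCommand(
        string corpusDir, string dictionaryPath, string acousticModel, string outputDir)
    {
        var psi = new ProcessStartInfo("mfa")
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        };
        psi.ArgumentList.Add("align");
        psi.ArgumentList.Add(corpusDir);       // wav + transcript pairs
        psi.ArgumentList.Add(dictionaryPath);  // includes auto-generated OOV entries
        psi.ArgumentList.Add(acousticModel);   // e.g. a pre-trained English model
        psi.ArgumentList.Add(outputDir);       // TextGrids land here
        return psi;
    }
}
```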

DSP Strategy (Zig + Plugin Chain Concepts)

AMS treats audio processing as its own domain: orchestration in .NET, DSP in a low-level performant layer, with a long-term plan for chain-based processing and plugin introspection.

Zig DSP Module Pattern

DSP is performance-sensitive and benefits from predictable low-level control. Zig modules export through C ABI for .NET interop:

// Real-time safe DSP with parameter smoothing
pub const Processor = struct {
    sample_rate: f32,
    target_gain: f32,
    current_gain: f32,
    smooth_coeff: f32,

    pub fn init(sample_rate: f32) Processor {
        return .{
            .sample_rate = sample_rate,
            .target_gain = 1.0,
            .current_gain = 1.0,
            .smooth_coeff = 1.0 - @exp(-1.0 / (sample_rate * 0.01)),
        };
    }

    pub fn process(self: *Processor, buffer: []f32) void {
        for (buffer) |*sample| {
            // Real-time safe: no allocations, no locks
            self.current_gain += (self.target_gain - self.current_gain) * self.smooth_coeff;
            sample.* *= self.current_gain;
        }
    }
};

// C ABI export for PInvoke
export fn dsp_process(ctx: *Processor, buf: [*]f32, len: usize) void {
    ctx.process(buf[0..len]);
}

This pattern enables .NET orchestration with Zig performance for audio processing.
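The managed side of that boundary is a thin P/Invoke binding. The library name ("amsdsp") is an assumption for illustration; the signature mirrors the Zig export above: (*Processor, [*]f32, usize).

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of the .NET binding for the Zig C-ABI export. Library name is a
// placeholder; dsp_process matches the export signature shown above.
public static class DspNative
{
    [DllImport("amsdsp", EntryPoint = "dsp_process")]
    private static extern void dsp_process(IntPtr processor, float[] buffer, nuint length);

    public static void Process(IntPtr processor, float[] buffer)
        // float[] pins during the call; no marshaling copies on the hot path
        => dsp_process(processor, buffer, (nuint)buffer.Length);
}
```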

FFmpeg Fluent Filter API

For complex audio chains, a fluent C# API wraps libavfilter:

var asrReady = FfFilterGraph
    .FromBuffer(rawAudio)
    .HighPass(70)                    // Remove rumble
    .FftDenoise(12)                  // Reduce noise floor
    .LoudNorm(-18, 7, -2)            // EBU R128 normalization
    .Custom("pan=mono|c0=0.5*c0+0.5*c1")  // Downmix
    .ToBuffer();

Filters compose naturally and execute as a single FFmpeg process—no intermediate files.
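Under the hood a chain like the one above plausibly compiles to a single libavfilter graph string. The FFmpeg filters and options here are real (`highpass=f`, `afftdn=nr`, `loudnorm=I:LRA:TP`, `pan`); how AMS assembles them is my assumption, sketched as a minimal builder.

```csharp
using System.Collections.Generic;

// Illustrative mapping from the fluent chain to one comma-separated
// libavfilter graph string (one FFmpeg invocation, no temp files).
public sealed class FilterGraphSketch
{
    private readonly List<string> _filters = new();

    public FilterGraphSketch HighPass(int hz)     { _filters.Add($"highpass=f={hz}"); return this; }
    public FilterGraphSketch FftDenoise(int nrDb) { _filters.Add($"afftdn=nr={nrDb}"); return this; }
    public FilterGraphSketch LoudNorm(int i, int lra, int tp)
                                                  { _filters.Add($"loudnorm=I={i}:LRA={lra}:TP={tp}"); return this; }
    public FilterGraphSketch Custom(string raw)   { _filters.Add(raw); return this; }

    public string Build() => string.Join(",", _filters);
}
```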

Prosody-Aware Pause Dynamics

The pipeline classifies pauses by structural context (sentence, paragraph, section):

public PauseAnalysisReport AnalyzeChapter(
    TranscriptIndex transcript,
    BookIndex bookIndex,
    PausePolicy policy)
{
    var sentenceToParagraph = BuildSentenceParagraphMap(bookIndex);
    var headingParagraphIds = BuildHeadingParagraphSet(bookIndex);

    var spans = BuildInterSentenceSpans(transcript, sentenceToParagraph, headingParagraphIds);

    // Group by pause class and compute statistics
    var classStats = new Dictionary<PauseClass, PauseClassSummary>();
    foreach (var group in spans.GroupBy(span => span.Class))
    {
        classStats[group.Key] = PauseClassSummary.FromDurations(
            group.Select(x => x.DurationSec));
    }

    return new PauseAnalysisReport(spans, classStats);  // report shape assumed
}

This enables intelligent pause normalization—paragraph breaks should breathe more than sentence breaks.
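The per-class statistics could be as simple as count plus robust percentiles; the record below is my own sketch (field names assumed), since medians and p95s make better normalization targets than means when a few dramatic pauses skew the distribution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative per-class pause summary: nearest-rank percentiles over the
// observed durations, adequate for QC thresholds and normalization targets.
public sealed record PauseSummarySketch(int Count, double MedianSec, double P95Sec)
{
    public static PauseSummarySketch FromDurations(IEnumerable<double> durationsSec)
    {
        var sorted = durationsSec.OrderBy(d => d).ToArray();
        if (sorted.Length == 0) return new PauseSummarySketch(0, 0, 0);
        return new PauseSummarySketch(
            sorted.Length,
            Percentile(sorted, 0.50),
            Percentile(sorted, 0.95));
    }

    private static double Percentile(double[] sorted, double p)
    {
        // Nearest-rank percentile on a sorted array
        var idx = (int)Math.Ceiling(p * sorted.Length) - 1;
        return sorted[Math.Clamp(idx, 0, sorted.Length - 1)];
    }
}
```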

Plugin chain / node graph direction

The goal is a future node-based UI where:

  • each stage is a node (ASR, alignment, noise profile, limiter, etc.)
  • edges represent data (audio stream, transcript artifact, timing map)
  • every node output is an artifact you can diff and reason about

AMS is structured now so this UI can exist later without rewriting the core.


What Makes This a Portfolio-Grade Project

AMS demonstrates “systems thinking”: multi-stage pipelines, strong separation of concerns, artifact-first debugging, and a realistic plan to evolve from CLI to UI without rewriting the engine.

Engineering themes I’m deliberately showcasing

  • Hostable application core (CLI today, other hosts later)
  • Explicit, inspectable artifacts at each pipeline stage
  • Separation of concerns (alignment vs orchestration vs DSP)
  • Tooling pragmatism (use the right tool for the job)
  • Scalability path (concurrency, batching, deterministic artifacts)

What I would add next

  • Golden test corpus with known-good chapters for regression testing
  • An “operator UI” that visualizes:
    • anchor confidence
    • drift zones
    • timing discrepancies
  • Automated QC rules:
    • long silences
    • inconsistent room tone segments
    • mouth noise classification hooks

Technologies Used

  • .NET (net9 target in this branch)
  • CLI orchestration + DI
  • ASR integration (pipeline stage)
  • Montreal Forced Aligner (forced alignment stage)
  • Zig for DSP building blocks and future chain execution
  • JSON artifact pipeline design