December 2025
The disturbance we're experiencing as AI tools infiltrate every dimension of work is so deep and so general that it's become hard, logically and psychologically, to pause long enough within the borders of any particular domain and consider its specific requirements. Every concrete example immediately suggests a hundred analogs, and it seems almost foolish to pay close attention to, say, the complaints of teachers about AI-assisted cheating, the worries of visual artists about robot competition, the contamination of research data by simulated observational records, and the deceptive hyperrealism of confabulated news reports. Aren't these all versions of the same problem?
The answer is no. Although we have been developing a low-level, domain-agnostic approach to securing digital data provenance, its real-world usefulness depends entirely on the technical stack, and the specificity and heterogeneity of these stacks across domains is almost impossible to overstate.
So, although our research on the Tycho protocol focuses on securing trust in data generally, we've had to come to terms with the fact that Tycho clients will have endless variety. We've committed ourselves to creating some reference implementations in key domains, but these implementations are not products. Products require deeper commitments to domain-specific requirements. Instead, we think of our applications as documents, showing what Tycho can do and teaching developers (human and robot) how to use the network.
This first post on domain-specific clients will focus on audio engineering, asking the question: How can we render audio files more trustworthy?
Voice cloning has gone from a laboratory novelty to a cheap commodity within the last three years. This sudden increase in capacity for simulation has some predictable effects. Not only is it easier to present fake stuff as real; it's also easier to dismiss real stuff as fake.
A few notorious cases provide common reference points: in Slovakia's 2023 election, Michal Simecka's campaign against the pro-Russian Smer party was undermined by fake audio of his voice supposedly discussing election fraud.1 In 2024, thousands of New Hampshire voters received an AI-cloned call from "President Biden" telling them to skip the primary.2 There are also regular reports of scam calls with impersonated identities: the jailed daughter, the stranded friend, etc.
The lesson of these first experiences with high-quality AI-assisted audio fraud is that a wised-up listener will wrap themselves in an impenetrable blanket of skepticism. But skepticism is a costly defense, as lack of trust in evidence leaves us to be governed by prior belief.3
Of course, real tactical advantages don't need academic endorsements to become common, and the claim that "an AI faked my voice" has already been made numerous times by men in trouble.
There are also cultural costs associated with attribution failures. There is no current way to sort out attribution and reward for creative works used to train AIs, but whatever system evolves, disputes over provenance are inevitable. As long as there is no confidence in what was used for training, how consent was obtained, what portion of the work was used, and where the derivative products ended up, courts and artists will be helpless. Impoverishment of musicians and authors can hardly be seen as a barrier to progress. (It is more like a time-honored tradition.) But there are commercial interests that can't be so easily abused, and they are going to demand traceability. The Midjourney and MiniMax lawsuits premiered the legal template for Hollywood's defense of visual content; an audio sequel is expected in 2026.
Simple watermarking cannot solve this problem. C2PA credentials can attach provenance metadata to audio files, but these credentials represent trust in the signer, not trust in the audio itself. Deceptive recordings, robot-generated audio, and abusive ripoffs can carry credentials too.4 ISRC codes identify recordings, and fingerprinting services like Shazam, YouTube Content ID, and Audible Magic can detect unauthorized copies—but only of works already in a database. A new recording has no reference point.
This is the fundamental limit of file-level provenance: it can record who signed a file and what edits were declared, but by the time a file exists, the audio could have come from anywhere.
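To make that limit concrete, here is a minimal sketch of our own (not part of any C2PA toolchain), using an HMAC as a stand-in for a credential signature. The check proves only that the key holder endorsed the bytes; it is equally satisfied whether those bytes came from a microphone or a voice model.

```python
import hashlib
import hmac

SIGNER_KEY = b"demo-signer-key"  # hypothetical signer credential

def sign_file(audio_bytes: bytes) -> bytes:
    """A file-level credential: binds the signer to these bytes."""
    return hmac.new(SIGNER_KEY, audio_bytes, hashlib.sha256).digest()

def verify_file(audio_bytes: bytes, tag: bytes) -> bool:
    """Proves the signer endorsed the bytes; says nothing about origin."""
    return hmac.compare_digest(sign_file(audio_bytes), tag)

# Two files: one notionally recorded, one notionally synthesized.
recorded = b"\x01\x02\x03" * 1000
synthetic = b"\x7f\x00\x7f" * 1000

# The signer can credential both; verification cannot tell them apart.
for audio in (recorded, synthetic):
    assert verify_file(audio, sign_file(audio))
```

The signature is doing its job correctly in both cases; the problem is that its job begins only after the file exists.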
To understand how Tycho could provide a different level of guarantees, consider the layers through which audio passes on a typical Linux system: at the bottom, the ADC hardware; above it, the kernel driver; then the ALSA kernel interface; then a sound server such as PipeWire or PulseAudio; and at the top, the application that reads the stream.
Each layer above the hardware represents an opportunity for manipulation. An application-level signature leaves every layer below it unattested. But the closer we move toward the digital-physical border, the smaller the attack surface. At the driver level, we can hash and sign audio buffers the moment they arrive from the ADC, before any user-space process can touch them. This is what makes hardware-level attestation fundamentally different from file-level signing: the guarantees propagate downstream to every application that consumes the audio, rather than being applied after the fact to data of unknown origin.
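As a sketch of the idea (ours; Tycho's actual buffer handling is not specified here), a driver could fold each incoming ADC buffer into a running hash chain, so that a single final digest commits to the entire capture in order:

```python
import hashlib

def chain_buffers(buffers):
    """Fold each ADC buffer into a hash chain as it arrives.

    chain_0 = H("tycho-init"); chain_i = H(chain_{i-1} || buffer_i).
    The final digest commits to every buffer and their order. A
    device-held key would then sign this digest (signing omitted).
    """
    chain = hashlib.sha256(b"tycho-init").digest()
    for buf in buffers:
        chain = hashlib.sha256(chain + buf).digest()
    return chain

# Stand-in ADC buffers (four 512-byte frames).
capture = [bytes([i]) * 512 for i in range(4)]
digest = chain_buffers(capture)

# Any later modification breaks the chain...
tampered = list(capture)
tampered[2] = b"\x00" * 512
assert chain_buffers(tampered) != digest
# ...and so does reordering the buffers.
assert chain_buffers(capture[::-1]) != digest
```

Because the digest is computed before any user-space process sees the samples, every downstream consumer can check the audio against it rather than against a signer's after-the-fact claim.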
The Tycho protocol provides cryptographic links between phenomena and their measurements, enabling us to answer questions like: Was this audio produced by a physical capture device? When, and by which hardware? Has it been altered since it was digitized?
We have an intuition that the Linux audio community can figure out how to integrate Tycho guarantees at the digital-physical border where analog becomes digital. By hashing and signing audio the moment it leaves the ADC, before any other software can touch it, authentic sound becomes possible.5
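One way a downstream application might consume such a guarantee is as a small attestation record binding the capture digest to a device identity and timestamp. The record shape below is purely illustrative (not Tycho's actual format), with an HMAC standing in for a device-key signature:

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict

DEVICE_KEY = b"adc-device-key"  # hypothetical per-device secret

@dataclass
class CaptureAttestation:
    device_id: str     # which ADC produced the audio
    captured_at: int   # capture timestamp (Unix seconds)
    chain_digest: str  # hex digest committing to all buffers, in order

    def signature(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()

def verify(att: CaptureAttestation, sig: str, recomputed: str) -> bool:
    """Accept only if the device signature checks out AND the audio we
    actually received hashes to the attested digest."""
    return hmac.compare_digest(att.signature(), sig) and att.chain_digest == recomputed

att = CaptureAttestation("mic-0", 1735689600, "ab" * 32)
sig = att.signature()
assert verify(att, sig, "ab" * 32)          # audio matches the attestation
assert not verify(att, sig, "cd" * 32)      # audio was swapped or edited
```

The two-part check is the point: a valid signature alone would only reproduce the file-level guarantee, but tying it to a digest computed at the ADC boundary links the bytes in hand back to a physical capture event.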
If this is something you'd like to work on with us, please get in touch.
1 For a nuanced account, see: Beyond the deepfake hype: AI, democracy, and "the Slovak case." ↩
2 You can listen to the call yourself here: Fake Biden robocall tells voters to skip New Hampshire primary ↩
3 See the now classic paper by Chesney, Robert, and Danielle Keats Citron, Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security. ↩
4 C2PA timestamps rely on the signer's own clock, with no anchor to external consensus time, so backdating is trivial. ↩
5 Thank you to the extraordinary audio tools maker Billy Putnam for his advice on this topic. ↩