Ben Wolfson

When Music Becomes Typing

Can we hear the letters typed on a keyboard at 170+ wpm?

Benjamin Wolfson

Introduction

I began this research with the intention of finding an acoustic side-channel attack capable of context-free keystroke recovery at 170+ wpm.

Existing side-channel attacks in typing contexts largely ignore the adversarial nature of typing speed, which greatly influences the quality of data collection. At the upper end of the words-per-minute spectrum, typing transforms from a mostly discrete process into a continuous stream of precisely controlled mechanical impulses, and the keyboard becomes a physically resonant extension of the user under continuous deformation. Recovering keystroke location from acoustic emanations under these conditions is likely analytically intractable with previously explored techniques.

However, careful waveform analysis of the acoustic microstructures in the audio of a single 175 wpm typing test pointed toward deeper, more interpretable mechanics, and led to a reformulation of the problem as one bounded by physical process, not just probabilistic search.

Typing in Cursive

The act of typing is a linear sequence of fine motor events when observed directly. A user presses a letter key, then another letter key, and eventually the space bar to form a word, repeating the process until a thought has been transcribed.

When observed indirectly through acoustic emanations, the task of perception becomes non-linear. The sound wave produced by a single keystroke has irregularities that we register as pleasant, neutral, or unpleasant across a variety of qualitative attributes. However, if one types fast enough, the human ear loses its ability to parse keystrokes individually. This is typing in cursive: the moment when typing ceases to be heard as a series of disconnected clicks and becomes a flowing, interconnected sheet of variable frequencies, that is, music.

Defining Typing State

To better understand what I mean by typing in cursive, let's build a transition diagram of the typing process starting with a simplified example of the discrete form.

We might observe the following emissions sequence from a novice typist who uses the hunt-and-peck approach:

Figure 1: Discrete typing: a series of distinct keystrokes that produce a series of distinct emissions.

$\Delta t_{i}$ : Transition latencies.

$k(t)$ : A series of discrete keypresses as a function of time.

$e(t)$ : A continuous function of corresponding sound emissions. This is what the audio waveform from a single microphone placed in proximity to the keyboard would look like.

When one types quickly enough, however, the aftershocks of the impulse delivered by each keystroke to the keyboard chassis reverberate long enough to be interrupted by the next keystroke, resulting in a smearing effect. Translating the previous example to its cursive counterpart, we might observe the following from a proficient touch typist:

Figure 2: Typing in cursive.
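One way to make the smearing concrete is to model each keystroke's reverberation as an exponentially decaying envelope and ask how much of it remains when the next key lands. The sketch below is illustrative only: the decay constant `TAU_S`, the example intervals, and the function names are assumptions, not measurements from this research.

```python
import math

TAU_S = 0.03  # assumed decay constant of the chassis reverberation (seconds)

def residual(dt_s, tau_s=TAU_S):
    """Fraction of one keystroke's envelope left after dt_s seconds."""
    return math.exp(-dt_s / tau_s)

def envelope(t, onsets_s, tau_s=TAU_S):
    """Summed envelope at time t of every keystroke that has already landed."""
    return sum(math.exp(-(t - t0) / tau_s) for t0 in onsets_s if t0 <= t)

# Hypothetical onset times: hunt-and-peck versus fast touch typing.
discrete = [0.0, 0.4, 0.8]
cursive = [0.0, 0.07, 0.14]

print(residual(0.4))            # effectively zero: each click stands alone
print(residual(0.07))           # roughly a tenth of the previous click remains
print(envelope(0.14, cursive))  # above 1.0: earlier keystrokes still ringing
```

Under these assumed numbers the discrete typist's keystrokes never overlap, while every cursive keystroke lands on top of the tail of the one before it.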

In essence, this analogy describes the reduction of the signal-to-noise ratio (SNR) in the typing emissions function, with our previous examples representing two poles on the spectrum of signal clarity. It also allows us to define the emissions spectrum more formally with the following asymptotic transformation:

To further develop this concept, we need a new framework for thinking about what a keyboard really is when reduced to its fundamental physical processes.

The Keyboard as a Resonant Manifold

If we consider the keyboard as a manifold, the act of typing causes this manifold to deform asymmetrically through time. From a top-down perspective, this manifold is just a two-dimensional grid of keys. By extension, there must be high similarity between adjacent keys in their fundamental properties, where each key possesses its own unique resonance when actuated under ideal conditions.

Figure 3: A top-down perspective of the Happy Hacking Keyboard (HHKB), famous for its all-plastic construction and Topre keyswitches, which produce a distinctive "thock" when typing. We would expect "A" to have a closer resonance with "Q" than "P", for instance, with each key having its own unique sound profile.
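As a rough illustration of the adjacency idea, physical distance between key centers can serve as a stand-in for expected resonance similarity. This is a sketch under stated assumptions: the coordinates below are in key units with illustrative row offsets, not measured from an HHKB, and whether distance actually predicts acoustic similarity is exactly what remains to be tested.

```python
import math

# Assumed key-center layout in key units (1u = one key width).
# Each row is (x of the row's first letter key edge, y of the row, letters).
# The row offsets are illustrative assumptions, not measurements.
ROWS = [
    (1.5, 1.0, "QWERTYUIOP"),
    (1.75, 2.0, "ASDFGHJKL"),
]

def key_positions(rows=ROWS):
    """Map each letter to the (x, y) center of its key."""
    pos = {}
    for x_start, y, letters in rows:
        for i, ch in enumerate(letters):
            pos[ch] = (x_start + i + 0.5, y)
    return pos

def key_distance(a, b, pos=None):
    """Euclidean distance between two key centers, in key units."""
    pos = pos or key_positions()
    (xa, ya), (xb, yb) = pos[a], pos[b]
    return math.hypot(xa - xb, ya - yb)

print(round(key_distance("A", "Q"), 2))  # 1.03
print(round(key_distance("A", "P"), 2))  # 8.81
```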

Given a standardized keystroke impulse, the extent of this deformation is determined by several factors, but is governed primarily by the Young's modulus of the PCB, backplate, and chassis, by how these components transfer vibrations between one another, and by the damping introduced at contact points with any external object: the desk, a user's fingers or palms (through prolonged contact), or even a wrist rest. Since typing faster necessitates higher force throughput and quicker finger rebounding, it stands to reason that at the upper end of the wpm distribution, the shape of a keystroke's impulse begins to asymptotically approach that of an ideal impulse.

This is not to say that the ideal impulse is the ideal keystroke, however—the ideal keystroke is the ideal impulse with respect to the keyboard.

Effectively, the keyboard is an externally excited speaker. Under the fingertips of the right person, it becomes an instrument.

Data Collection

Typing is the result of a central nervous system trained to execute a fine motor skill that demands precision, accuracy, and speed.

Whereas a novice typist's nervous system treats words on a screen as a series of distinct keypresses to be queued one at a time, a speed typist's nervous system has been trained in coarticulation over hundreds of hours, to the point where the queue becomes chunked: first into bigrams, trigrams, and quadgrams (is there such a thing?), then eventually into words, word pairings, and so on. The brain effectively preloads these paths for the fingers to navigate, with increasing involvement of working memory, while optimizing for speed, accuracy, and even economy of motion. Unfortunately, there are no widely available data on these topics.

The nature of this research therefore necessitated generating data myself, where I had complete control over the environment, microphones, and typing subject matter.

I decided to use randomly generated top-k word tests through MonkeyType.com for data collection. Unlike quote transcription tests, this approach is standardizable and repeatable, and should effectively scramble the chunking mechanism mentioned earlier since it is purely artificial data. In other words: the top-k test effectively represents the typing equivalent of Gaussian noise, whereas a quote transcription test represents sheet music.

In my analysis I looked at individual samples of 48 kHz audio from the following typing test, which had a result of 175 wpm at 99.66% accuracy.

$T_{1}$ : Korg CM-300 Vibration Sensor

$T_{2}$, $T_{3}$ : Sony ECM-44B Omnidirectional Lavalier Microphones
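For a sense of scale, the 175 wpm result and 48 kHz sample rate can be converted into a mean keystroke rate and a sample budget per keystroke. A minimal sketch, assuming the common convention that one "word" equals five characters (spaces included); the function names are hypothetical.

```python
def keystrokes_per_second(wpm, chars_per_word=5):
    """Mean keystroke rate, assuming a fixed number of characters per word."""
    return wpm * chars_per_word / 60.0

def samples_per_keystroke(wpm, sample_rate_hz=48_000, chars_per_word=5):
    """Average number of audio samples available per keystroke."""
    return sample_rate_hz / keystrokes_per_second(wpm, chars_per_word)

rate = keystrokes_per_second(175)
print(f"{rate:.2f} keystrokes/s")                         # 14.58 keystrokes/s
print(f"{1000 / rate:.1f} ms mean interval")              # 68.6 ms mean interval
print(f"{samples_per_keystroke(175):.0f} samples per keystroke")  # 3291 samples per keystroke
```

These are averages over the whole test; individual bursts can be considerably faster than the mean interval suggests.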

Waveform Analysis

The scope of difficulty in working with real-world data at these speeds becomes evident when we look at a breakdown of the first 250 milliseconds of data from the previous clip.

Figure 4: The first 250 ms of emissions captured from the three-mic straddle configuration.

With only 5 keystrokes, there are 7 major spikes in the vibration sensor's data, as well as several minor transients. The acoustic emanations ($T_{2}$ and $T_{3}$) showcase a multitude of variable frequencies that resemble the cursive emissions function in Figure 2. Without the well-defined vibrational microstructures in $T_{1}$, it would likely be impossible to determine true keystroke onset.
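To illustrate what onset detection on a vibration channel like $T_{1}$ might involve, here is a minimal sketch of a frame-energy detector run on a synthetic signal. It is not the method used in this research: the window length, threshold, rise ratio, refractory period, and the synthetic decaying-sinusoid keystrokes are all assumptions, and real data with 7 major spikes for 5 keystrokes would be far less forgiving.

```python
import math

def frame_rms(signal, start, win):
    """Root-mean-square level of one frame of the signal."""
    frame = signal[start:start + win]
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def detect_onsets(signal, sample_rate_hz, win_s=0.002, threshold=0.1,
                  rise_ratio=2.0, refractory_s=0.02):
    """Return sample indices of frames whose energy jumps sharply upward."""
    win = max(1, int(round(win_s * sample_rate_hz)))
    refractory = int(round(refractory_s * sample_rate_hz))
    onsets, prev_rms, last = [], 0.0, -refractory
    for start in range(0, len(signal) - win + 1, win):
        rms = frame_rms(signal, start, win)
        # An onset is a loud frame that is much louder than the frame before it.
        rising = rms >= threshold and rms > rise_ratio * prev_rms
        if rising and start - last >= refractory:
            onsets.append(start)
            last = start
        prev_rms = rms
    return onsets

def synth(onset_samples, n, sample_rate_hz=8000, freq_hz=1000.0, tau_s=0.01):
    """Synthetic test signal: one decaying sinusoid per keystroke (assumed shape)."""
    out = [0.0] * n
    for s0 in onset_samples:
        for i in range(s0, n):
            t = (i - s0) / sample_rate_hz
            out[i] += math.exp(-t / tau_s) * math.sin(2 * math.pi * freq_hz * t)
    return out

# Three overlapping keystrokes, 40 ms apart, at an 8 kHz sample rate.
signal = synth([400, 720, 1040], n=1600)
print(detect_onsets(signal, 8000))  # [400, 720, 1040]
```

The detector works here because the synthetic keystrokes are clean and identical. Secondary spikes such as bottom-out and key release would trigger it just as readily as a true onset, which is the difficulty described above.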

Ultimately, more research is needed to determine whether keystroke information can be recovered under these conditions.

Written June 2026