Building Soundbites
If you want to play around with this, you can do so here: https://soundbites.uafed.com/
I’ve been thinking of starting this new project, tentatively named “Soundbites”. Admittedly, my main use case for this is quite specific and the tool itself might not even be necessary for what I need.
Basically, I recently purchased a book called “German in 3 Months” by Sigrid-B. Martin, published by DK Hugo. So far, I think the book is pretty good (though I am an absolute beginner, so feel free to take whatever I say with a grain of salt). I remember looking into other German books in the language section of bookstores, yet they commonly seem to:
- Lack a substantial number of exercises of various types, or
- Not provide much context, i.e., grammar explanations.
The book gives a link to a free downloadable app, which provides a set of .mp3 files should you want to listen to them. Each audio file strings together snippets of pronunciations of German words, phrases, or lines, with a pause separating each snippet.
Now, in addition to this book, I also use Anki as a practice tool, and I’d like to integrate these audio files to help me solidify the pronunciation. So, I started thinking of building a tool that can take these audio files, find the points in time at which speech occurs, and let the user input the text for each corresponding snippet. This can then be used to generate an Anki-importable CSV containing the data for each instance. An example row is below:
der Tisch / die Tische, table/s [audio: der_Tisch_die_Tische.mp3]
I say “snippet/s” since the audio files can potentially contain different variants of a word, e.g., the singular and the plural, an adjective and its antonym, etc. Because of this, I also want to be able to control the exact rows the tool generates.
So if it detects speech at 00:00 to 00:02 and at 00:03 to 00:05, I, as a user, can either output one row for each of those snippets with the German + English labeling I specified, or group the two words into one row, perhaps with a draggable UI or something. The CSV and the audio files can then be downloaded as one .zip file for importing into Anki.
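To make that grouping concrete, here’s a rough sketch of how user-chosen groups of snippet indices could be merged into rows. The `Region`/`Row` shapes and the `groupRegions` name are made up for illustration, not from the actual Soundbites code:

```typescript
// Hypothetical shapes for illustration.
interface Region {
  start: number; // seconds
  end: number;   // seconds
}

interface Row {
  regions: Region[]; // one or more snippets merged into a single Anki row
  label: string;     // e.g. "der Tisch / die Tische, table/s"
}

// Group detected regions into rows according to user-chosen index groups,
// e.g. groups = [[0, 1]] merges the first two snippets into one row.
function groupRegions(
  regions: Region[],
  groups: number[][],
  labels: string[],
): Row[] {
  return groups.map((indices, i) => ({
    regions: indices.map((idx) => regions[idx]),
    label: labels[i],
  }));
}
```

So for the example above, passing `[[0, 1]]` would produce a single row covering both the 00:00–00:02 and 00:03–00:05 snippets.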
Just to keep things simple, I’ll focus on generating CSV files for now instead of integrating with something like AnkiConnect.
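As a sketch of what that CSV generation could look like (the row shape and function names here are my own inventions), note that Anki’s actual syntax for an audio field is `[sound:file.mp3]`:

```typescript
// Hypothetical row shape for illustration.
interface CsvRow {
  german: string;
  english: string;
  audioFile: string;
}

// Quote a field so commas or quotes inside labels don't break the CSV.
function csvEscape(field: string): string {
  return `"${field.replace(/"/g, '""')}"`;
}

// Build an Anki-importable CSV, one line per row. Anki plays audio
// referenced with its [sound:...] syntax.
function buildCsv(rows: CsvRow[]): string {
  return rows
    .map((r) =>
      [r.german, r.english, `[sound:${r.audioFile}]`].map(csvEscape).join(","),
    )
    .join("\n");
}
```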
In terms of the implementation: my initial idea, and what I’ve set up as a project so far, is a React frontend (built with Next.js, but planning to use it just to generate static sites) and a Rust “backend”. At this point, it’s really just a service, since I don’t need features such as persistence or auth for now. Plus, this gives me an excuse to use Rust for a project.
Initial Design
As a small update, I’ve started the initial design process in Figma for how I envision the user flow for Soundbites playing out.
I’d prefer the UI to be as minimal and simple as possible: I should be able to quickly drag and drop a file, label each region, and download the result. With that said, I’m trying to incorporate the idea that the user doesn’t need to see all possible actions at once, so the actions will be structured (at least for now) as a multistep form.
This is the initial design, quickly done in Figma, of step 1: it should just be a simple drag-and-drop area for uploading files, or a way to manually pick a file to upload.

Step 2 is likely the most complex on the UI side, as I’d need a way to show not only waveforms but also regions within them that are both highlighted and interactive. The initial idea is to make each region clickable, opening a popup box for entering the details:

Another option would be to also show a row of items below the waveform, one for each snippet. Highlighting a row should highlight the corresponding waveform region, and vice versa.
Implementation
I’ve actually backpedaled on my previous tech choice and switched from implementing an additional backend service to doing it all with the Web Audio API. It’s cool since you can essentially implement the silence detection entirely in the client. The main hurdle is implementing the region detection in such a way that it doesn’t cause the UI to hang or lock up. This is where Web Workers come in: they are essentially a way to offload heavy, CPU-intensive computations onto another thread.
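For the detection itself, a naive amplitude-threshold approach could look like the sketch below. The thresholds and the `detectSpeech` function are my own illustration, not the app’s actual code; in practice this would run inside the worker on the `Float32Array` you get from `AudioBuffer.getChannelData`:

```typescript
interface SpeechRegion {
  start: number; // seconds
  end: number;   // seconds
}

// Naive silence detection: walk the samples and collect stretches where the
// absolute amplitude stays above a threshold; a long-enough run of quiet
// samples ends the current speech region.
function detectSpeech(
  samples: Float32Array,
  sampleRate: number,
  amplitudeThreshold = 0.01,
  minSilenceSeconds = 0.3,
): SpeechRegion[] {
  const minSilenceSamples = minSilenceSeconds * sampleRate;
  const regions: SpeechRegion[] = [];
  let regionStart = -1; // sample index where current speech began; -1 = in silence
  let silentRun = 0;    // consecutive below-threshold samples

  for (let i = 0; i < samples.length; i++) {
    const loud = Math.abs(samples[i]) > amplitudeThreshold;
    if (loud) {
      if (regionStart < 0) regionStart = i;
      silentRun = 0;
    } else if (regionStart >= 0) {
      silentRun++;
      if (silentRun >= minSilenceSamples) {
        // Close the region where the silence began.
        regions.push({
          start: regionStart / sampleRate,
          end: (i - silentRun + 1) / sampleRate,
        });
        regionStart = -1;
        silentRun = 0;
      }
    }
  }
  // Flush a region still open at the end of the buffer.
  if (regionStart >= 0) {
    regions.push({
      start: regionStart / sampleRate,
      end: (samples.length - silentRun) / sampleRate,
    });
  }
  return regions;
}
```

A real implementation would probably want something smarter (e.g. RMS over windows rather than per-sample amplitude), but this captures the idea.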
Web Workers
Using them in a Vite project is also easy. You can even use TypeScript, as well as separate the code out into different files brought in via ESM imports. So your worker script can look like this:
```typescript
import JSZip from "jszip"; // third-party ESM imports work inside the worker

self.onmessage = (event: MessageEvent) => {
  // ... do stuff
};
```
And in your React component:
```typescript
import MyWorker from "./my-worker-script.js?worker"; // note the ?worker suffix

const myWorker = new MyWorker();

function MyComponent() {
  const handleClick = () => {
    // Set the handlers before posting; also note it's `onerror`, not `onError`.
    myWorker.onmessage = (event) => { /* handle the result */ };
    myWorker.onerror = (err) => { /* handle the error */ };
    myWorker.postMessage(/* payload */);
  };
  // ...
}
```
Still, you’d need to show a loading indicator while the processing is being done. You can manage this state in React, e.g., set some isLoading state once you call postMessage and clear it once the onmessage callback fires. Another useful way to manage the state of “asynchronous” workloads like this is to use something like TanStack Query. You’ll have to convert the flow of the worker to use a Promise for it to work, but it essentially boils down to:
```typescript
const { isPending, ...myMutation } = useMutation({
  mutationFn: () =>
    new Promise((resolve, reject) => {
      myWorker.onmessage = (event) => resolve(event.data);
      myWorker.onerror = (err) => reject(err);
      myWorker.postMessage(/* payload */);
    }),
});
```
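Rather than wiring up the Promise inside every mutationFn, a small helper can wrap one request/response round trip. This is a sketch of my own; the `WorkerLike` interface and the `runWorker` name aren’t from any library:

```typescript
// Minimal structural type so the helper works with anything worker-shaped.
interface WorkerLike {
  postMessage(message: unknown): void;
  onmessage: ((event: { data: unknown }) => void) | null;
  onerror: ((error: unknown) => void) | null;
}

// Wrap one request/response round trip with the worker in a Promise,
// suitable for use as a TanStack Query mutationFn.
function runWorker<T>(worker: WorkerLike, message: unknown): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    worker.onmessage = (event) => resolve(event.data as T);
    worker.onerror = (error) => reject(error);
    worker.postMessage(message);
  });
}
```

With that, the mutation above becomes `mutationFn: () => runWorker(myWorker, payload)`. One caveat: since handlers are reassigned per call, this assumes only one request is in flight at a time, which fits the single-file flow here.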