If a recording of someone’s very rare voice is representable by mp4 or whatever, could monkeys typing out code randomly exactly reproduce their exact timbre+tone+overall sound?
I don’t get how we can get rocks to think + exactly transcribe reality in the ways they do!
Edit: I don’t get how audio can be fossilized/reified into plaintext
Yes, monkeys could type out the zeros and ones. In fact we (not the monkeys) kind of did. There is a library of babel for audio named the sound library of babel which contains every 15 seconds audio recording you can imagine. Every single one. Almost all of them are white noise, but still there are recordings of every human saying any words in 15 seconds.
I call bullshit on that. Every second there are 44100 samples of 8 bit, so every second of sound is 44100 bytes, or 44kB. Even 1 second of audio is impossible to generate all possibilities.
To put this in perspective, there’s something called a Universally Unique Identifier (UUID for short); one of them is 128 bits, or 16 bytes. Let’s imagine these were 1 bit long: on the second attempt at generating an id you would have a 50% chance of generating a repeated one, and by the third one a repeat is guaranteed. If we extend this to 1 byte (i.e. 256 possibilities), the second id has a 1/256 chance of repeating the first, the third has a 2/256 chance of repeating one of the first two (if those were distinct), and so on. The chance that everything so far is still unique is therefore (255/256) × (254/256) × (253/256) × …, which drops below 50% at the 20th id. Why did I do those examples? Because the same birthday-paradox math applies to a UUID: if you generated a billion random (version 4) UUIDs per second, it would take you about 86 years to have a 50% chance of having generated a repeated one, and by that time you would need around 40 EB of storage (that’s not a typo, it’s Exabytes, which is 1024 PB (which is also not a typo, that’s Petabytes which is 1024 TB, or Terabytes which is the first measure people are likely to be familiar with)).
Let me again try to put this in perspective: if Google, Amazon, Microsoft and Facebook emptied all of their storage just for this, they would have around 2 Exabytes, so you would need a company about 20x larger than that conglomerate to have enough space to store the unique ids generated from 16 bytes of random data (until you have a 50% chance of generating a repeated one).
Another way of thinking about this is that 1 bit has 2 possible combinations, 2 bits have 4, 3 bits have 8; it goes on exponentially, so that n bits have 2^n. For the UUID that is 3.4E38 combinations, and even at a single bit each that would be 3.5E13 YB (again, not a typo, that’s Yottabytes, each 1024 Zettabytes), i.e. 35000000000000 YB (I could go up a few more orders of magnitude, but I think I made my point). And this is for 128 bits; every extra bit doubles that amount.
So again, I call bullshit that they have all possible sounds for even 1 second, which is almost 3000x that size.
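For anyone who wants to check the collision numbers themselves, here is a rough sketch. The function names are made up for illustration, and the UUID figure assumes 122 random bits per id (as in a version 4 UUID), which is my assumption rather than something stated above. It uses the standard birthday-paradox estimate n ≈ sqrt(2 · ln 2 · N) for large spaces.

```python
import math

def ids_for_even_odds_of_repeat(space_size):
    """Smallest number of random ids where P(at least one repeat) >= 50%."""
    p_all_unique = 1.0
    count = 0
    while 1.0 - p_all_unique < 0.5:
        # The next id must avoid the `count` distinct ids drawn so far.
        p_all_unique *= (space_size - count) / space_size
        count += 1
    return count

def approx_ids_for_even_odds(bits):
    """Birthday-paradox approximation for a space of 2**bits values."""
    return math.sqrt(2 * math.log(2) * 2.0 ** bits)

SECONDS_PER_YEAR = 365.25 * 24 * 3600

one_byte_ids = ids_for_even_odds_of_repeat(256)
uuid_ids = approx_ids_for_even_odds(122)        # assumption: 122 random bits
years = uuid_ids / 1e9 / SECONDS_PER_YEAR       # at a billion ids per second
storage_eb = uuid_ids * 16 / 1e18               # 16 bytes per id, decimal EB

# Number of decimal digits in the count of distinct 1-second, 8-bit, 44100 Hz clips.
clip_count_digits = math.floor(8 * 44100 * math.log10(2)) + 1

print(one_byte_ids, round(years), round(storage_eb), clip_count_digits)
```

The last number is only the digit count of how many different 1-second clips exist, which is the real reason nobody can store them all.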
I appreciate the interest in doing all the math, and I am also not specifically familiar with audio or the audio library, but I believe you could use a similar argument against the OG library of babel, and I happen to know(confidently believe?) that they don’t actually have a stored copy of every individual text file “in the library”, rather each page is algorithmically generated and they have proven that the algorithm will generate every possible text.
I’d wager it’s the same thing here, they have just written the code to generate a random audio file from a unique input, and proven that for all possible audio files (within some defined constraints, like exactly 15 seconds long), there exists an input to the algorithm which will produce said audio file.
Determining whether or not an algorithm with infrastructure backing it counts as a library is an exercise left to the reader, I suppose.
The claim was it “contains every 15 seconds audio recording you can imagine. Every single one.”. Which is bullshit, that’s like saying this program contains every single literary work:
    import sys
    print(sys.argv[1])
It’s just adding a layer of encoding on top so it feels less bullshity, something like:
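    import string

    def decode(number: int) -> str:
        # Read the number as digits in base len(string.printable),
        # least significant digit first, one printable character per digit.
        out = ""
        while number:
            number, letter_index = divmod(number, len(string.printable))
            out += string.printable[letter_index]
        return out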
That also does not contain every possible (ASCII) book, it can decode any number into a text, and some numbers happen to contain texts that are readable.
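To make the "it's only an encoding" point concrete, here is a small sketch pairing that kind of decoder with its inverse. The `encode` function is my own hypothetical addition, not anything the library is known to use. One caveat of this simple scheme: a text ending in the character "0" (index 0 in `string.printable`) loses that last character on the way back.

```python
import string

ALPHABET = string.printable  # 100 printable ASCII characters

def decode(number):
    """Turn a non-negative integer into text, least significant digit first."""
    out = ""
    while number:
        number, letter_index = divmod(number, len(ALPHABET))
        out += ALPHABET[letter_index]
    return out

def encode(text):
    """Hypothetical inverse of decode: find the number that decodes to `text`."""
    number = 0
    for character in reversed(text):
        number = number * len(ALPHABET) + ALPHABET.index(character)
    return number

book = "to be or not to be"
address = encode(book)
print(address)
print(decode(address))
```

The "address" is about as long as the text itself, so looking something up means you already have it.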
Short answer: to record a sound, take samples of the sound “really really often” and store them as a sequence of numbers. Then to play the sound, create an electrical signal by converting those digital numbers to a voltage “really really often”, then smooth it, and send it to a speaker.
Slightly longer answer: you can actually take a class on this, typically called Digital Signal Processing, so I’m skipping over a lot of details. Like a lot a lot. Like hundreds of pages of dense mathematics a lot.
First, you need something to convert the sound (pressure variation) into an electrical signal. Basically, you want the electrical signal to look like how the audio sounds, but bigger and in units of voltage. You basically need a microphone.
So as humans, the range of pitches of sounds we can hear is limited. We typically classify sounds by frequency, or how often the sound wave “goes back and forth”. We can think of only sine waves for simplicity because any wave can be broken up into sine waves of different frequencies and offsets. (This is not a trivial assertion, and there are some caveats. Honestly, this warrants its own class.)
So each sine wave has a frequency, i.e. how many times per second the wave oscillates (“goes back and forth”).
I can guarantee that you as a human cannot hear any pitch with a frequency higher than 20000 Hz. It’s not important to memorize that number if you don’t intend to do technical audio stuff, it’s just important to know that number exists.
So if I recorded any information above that frequency, it would be a waste of storage. So let’s cap the frequency that gets recorded at something. The listener literally cannot tell the difference.
Then, since we have a maximum frequency, it turns out that, once you do the math, you only need to sample at a little more than twice the maximum frequency you expect to find. So for an audio track, that is at least 2 times 20000 Hz = 40000 times per second that we sample the sound. It is typically a bit higher for various technical reasons, hence why 44100 Hz and 48000 Hz sample frequencies are common.
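Here is a quick sketch of why the sample rate is tied to twice the highest frequency. The tone frequencies are picked only for illustration. A 30000 Hz cosine sampled at 44100 Hz produces the same numbers as a 14100 Hz cosine, so once sampled, the too-high tone is indistinguishable from a lower one.

```python
import math

SAMPLE_RATE = 44100  # samples per second

def sample_cosine(freq_hz, count, rate=SAMPLE_RATE):
    """Read a unit cosine's value at `count` evenly spaced sample times."""
    return [math.cos(2 * math.pi * freq_hz * n / rate) for n in range(count)]

too_high = sample_cosine(30000, 200)              # above SAMPLE_RATE / 2
alias = sample_cosine(SAMPLE_RATE - 30000, 200)   # 14100 Hz

# Largest difference between the two sample sequences (floating-point noise only).
max_difference = max(abs(a - b) for a, b in zip(too_high, alias))
print(max_difference)
```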
So if you want to record exactly 69 seconds of audio, you need 69 seconds × 44100 [samples / second] = 3,042,900 samples. Assuming space is not a premium and you store the file with zero compression, each sample is stored as a number in your computer’s memory. The samples need to be stored in order.
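As a rough sketch of "samples stored in order with zero compression", the snippet below samples a test tone and stores it as an uncompressed WAV in memory using Python's standard `wave` module. The 440 Hz tone, the 16-bit sample size and the helper names are my own choices for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second

def sample_sine(freq_hz, seconds, rate=SAMPLE_RATE):
    """Read a sine wave's value `rate` times per second as signed 16-bit integers."""
    count = round(seconds * rate)
    return [round(32767 * math.sin(2 * math.pi * freq_hz * n / rate))
            for n in range(count)]

def to_wav_bytes(samples, rate=SAMPLE_RATE):
    """Store the samples, in order and uncompressed, as a mono 16-bit WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)    # mono
        wav.setsampwidth(2)    # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()

samples = sample_sine(440.0, 1.0)
wav_bytes = to_wav_bytes(samples)

print(69 * SAMPLE_RATE)              # samples needed for 69 seconds
print(len(samples), len(wav_bytes))  # one second: sample count and file size
```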
To reproduce the sound in the real world, we feed the numbers in the order at the same frequency (the sample frequency) that we recorded them at into a device that works as follows: for each number it receives, the device outputs a voltage that is proportional to the number it is fed, until the next number comes in. This is called a Digital-to-Analog Converter (DAC).
Now at this point you do have a sound, but it generally has wasteful high frequency content that can disrupt other devices. So it needs to get smoothed out with a filter. Send this voltage to your speakers (to convert it to pressure variations that vibrate your ears which converts the signal to an electrical signal that is sent to your brain) and you got sound.
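The "hold each value, then smooth" idea can be sketched in a few lines. This is a toy model: the moving average stands in for a real reconstruction filter, and all names here are made up.

```python
def zero_order_hold(samples, repeats):
    """Hold each sample's value for `repeats` time steps (the DAC's staircase)."""
    return [value for value in samples for _ in range(repeats)]

def moving_average(signal, width):
    """Crude smoothing: average each point with up to `width - 1` points before it."""
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def largest_jump(signal):
    """Biggest change between two neighbouring points."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

samples = [0.0, 1.0, 0.0, -1.0]
staircase = zero_order_hold(samples, 4)
smoothed = moving_average(staircase, 4)

print(largest_jump(staircase), largest_jump(smoothed))
```

The staircase jumps by a full unit at each new sample, while the smoothed version only moves a quarter of that per step. That is the "wasteful high frequency content" being removed.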
Easy peazy, hundreds of pages of calculus squeezy!
could monkeys typing out code randomly exactly reproduce their exact timbre+tone+overall sound
Yes, but it is astronomically unlikely to happen before you or the monkeys die.
If you have any further questions about audio signal processing, I would be literally thrilled to answer them.
When you talk about a sample, what does that actually mean? Like I recognize that the frequency of oscillations will tell me the pitch of something, but how does that actually translate to a chunk of data that is useful?
You mention a sample being stored as a number, which makes sense, but how is that number utilized? Again assuming uncompressed, if my sample “value” comes up as 420, does that include all of the necessary components of that sound bite in a 1/44100th of a second? How would a sample at value 421 compare? Is this like a RGB type situation where you’d have multiple values corresponding to different attributes of the sample (amplitude, frequencies, and I’m sure other things)? Is a single sample actually intelligible in isolation?
When you talk about a sample, what does that actually mean?
First, the sound in the real world has to be converted to a fluctuating voltage. Then, this voltage signal needs to be converted to a sequence of numbers.
Here’s a diagram of the relationship between a voltage signal and its samples:
The blue continuous curve is the sine wave, and the red stems are the samples.
A sample is the value [1] of the signal at a specific time. So the samples of this wave were chosen by reading the signal’s value every so often.
Like I recognize that the frequency of oscillations will tell me the pitch of something, but how does that actually translate to a chunk of data that is useful
One of the central results of Fourier Analysis is that frequency information determines the time signal, and vice versa [2]. If you have the time signal, you have its frequency response; you just gotta run it through a Fourier Transform. Similarly, if you have the frequencies that made up the signal, you have the time signal; you just gotta run it through an inverse Fourier Transform. This is not obvious.
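If you want to see that round trip happen, here is a small sketch using a plain (slow) discrete Fourier transform written from its definition. The 32-sample test signal is invented for illustration, and real software would use an FFT library instead of these loops.

```python
import cmath
import math

def dft(samples):
    """Discrete Fourier transform straight from the definition (O(n^2))."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Inverse transform: rebuild the time samples from the frequency weights."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

N = 32
signal = [math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
          for t in range(N)]

spectrum = dft(signal)
rebuilt = inverse_dft(spectrum)

# The two tones show up as the two strongest bins in the lower half of the spectrum.
strongest_bins = sorted(range(N // 2), key=lambda k: abs(spectrum[k]))[-2:]
round_trip_error = max(abs(rebuilt[t] - signal[t]) for t in range(N))
print(sorted(strongest_bins), round_trip_error)
```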
Frequency really comes into play in the ADC and DAC processes because we know ahead of time that a maximum useful frequency exists. It is not trivial to prove this, but one of the results of Fourier Analysis is that you can only represent a signal with a finite number of frequencies if there is a maximum frequency above which there is no signal information. Otherwise, a literally infinite number of numbers, i.e. an infinite sequence, would be required to recover the signal. [2]
So for sampling and representing signals, the importance of frequency is really the fact that a maximum frequency exists, which allows our math to stop at some point. Frequency also happens to be useful as a tool for analysis, synthesis, and processing of signals, but that’s for another day.
You mention a sample being stored as a number, which makes sense, but how is that number utilized?
The number tells the DAC how big a voltage needs to be sent to the speaker at a given time. I run through an example below.
Again assuming uncompressed, if my sample “value” comes up as 420, does that include all of the necessary components of that sound bite in a 1/44100th of a second? How would a sample at value 421 compare?
The value of a sample with value 420 is meaningless without specifying the range that samples are living in. Typically, we either choose the range -1 to 1 for floating point calculations, or -2^(n-1) to (2^(n-1) - 1) when using n-bit integer math [7]. If designed correctly, a sample that’s outside the range will be “clipped” to the minimum or maximum, whichever is closer.
However, once we specify a digital range for digital signals to “live in”, if the signal value is within range, then yes, it does in fact contain all the necessary components [6] for that sound bite in a 1/44100th of a second.
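A minimal sketch of those ranges and of clipping. The helper names are hypothetical, and scaling a float by the largest positive integer is just one common convention that I am assuming here.

```python
def integer_range(bits):
    """Smallest and largest value of a signed `bits`-bit integer sample."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def clip(value, low, high):
    """Force a value into [low, high], snapping to whichever limit is closer."""
    return max(low, min(high, value))

def float_to_int_sample(x, bits=16):
    """Map a float sample in [-1, 1] to a signed integer sample (assumed scaling)."""
    low, high = integer_range(bits)
    return clip(round(clip(x, -1.0, 1.0) * high), low, high)

print(integer_range(8), integer_range(16))
print(float_to_int_sample(0.42), float_to_int_sample(1.5))
```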
As an example [3], let’s say that the 69th sample has a value of 0.420, or x[69]=0.420. For simplicity, assume that all digital signals can only take values between Dmin = -1 and Dmax = 1 for the rest of this comment. Now, let’s assume that the DAC can output a maximum voltage of Vmax = 5V and a minimum voltage of Vmin = -7V [4]. Furthermore, let’s assume that the relationship between the digital value and the output voltage is exactly linear, and the sample rate is 44100Hz. Then, ([69+1]/44100) seconds after the audio begins, regardless of what happened in the past, the DAC will be commanded to output a voltage Vout (calculated below) for a duration of (1/44100) seconds. After that, the number specified by x[70] will command the DAC to spit out a new voltage for the next (1/44100) seconds.
To calculate Vout, we need to fill in the equation of a line.
Vout(x) = (Vmax - Vmin) / (Dmax - Dmin) × (x - Dmin) + Vmin
Vout(x) = (5V - (-7V)) / (1 - (-1)) × (x - (-1)) + (-7V)
Vout(x) = 6(x + 1) - 7 [V]
Vout(x) = 6x + 6 - 7 [V]
Vout(x) = 6x - 1 [V]
As a check,
Vout(Dmin) = Vout(-1) = 6×(-1) - 1 = -7V = Vmin ✓
Vout(Dmax) = Vout(1) = (6×1) - 1 = 5V = Vmax ✓
At this point, with respect to this DAC I have “designed”, I can always convert from a digital number to an output voltage. If x>1 for some reason, we output Vmax. If x<-1 for some reason, we output Vmin. Otherwise, we plug the value into the line equation we just fitted. The DAC does this for us 44100 times per second.
For the sample x[69]=0.420:
Vout(x[69]) = 6•x[69] - 1 [V] = 6×0.420 - 1 = 1.520V.
A sample value of 0.421 would yield Vout = 1.526V, a difference of 6mV from the previous calculation.
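The same toy DAC, written out as a sketch you can poke at. The function name is made up; the numbers are the ones from the example above, with out-of-range inputs clipped to the digital limits before the line equation is applied.

```python
D_MIN, D_MAX = -1.0, 1.0   # digital range
V_MIN, V_MAX = -7.0, 5.0   # output voltage range of the example DAC

def dac_output_voltage(x):
    """Map a digital sample to a voltage with a straight line, clipping first."""
    x = max(D_MIN, min(D_MAX, x))
    return (V_MAX - V_MIN) / (D_MAX - D_MIN) * (x - D_MIN) + V_MIN

v_420 = dac_output_voltage(0.420)
v_421 = dac_output_voltage(0.421)
print(v_420, v_421, v_421 - v_420)
print(dac_output_voltage(-1.0), dac_output_voltage(1.0), dac_output_voltage(3.0))
```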
And how does changing a sample from 0.420 to 0.421 affect how it’s going to sound? Well, if that’s the only difference, not much. They would sound practically (but not theoretically) identical. However, if you compare two otherwise identical tracks except that one is rescaled by a digital 1+0.001, then the track with the 1+0.001 rescaling will be very slightly louder. How slight really depends on your speaker system.
I have used a linear relationship because:
- That’s what we want as engineers.
- This is usually an acceptable approximation.
- It is easy to think about.
However, as long as the relationship between the digital value and the output voltage is monotonic (only ever goes up or only ever goes down), a designer can compensate for a nonlinear relationship. What kinds of nonlinearities are present in the ADC and DAC (besides any discussed previously) differ by the actual architecture of the ADC or DAC.
Is this like a RGB type situation where you’d have multiple values corresponding to different attributes of the sample (amplitude, frequencies, and I’m sure other things)?
Nope. R, G, and B can be adjusted independently, whereas the samples are mapped [5] one-to-one with frequencies. Said differently: you cannot adjust sample values and frequency response independently. Said another way: samples carry the same information as the frequencies. Changing one automatically changes the other.
Is a single sample actually intelligible in isolation?
Nope. Practically, your speaker system might emit a very quiet “pop”, but that pop is really because the system is being asked to quickly change from “no sound” to “some sound” a lot faster than is natural.
Hope this helps. Don’t hesitate to ask more questions 😊.
[1] Actually, the sample is ideally proportional to the value of the signal, what is termed a (non-dynamic) linear relationship, which is the best you can get with DSP because digital samples have no units! In real life, it could be some non-linear relationship with the voltage signal, especially if the device sucks.
[2] Infinite sequences are perfectly acceptable for analysis and design purposes, but to actually crunch numbers and put DSP into practice, we need to work with finite memory.
[3] Sample indices typically start at 0 and must be integers.
[4] Typically, you’ll see either a range of [0, something] volts or [+something, -something] volts, however to expose some of the details I chose a “weird” range.
[5] If you’ve taken linear algebra: conceptually, the discrete Fourier Transform, i.e. transforming a set of samples into its frequencies, works by stacking the samples into a tall column vector, then multiplying it by a DFT matrix to get a new vector, representing the weights of the frequencies you need to add to get back the original signal. (The FFT that computers actually run is a fast algorithm for computing that same product.) The DFT matrix is invertible, meaning that there exists a unique matrix that undoes whatever changes the DFT matrix can possibly make. All Fourier Transforms are invertible, although the continuous Fourier Transform is too “rich” to be represented as a matrix product.
[6] I have assumed for simplicity that all signals have been mono, i.e. one speaker channel. However, musical audio usually has two channels in a stereo configuration, i.e. one signal for the left and one signal for the right. For stereo signals, you need two samples at every sample time, one from each channel at the same time. In general, you need to take one sample per channel that you’re working with. Basically, this means just having two mono ADCs and DACs.
[7] Why 2^n and not 10^n ? Because computers work in binary (base 2), not decimal (base 10).
While I will be the first person to admit that I don’t completely understand either, the concept is pretty simple.
Think of a digital recording being something like sheet music. Sheet music is a set of instructions on how to play a song that anyone who knows how to read music can reproduce.
Digital recordings work in a similar fashion. The playback device reads the instructions, which include things like frequency and volume, and is able to use that information to make a perfect playback of the digital recording.
Could it be recreated by random chance? Sure. Would it? Probably not. At least not easily.
Long list of numbers in sequence. Each represents how far away from equilibrium the speaker cone should be, at each point in time, as it vibrates back and forth.
I just think it’s crazy I can record a random recording right now or me speaking and that can be stored in what must ultimately be good old-fashioned plaintext or whatever.
Like, that’s a rock thinking and turning sound right into stone, wayyyyy more impressive and beneficial than alchemy turning lead into gold
Yes digital media, and computers in general, are miracles of science and engineering. Is there some reason digital audio in particular inspires you in this way, as opposed to digital images?
It doesn’t get encoded into plaintext. First, the microphone picks up the sounds, and outputs values for frequencies and intensities. Recording software takes those values, and compresses them down into binary data. Then that binary data is saved onto storage. Depending on your storage, it’s then stored magnetically (cassette, floppy, HDD) or as a “lockable” logic gate (USB, SSD) or as laser etched dots and dashes (CD/DVD)
It’s not getting turned into rocks, it’s getting written on media.
Also, some numbers for scale…
My computer has 3.5 GHz processors. Each core can run on the order of 3.5 billion instructions every second. To put that in perspective, the smallest unit of time humans can perceive is ~13ms. That processor can run ~45 million instructions in that time frame. Computers perform very simple tasks, extremely quickly, and it gives the impression of intelligence.
It’s funny that human perception seems to be anecdotally tied to double digit milliseconds when if you ask any drummer or guitar player about input latency they’ll tell you that the absolute maximum round trip latency to be able to enjoy playing the instrument is in the range of 5ms.
Only once latency dips under 5ms does it start feeling “right”. Personally, I groan when I have to use anything over 3ms with my guitar as the second I hit high tempos the latency is unbearable.
Below 3ms it gets very hard to say that you can feel a difference. 16th notes at 250bpm with 5ms latency has you approaching 10% of the note separation time. It’s 100% perceivable.
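The arithmetic behind that last figure, as a tiny sketch. It assumes the beat is a quarter note, so there are four 16th notes per beat.

```python
def note_spacing_ms(bpm, notes_per_beat):
    """Time between consecutive notes in milliseconds."""
    return 60_000 / bpm / notes_per_beat

spacing = note_spacing_ms(250, 4)   # 16th notes at 250 bpm
latency_fraction = 5 / spacing      # 5 ms latency as a share of that spacing

print(spacing, round(latency_fraction * 100, 1))
```

So 5 ms is a bit over 8% of the gap between notes, which fits "approaching 10%".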
It’s kind of apples to oranges. Smoothness or variance is noticeable above discrete human ‘limits’. For a variety of reasons.
With music you have multiple types of feedback.
But how can it capture perfectly my exact voice or the exact timbre of whatever stuff is playing? Like, it’s mind-blowing to me and I have nothing I can analogize it to. It’s incredible we can even take pictures with pixels; sound is just a whole notha level that astounds me
Maybe it helps to know that it can’t perfectly capture your voice. It can get close enough no human can tell the difference, but it’s still not perfect. First of all it has a sampling rate. To make this more understandable let’s think of a sample rate of 1 sample per second. Think of two speakers playing at the same time. One is playing your favorite song, the other is playing the exact note of that song for one second each and only changing notes every second. It’s going to somewhat mimic your song, but it’s going to be terrible. Now imagine that second speaker makes 4 samples every second, now it’s playing your song a quarter of a second at a time. Sounds a lot more like your song, but in the same way stop motion looks a lot like movement but isn’t right. Now up that sample rate to 100s or thousands of samples a second, now you’re getting to the point you can’t tell the difference, but it still can’t be perfect, because it’s still based on a sample rate.
If you can grok pictures from pixels, you can picture the same thing. If you averaged a picture out to one giant pixel, it’s unrecognizable, 4,8,16 pixels, maybe a simple icon starts to approximate into something recognizable. That little icon in your browser tab is usually 32 x 32 pixels. 1024 pixels total, and we barely consider that an image. It’s all about pixel count (sample rate). When you zoom in, you find that it’s not perfect, you always get to the point of individual pixels, unlike optical zoom where you can zoom almost indefinitely as long as you can collect enough light.
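To put rough numbers on the "more samples, closer match" thought experiment, here is a sketch under the crude "hold the last sample until the next one" model described above. The 440 Hz tone and the three rates are arbitrary picks, and a real DAC's smoothing filter does better than this staircase.

```python
import math

def worst_hold_error(freq_hz, rate, seconds=0.1, fine_steps=50):
    """Largest gap between a unit sine and its hold-the-last-sample staircase."""
    worst = 0.0
    for n in range(int(seconds * rate)):
        held = math.sin(2 * math.pi * freq_hz * n / rate)
        # Compare the held value against the true wave between sample n and n + 1.
        for j in range(fine_steps):
            t = (n + j / fine_steps) / rate
            worst = max(worst, abs(math.sin(2 * math.pi * freq_hz * t) - held))
    return worst

errors = [worst_hold_error(440, rate) for rate in (1000, 8000, 44100)]
print(errors)
```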
Everything about the exact timbre of your voice is captured in the waveform that represents it. To the extent that the sampling rate and bit depth are good enough to mimic your actual voice without introducing digital artefacts (something analogous to a pixelated image) that’s all it takes to reproduce any sound with arbitrary precision.
Timbre is the result of having a specific set of frequencies playing simultaneously, that is characteristic of the specific shape and material properties of the object vibrating (be it a guitar string, drum skin, or vocal cords).
As for how multiple frequencies can “exist” simultaneously at a single instant in time, you might want to read up on Fourier’s theorem and watch 3Blue1Brown’s brilliant series on differential equations that explores Fourier series https://www.youtube.com/watch?v=spUNpyF58BY
It doesn’t get your exact voice. Your speech gets compressed into digital “steps” that closely mimic the continuous “analog” output of your voice.
A microphone is a membrane attached to a means to generate electricity (like shaking wires around a magnet). When you make sound by a mic you shake the membrane and it in turn generates a small amount of electricity.
This electricity is an analog signal (it’s continuous, and the exact amount changes over time). We can take that signal and digitize it (literally chop it up into distinct digits) by using an ADC or analog to digital converter. Essentially an ADC takes a snapshot of the analog signal at a specific point in time, and repeats that snapshot process very quickly. If you take enough snapshots fast enough you can have a reasonable approximation of the original signal (like following a dotted line).
Now we have a digital signal and we can store those series of snapshots in a file.
But how do we turn that back into sound? We literally just follow the process in reverse.
We open the file and get the list of snapshots. We pass those to a DAC or digital to analog converter that generates a continuous analog signal that passes through every original point. We pass that signal to thin wire wrapped around a magnet and attached to a membrane. This mechanism takes the small generated electric field from the DAC and causes the membrane to shake in the same pattern that the mic originally shook in.
In practice there are often other steps in line such as amps to increase the strength of a signal or compression to minimize how much space the snapshots take up.
Edit: I don’t get how audio can be fossilized/reified into plaintext
https://en.wikipedia.org/wiki/Analog-to-digital_converter#Explanation
Basically sound is a change in air pressure, and we record that pressure value thousands of times a second. That’s basically a bunch of numbers, and how rocks/electricity represents that is ones and zeroes (binary).
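A small sketch of one pressure reading turning into ones and zeroes. It assumes 16-bit signed samples in two's complement, and the function names are made up.

```python
def sample_to_bits(sample, bits=16):
    """Write a signed integer sample as a two's-complement bit string."""
    return format(sample & (2 ** bits - 1), f"0{bits}b")

def bits_to_sample(text):
    """Read a two's-complement bit string back into a signed integer."""
    value = int(text, 2)
    if value >= 2 ** (len(text) - 1):
        value -= 2 ** len(text)
    return value

print(sample_to_bits(420))
print(sample_to_bits(-1))
print(bits_to_sample(sample_to_bits(-12345)))
```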
Usually that data then gets compressed by using lots of smart maths. When you play that sound file, all that work is done backwards and your speakers produce the necessary pressure changes to make the sound.
Monkeys could randomly produce a perfect human sentence if they typed random stuff into a text file and it got converted appropriately. It’s just insanely unlikely.