My advisor and I published a paper called "Evaluating generative audio systems and their metrics" at ISMIR 2022, where we examined a broad set of problems with the evaluation of neural audio synthesis techniques:
- What are the metrics?
- Do any of them line up with perception?
- How do we evaluate these systems?
While running the experiments needed to generate sounds and compute their corresponding metrics, I noticed that the tooling for extracting metrics was poor. As an example, to compute the Fréchet Audio Distance (FAD), I'd have to do the following (the final computation itself is sketched just after this list):
- Decide whether I'm using TensorFlow (the reference implementation) or PyTorch (everyone's favorite framework)
- Set up Google's entire research repo
- Spend time setting up the environment for one directory
- Break everything because of TensorFlow and NumPy version conflicts
- Fix everything somehow
- Find that I now have to set up the VGGish model too?
- Spend time setting up the environment for another directory
- Once VGGish is set up, I can finally extract the embeddings
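The irony is that once you have the embeddings, the metric itself is tiny. Here's a minimal NumPy/SciPy sketch of that last step: the Fréchet distance between Gaussians fit to two sets of embeddings. This is just an illustration (the array names and the numerical shortcuts are mine), not the reference implementation:

```python
import numpy as np
from scipy import linalg


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings.

    emb_a, emb_b: arrays of shape (num_examples, embedding_dim),
    e.g. VGGish embeddings of the reference and generated audio.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * (cov_a @ cov_b)^(1/2))
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        # numerical noise can produce tiny imaginary parts; drop them
        covmean = covmean.real

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

All of the setup above goes into producing those two embedding arrays; the distance itself is a handful of lines.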
Broadly, this is a terrible way to extract metrics, and it's far behind the curve compared to the tooling available for evaluating things like images and text. For instance, you can just install torchmetrics and use it directly to evaluate your models (which is great!), and while torchmetrics does come with a built-in set of audio metrics, the selection isn't exhaustive.
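To make the comparison concrete, here's what computing a metric with torchmetrics looks like, using one of its built-in audio metrics and random tensors standing in for real audio:

```python
import torch
from torchmetrics.audio import ScaleInvariantSignalDistortionRatio

# Random tensors standing in for a batch of generated and reference waveforms.
preds = torch.randn(8, 16000)
target = torch.randn(8, 16000)

si_sdr = ScaleInvariantSignalDistortionRatio()
print(si_sdr(preds, target))
```

No research repos to clone, no second environment to set up.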
So, I decided to build a toolkit that lets me extract metrics from audio files in a way that is easy to use and easy to extend. It's available on GitHub and will be on PyPI soon (check back in a few days).
In the meantime, it supports the following metrics:
| Abbreviation | Metric |
| --- | --- |
| FAD | Fréchet Audio Distance |
| KID | Kernel Inception Distance |
| PEAQb | Basic PEAQ |
| NDB/k | Number of Different Bins over k |
| SISDR | Scale-Invariant SDR |
| SNR | Signal-to-Noise Ratio |
| MAE | Mean Absolute Error |
| MSE | Mean Squared Error |
| KL | Kullback-Leibler Divergence |
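To give a flavour of what these compute, here's the standard formulation of SI-SDR from that list, where $s$ is the reference signal, $\hat{s}$ is the estimate, and $\alpha$ is the rescaling that makes the measure invariant to gain:

$$
\mathrm{SI\text{-}SDR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2},
\qquad \alpha = \frac{\hat{s}^{\top} s}{\lVert s \rVert^2}
$$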
It also has a cool Python port of PEAQ!
I'm still working on adding documentation, adding more metrics, and improving the code quality. If you have any suggestions, please feel free to open an issue on GitHub!
Looking forward to you using the toolkit!