Paper Reading - A Universal Music Translation Network, from Facebook AI Research

Bryan Wang
3 min read · May 25, 2018

A few days ago, Facebook AI Research published a paper on arXiv proposing a framework for universal music domain transfer. It always draws people’s attention when you claim your work is “universal”, so let’s check it out! Before we dig into the paper, let’s first watch the sample results on YouTube.

Pretty cool, huh? But wait.

While it is true that they have produced high-quality music domain transfer, they are over-claiming the power of their model for music translation. To be precise, what they have done is make a song sound as if it were performed by a different configuration of instruments. For example, they converted a piece of Bach’s organ music into a generated piece that sounds like Beethoven’s piano music. The melody and chords remain, but the texture of the sound (the instrumentation) differs.

So why do I say they over-claimed their model? Because the results are closer to timbral-texture transfer (I learned this term from someone’s Facebook post) than to music style or domain transfer. In the example above, they were producing new clips that sound like Beethoven’s piano timbre-wise, rather than transforming the composition into Beethoven’s style.

Despite the over-claim, this is still an interesting work. They adopted a WaveNet auto-encoder (AE) architecture similar to the one used in the NSynth project from Google Magenta. WaveNet is an autoregressive model that processes raw audio, predicting the next audio sample from the previously generated ones.

Fig. 1 WaveNet auto-encoder model from NSynth
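To make the autoregressive idea concrete, here is a minimal sketch of a WaveNet-style stack of dilated causal convolutions in PyTorch. It is illustrative only: the class name, layer sizes, and quantization settings are my own assumptions, not the architecture used by NSynth or by this paper.

```python
# Minimal sketch of the autoregressive idea behind WaveNet (not the paper's code):
# a stack of dilated causal 1-D convolutions predicts a distribution over the
# next audio sample given all previous samples.
import torch
import torch.nn as nn

class TinyWaveNet(nn.Module):
    def __init__(self, quantization_channels=256, hidden=64, layers=6):
        super().__init__()
        self.embed = nn.Embedding(quantization_channels, hidden)
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden, hidden, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        ])
        self.out = nn.Conv1d(hidden, quantization_channels, kernel_size=1)

    def forward(self, x):  # x: (batch, time) of mu-law encoded sample indices
        h = self.embed(x).transpose(1, 2)                 # (batch, hidden, time)
        for conv in self.convs:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            h = torch.relu(conv(nn.functional.pad(h, (pad, 0))))  # left (causal) padding
        return self.out(h)                                # logits for the next sample
```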

They modified the 1-to-1 auto-encoder to suit the purpose of domain transfer: one universal encoder compresses audio from different domains into domain-invariant latent codes, and k decoders generate audio for each domain i, where i = 1, 2, …, k.

Fig. 2 Model used in this paper
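Here is a rough sketch of that one-encoder / k-decoder layout. The real model uses WaveNet-style components; I substitute placeholder convolutional modules just to show how one shared latent representation feeds k domain-specific decoders.

```python
# Sketch of the shared-encoder / per-domain-decoder layout (placeholder modules,
# not the paper's WaveNet encoder and decoders).
import torch.nn as nn

class MusicTranslationNet(nn.Module):
    def __init__(self, n_domains, latent_dim=64):
        super().__init__()
        # shared, domain-invariant encoder producing a downsampled latent sequence
        self.encoder = nn.Sequential(
            nn.Conv1d(1, latent_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=2, padding=1),
        )
        # one decoder per domain i = 1..k (stand-ins for WaveNet decoders)
        self.decoders = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose1d(latent_dim, latent_dim, 4, stride=2, padding=1),
                nn.ReLU(),
                nn.ConvTranspose1d(latent_dim, 1, 4, stride=2, padding=1),
            )
            for _ in range(n_domains)
        ])

    def forward(self, audio, target_domain):
        z = self.encoder(audio)                  # domain-invariant latent codes
        return self.decoders[target_domain](z)   # render the audio in the chosen domain
```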

Then we may wonder: how did they manage to create a universal encoder for all domains? The trick is adversarial training between the AE and a Domain Classification Network (DCN), shown in Fig. 2. The DCN is trained to classify the original domain of the input latent representation. By adding an adversarial term to the AE loss, the competition between the AE and the DCN forces the encoder to discard domain information and compress the inputs into domain-invariant latent representations.
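A hedged sketch of that adversarial loop, reusing the toy encoder/decoder layout from the previous snippet: `augment` (the random pitch shift), `domain_classifier` (a small network mapping pooled latent codes to domain logits), and both optimizers are assumed to exist; the MSE term is a stand-in for the WaveNet likelihood, and `lambda_adv` is an illustrative weight (the paper’s λ, introduced below).

```python
import torch
import torch.nn.functional as F

def training_step(audio_j, j, model, domain_classifier, opt_ae, opt_dcn,
                  lambda_adv=0.01):
    """One update on a batch `audio_j` drawn from a single domain j (a simplification)."""
    augmented = augment(audio_j)                        # O(s, r): random pitch shift
    labels = torch.full((audio_j.size(0),), j, dtype=torch.long)

    # 1) train the DCN to recover the domain from the latent codes
    z = model.encoder(augmented).detach()               # don't backprop into the encoder
    dcn_loss = F.cross_entropy(domain_classifier(z.mean(dim=-1)), labels)
    opt_dcn.zero_grad()
    dcn_loss.backward()
    opt_dcn.step()

    # 2) train the AE: reconstruct the original audio while fooling the DCN
    z = model.encoder(augmented)
    recon_loss = F.mse_loss(model.decoders[j](z), audio_j)
    adv_loss = F.cross_entropy(domain_classifier(z.mean(dim=-1)), labels)
    ae_loss = recon_loss - lambda_adv * adv_loss        # reconstruction minus the
    opt_ae.zero_grad()                                  # domain classification term
    ae_loss.backward()
    opt_ae.step()
    return recon_loss.item(), dcn_loss.item()
```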

So let’s take a look at the training loss. Let s^j be an input sample from domain j = 1, 2, …, k, where k is the number of domains employed during training. Let E be the shared encoder and D^j the WaveNet decoder for domain j. Let C be the domain classification network, and let O(s, r) be the random augmentation procedure applied to a sample s with a random seed r (a pitch shift, used to prevent overfitting). L(o, y) is the cross-entropy loss.

Eq. 1 shows the training loss of the proposed auto-encoder. The first term is the reconstruction loss and the second term is the domain classification term, which enters with a negative sign (weighted by a coefficient λ). We want the reconstruction loss to be small while the domain classification loss becomes large; this adversarial trade-off is what gives the model its power!

Eq. 1 The loss of the proposed auto-encoder model
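The equation itself was an image in the original post, so here is my own rendering of it using the notation above; treat the exact form (in particular the weight λ on the adversarial term) as my reading of the paper rather than a verbatim copy. The DCN, for its part, is simply trained to minimize the classification term.

```latex
\mathcal{L}_{AE} \;=\; \sum_{j=1}^{k} \sum_{s^j}
  \Big[\, L\big(D^j(E(O(s^j, r))),\, s^j\big)
  \;-\; \lambda\, L\big(C(E(O(s^j, r))),\, j\big) \Big]
```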

That wraps up my brief conceptual summary of this paper; I will leave the remaining details on the experiments and findings to the original paper on arXiv. Remember to check it out!

Although the paper arguably exaggerates in describing its contribution, it still stands as a clear stepping stone that can inspire and attract more effort toward deep-learning-based, audio-domain music technology. Last but not least, it’s good material for learning how to sell your own work ;)
