Throughout history, artistic expression has been one of the foundational elements of culture. The individual’s ability to express themselves through forms of art such as painting, music, and literature has been essential to capturing what it means to be human, and until now that ability has been reserved for exactly one group: humans.
In a previous feature article, ‘Music By Machines: AI and the business of making music’, we explored the use of AI to assist us in making music, and the technology is not slowing down – we are on the cusp of a new, artificially augmented proliferation of artistic expression.
We’ve seen how painting, photography and literature can be enhanced with the help of artificial intelligence and robotic systems, but perhaps the most compelling form of creative expression yet to be fully explored through algorithmic creation and manipulation is music.
In a recent interview with Futurism, Zack Zukowski and CJ Carr – musicians and computer scientists – demonstrated the power of algorithmic manipulation of sound using Jukebox, an algorithm from OpenAI.
To demonstrate the algorithm’s ability to manipulate sound, Zukowski and Carr put the AI to the test by creating a Frank Sinatra cover of Britney Spears’ ‘Toxic’. The result is a somewhat believable vocal and instrumental track.
Despite its flaws, the final product is rather impressive. As Carr notes, rather than copying existing elements from its training data, the algorithm uses ‘neural synthesis’: working at a sample rate of 44.1kHz, it learns from the audio it is fed and then attempts to predict the next sample, piece by piece.
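To make that piece-by-piece prediction concrete, here is a minimal sketch of the autoregressive loop the description implies: a model looks at a window of recent samples and emits the next one, which then becomes part of the context for the step after. The simple linear ‘model’ below is a stand-in invented for illustration; it is not Jukebox’s network.

```python
import numpy as np

# Toy illustration of autoregressive audio generation: given a window of
# previous samples, a model predicts the next one, and the prediction is
# appended so it conditions the step after that. The "model" here is a
# placeholder (a fixed linear predictor), not Jukebox itself.

SAMPLE_RATE = 44_100   # CD-quality audio: 44,100 samples per second
CONTEXT = 64           # how many past samples the toy model sees

def toy_model(context: np.ndarray) -> float:
    """Stand-in predictor: a fixed weighted average of recent samples."""
    weights = np.linspace(0.0, 1.0, len(context))
    weights /= weights.sum()
    return float(np.dot(weights, context))

# Seed with a short burst of a 440 Hz sine wave, then generate one second
# of audio sample by sample -- exactly the piece-by-piece loop described above.
t = np.arange(CONTEXT) / SAMPLE_RATE
audio = list(np.sin(2 * np.pi * 440 * t))

for _ in range(SAMPLE_RATE):  # one second of new audio
    next_sample = toy_model(np.array(audio[-CONTEXT:]))
    audio.append(next_sample)
```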
While similar algorithms exist, Jukebox is unique in its use of a ‘transformer’ architecture. Transformers are most often used in language models, and according to Carr, Jukebox is the first to apply them to music.
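Treating music like language means modelling audio as a sequence of discrete tokens and predicting the next one with a causal transformer, just as a language model predicts the next word. The sketch below shows that idea in miniature; the layer sizes, sequence length, and random input tokens are illustrative assumptions, not Jukebox’s actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch: a causal (left-to-right) transformer over discrete audio
# tokens. Each position may only attend to earlier positions, so the model
# learns to predict the next token from what came before. All sizes are
# illustrative assumptions.

VOCAB = 2048        # matches the 2048-symbol bank described below
DIM, HEADS, LAYERS, SEQ_LEN = 128, 4, 2, 256

class ToyMusicTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(SEQ_LEN, DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=DIM, nhead=HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=LAYERS)
        self.to_logits = nn.Linear(DIM, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        n = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(n, device=tokens.device))
        # Causal mask: position i can only attend to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(n).to(tokens.device)
        x = self.encoder(x, mask=mask)
        return self.to_logits(x)  # per-position scores over the 2048 tokens

model = ToyMusicTransformer()
tokens = torch.randint(0, VOCAB, (1, SEQ_LEN))  # a stand-in token sequence
logits = model(tokens)                          # shape: (1, SEQ_LEN, VOCAB)
```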
In addition to the transformer architecture, the algorithm uses what is referred to as a Vector Quantised Variational AutoEncoder (VQ-VAE), which lets it compress elements of an audio recording and map them into a 2048-symbol bank, essentially allowing it to create a language of sound elements, just as a language model works with words. As Zukowski puts it, “They’ve come up with a new music theory.”
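The core of that symbol bank is vector quantisation: an encoder turns a chunk of audio into a vector, and the vector is snapped to its nearest neighbour in a learned codebook of 2048 entries; the index of that entry is the ‘word’ the transformer then models. Here is a minimal sketch under those assumptions, with a random codebook and a fake encoder output standing in for the learned ones.

```python
import numpy as np

# Sketch of the vector-quantisation step at the heart of a VQ-VAE: an encoder
# maps a chunk of audio to a vector, which is snapped to the nearest entry in
# a codebook. The codebook index is the discrete "symbol" fed to the
# transformer. The random codebook and encoder output here are placeholders
# for the learned ones.

rng = np.random.default_rng(0)
CODEBOOK_SIZE, LATENT_DIM = 2048, 64
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # 2048 code vectors

def quantise(latent: np.ndarray) -> int:
    """Return the index of the codebook vector closest to the encoder output."""
    distances = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(distances))

# Pretend encoder output for one chunk of audio:
latent = rng.normal(size=LATENT_DIM)
symbol = quantise(latent)                # one of 2048 discrete "sound words"
reconstruction_input = codebook[symbol]  # the decoder rebuilds audio from this
```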
This new music theory – or language – isn’t some abstract result of sound analysis. As Zukowski and Carr note, music is already quite language-like, and human interpretation of sound is what led to our idea of music theory. As the algorithm learns and tries to make predictions, it picks up on similar patterns and predictable movements in the music. In this way, the AI is creating its own method of interpretation.
Both Zukowski and Carr highlight that the demo song – Frank Sinatra covering ‘Toxic’ – wasn’t without its flaws, and that the algorithm still has major limitations. While Jukebox is able to set lyrics against an instrumental track, it still relies on existing data to create a realistic rendition of these sounds. For example, Sinatra’s voice must hit certain notes in ‘Toxic’ that he was unlikely to have hit in any actual recording. To compensate, the AI must come up with its own sound, which is unlikely to be completely accurate.
The algorithm also struggles with staying on track, often generating vocal loops or ignoring the chorus; it can get confused. Zukowski says they encountered several such anomalies while creating the demo song. “One time I was generating Frank Sinatra, and it was clearly a chorus of men and women together. It wasn’t even the right voice. It can get pretty ghostly.”
The possibilities of such algorithms are vast, and with that ability to ‘create’ come new problems. The legalities of such software are hard to define: if two songs are mashed together, who holds the rights? Whatever the answer, it is clear that we may soon have to redefine what it means to create art.