PaperSynth

Python
Keras
numpy
Swift
CoreML
Vision
AudioKit

Source code: https://github.com/ashvala/PaperSynth


Introduction:

PaperSynth is a project that uses machine learning and computer vision to do sound design. The premise of the app is to read keywords written in sequence on paper and recreate the signal chain they describe on the phone.

The use case I wanted to address with this app grew out of my sound design tutoring sessions. When I was a tutor at Berklee, I wanted to develop tools that would assist in teaching programming and sound design. An initial idea was to apply online handwriting recognition and build an app analogous to Pure Data and Max/MSP for modern tablets and phones.

However, a far more interesting use case turned out to be taking photos of these handwritten “stacks” of keywords and making them performable on the phone. In a perfect storm of technological advances, machine learning and computer vision became far more accessible with the release of Keras (a high-level neural network API that runs on top of TensorFlow) for Python and of CoreML/Vision on iOS. The previously sparse resources for these topics became much easier to come by.

How it works:

The first component in this project is handwriting recognition, which is done using the Chars74K dataset from the University of Surrey. Chars74K comes in two parts: one contains characters photographed in outdoor environments with varied typefaces and camera optics, while the other contains characters written with a stylus on a digital tablet, intended for handwriting recognition tasks. PaperSynth uses the latter.

The dataset was converted to an MNIST-style format, where the images and labels are encoded as numpy arrays (numpy is a Python library backed by C code for fast numerical computation and data manipulation). The labels are one-hot encoded: each label becomes a vector of zeros with a 1 at the index of its class. Prior to training, the data was augmented to produce many variations and permutations of the same label and to widen the range of conditions under which classification would be possible.
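To make the conversion concrete, here is a minimal sketch of that preparation step in Keras/numpy. It assumes the Chars74K handwriting images have already been rasterized into 28x28 arrays; the file names, image size, and augmentation parameters are illustrative rather than the exact values used in the project.

```python
import numpy as np
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical arrays standing in for the converted Chars74K data:
# handwritten characters rasterized to 28x28, plus integer labels.
num_classes = 62                              # digits + upper/lower case letters
images = np.load("chars74k_images.npy")       # shape (N, 28, 28, 1), float32 in [0, 1]
labels = np.load("chars74k_labels.npy")       # shape (N,), integer class indices

# One-hot encode the labels: each label becomes a vector of zeros
# with a 1 at the index of its class.
labels_onehot = to_categorical(labels, num_classes)

# Augment the data with small rotations, shifts and zooms so the model
# sees many variations of the same character.
augmenter = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)
train_batches = augmenter.flow(images, labels_onehot, batch_size=128)
```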

The dataset was used to train a convolutional neural network. The design of the network is analogous to one that would be used for an MNIST training task. The network achieved an accuracy of 89.74% after 80 epochs (passes over the dataset and its various augmentations). Following training, the model is converted to Apple’s CoreML format. From a CoreML model, Xcode automatically generates the code necessary to use the trained model in an app and exposes Swift/Objective-C APIs for it.
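The sketch below shows what such an MNIST-style network and the CoreML conversion might look like with Keras and coremltools, continuing from the previous snippet. The layer sizes and hyperparameters are illustrative; the actual architecture lives in the repository’s notebooks.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
import coremltools

num_classes = 62  # same label set as the data-preparation snippet

# An MNIST-style convolutional network: two conv/pool stages followed
# by a small fully connected classifier.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Train on the augmented batches produced by the ImageDataGenerator above.
model.fit_generator(train_batches, steps_per_epoch=len(images) // 128, epochs=80)

# Convert the trained Keras model to CoreML; Xcode then generates
# Swift/Objective-C wrappers for the resulting .mlmodel file.
coreml_model = coremltools.converters.keras.convert(model)
coreml_model.save("PaperSynthChars.mlmodel")
```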

For letter detection, I used Apple’s new Vision API, which can both detect text in an image and create bounding boxes around words and characters. However, it does not recognize what the detected text says. Instead, the cropped region for each letter in a word is fed to the machine learning model, which predicts what the character might be and returns the most probable result.
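In the app this loop runs in Swift against Vision and the generated CoreML class, but the underlying logic is simple: crop each character box, scale the crop to the network’s input size, and keep the most probable class. Below is a Python sketch of that logic against the Keras model, with hypothetical bounding boxes standing in for the rectangles Vision reports.

```python
import numpy as np
from PIL import Image

def classify_characters(photo, char_boxes, model, class_names):
    """Crop each detected character box out of the photo, resize it to the
    model's input size, and keep the most probable class for each crop.

    `photo` is a grayscale uint8 numpy array; `char_boxes` is a list of
    (x, y, width, height) rectangles standing in for the character
    bounding boxes that Vision would report.
    """
    predicted = []
    for (x, y, w, h) in char_boxes:
        crop = photo[y:y + h, x:x + w]
        # Resize the crop to the 28x28 input the network was trained on.
        crop = np.array(Image.fromarray(crop).resize((28, 28))) / 255.0
        probs = model.predict(crop.reshape(1, 28, 28, 1))[0]
        predicted.append(class_names[int(np.argmax(probs))])
    return "".join(predicted)
```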

The app uses Levenshtein distances to calculate the edit distance between a detected word and each of the supported keywords. Once a satisfactory match has been found, the app generates the synthesizer using AudioKit, an iOS/macOS/tvOS framework that makes interacting with CoreAudio far simpler. A UI is generated for each of the supported keywords, and the units are connected in a paradigm similar to a stack: the first element in the stack handles input, each element is connected sequentially to the next, and the output is produced by the last element.
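As a reference for the matching step, here is a plain-Python sketch of the Levenshtein distance and of how a noisily recognized word could be snapped to the nearest supported keyword. The keyword list and distance threshold here are hypothetical; the app’s own implementation is in Swift.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_keyword(word, keywords, max_distance=2):
    """Return the supported keyword closest to the detected word,
    or None if nothing is within the allowed edit distance."""
    best = min(keywords, key=lambda k: levenshtein(word.lower(), k))
    return best if levenshtein(word.lower(), best) <= max_distance else None

# Hypothetical keyword set; the actual list lives in the app.
print(closest_keyword("oscilator", ["oscillator", "reverb", "delay", "filter"]))
```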

Design:

A design inspiration for the project was Instagram’s story camera: it lets you do two things, take a picture or use one you have already captured. This is a user interface paradigm that most people who follow current trends in social media are already comfortable with.

A user experience aspect I wanted to explore was using photographs as save files. In general, my instinct was to make the learning curve for synthesis as gentle as possible. If the app can easily detect the text in a photograph, that photograph can simply be shared with another user and the patch will always be reproducible. More importantly, this allows for visual programming systems that aren’t merely digital but also analog. The premise behind programs like Max/MSP and Pure Data revolved around being able to print the patches you make and then reproduce them by hand on the computer.

However, since node-based interfaces are not particularly easy to interact with on today’s touch-only devices, I used pen and paper to simulate the experience.

The user interface for the app is generated using UIKit/CoreGraphics. The code builds user interface elements that provide affordances for interactivity and event handling. The UI was modeled on iOS 11’s new Control Center in order to give users a sense of familiarity.

Conclusion:

A proof of concept for the app is available on GitHub (https://github.com/ashvala/PaperSynth). The repository includes Jupyter notebooks and code to help train the neural network, along with all the Swift code that marries the Vision/CoreML APIs with AudioKit.

Going forward, the roadmap includes:

- Improving the handwriting recognition system so that it stays accurate with blurrier images and darker environments.
- Parsing directed graphs in images.
- Making each audio component more modular, allowing multiple inputs and outputs, mostly to facilitate building interesting signal chains from the graphs.