Davinsy, our autonomous machine learning system, integrates at its core Deeplomath, a deep learning kernel that continuously learns from live incoming data in real time. This paper explains in what sense Deeplomath is a Deep Learning model generator.
Let us start by defining what we mean by Deep Learning.
To answer this question, we need to define the concepts of structured and unstructured data first. Let us discuss this in the context of sound signals.
Structured vs. Unstructured data
Consider a sound signal and a classification problem (e.g. user identification through voice biometry). A classification workflow using neural networks typically involves the following steps:
- acquisition of usually ‘noisy’ signals,
- denoising using signal theory or a dedicated recurrent neural network,
- extraction of Mel Frequency Cepstral Coefficients (MFCC) images,
- extraction of x-vector or i-vector features (e.g. a vector of size 256) from the MFCC images using a recurrent neural network,
- classification (e.g. using PLDA).
The networks and the classifier have been trained beforehand and are static. If enough resources are available, through cloud access for instance, the PLDA can adapt to the environment. The classifier identifies the person talking or declares him/her unknown.
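As a concrete illustration, here is a minimal sketch of this classical pipeline. The MFCC step uses the real librosa library; the embedding network, the PLDA scorer, the enrolled-speaker table and the threshold are hypothetical placeholders, since the actual pretrained components are application-specific.

```python
# Minimal sketch of the classical speaker-identification pipeline.
# librosa is real; `network`, `plda_scorer`, `enrolled` and `threshold`
# are hypothetical placeholders for the pretrained, static components.
import librosa

def extract_mfcc(path, n_mfcc=28):
    # Acquisition + MFCC: load the (possibly denoised) signal and
    # compute its MFCC image of shape (n_mfcc, n_frames).
    signal, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

def identify(path, network, plda_scorer, enrolled, threshold):
    mfcc = extract_mfcc(path)
    # Structuring: a pretrained network maps the MFCC image to a
    # fixed-size feature (e.g. a 256-dimensional x-vector).
    xvec = network(mfcc)
    # Classification: PLDA scores the feature against each enrolled
    # speaker; below the threshold, the speaker is declared unknown.
    scores = {name: plda_scorer(xvec, ref) for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```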
We see that the workflow involves three signal transformations:
- Denoising, where the nature of the signal does not change: it remains an unstructured time series.
- MFCC, where the time series is transformed into a semi-structured spectral signal.
- X-vector, where the spectral signal is transformed into a fully structured vector.
By unstructured, we mean that no similar patterns are present in the signal. This is clearly the case for a voice time series. The following pictures show the transformation of a clean voice signal into MFCC, with two intermediate states (power spectrum and Mel power spectrum). One sees that some structures appear along the vertical (frequency) axis, and if one takes the average along the horizontal (frame) axis, the resulting curves are very much alike (here vectors of size 28). The distance between these vectors can be consistently defined. The picture on the right is the superposition of several averaged MFCC images for different speakers. The MFCC image is what we call semi-structured, as the information has been structured along at least one privileged axis. Dimension reduction using privileged directions and symmetries is often used in physics (e.g. depth averaging leading to the Saint-Venant shallow water equations).
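The frame-axis averaging described above is simple to write down. A minimal sketch, assuming an MFCC image of shape (28, n_frames) such as the one produced in the pipeline sketch above:

```python
import numpy as np

def average_mfcc(mfcc):
    # Average along the horizontal (frame) axis: an image of shape
    # (28, n_frames) collapses into a single vector of size 28.
    return mfcc.mean(axis=1)

def mfcc_distance(mfcc_a, mfcc_b):
    # A consistently defined distance between two recordings via their
    # averaged MFCC vectors (here, plain Euclidean distance).
    return float(np.linalg.norm(average_mfcc(mfcc_a) - average_mfcc(mfcc_b)))
```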
The semi-structuring here is based on signal theory. This structuring can also be enforced through feature extraction by a deep neural network. It appears that NNs are good information ‘structurers’ when they are trained on large databases. They can be used in a fixed way (we then speak of transfer learning): a network coming from the world of images (cats and dogs) can be used on sound to structure the information (MFCC image to x-vector). The network does not necessarily need to be trained to search for hidden patterns in sound-based images specifically. What is required is for the structuring to be as discriminating as possible. This is ensured and improved by training the NN on larger and larger datasets on one hand, and by increasing the size and depth of the network on the other. We discussed this curse of dimensionality in our previous blog posts on Deeplomath’s features and on how Deeplomath gets around this bottleneck.
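As a sketch of this kind of transfer learning, the snippet below applies a network pretrained on natural images, unchanged, to an MFCC image in order to obtain a fixed-size feature vector. torchvision’s ResNet-18 is used purely as an assumed stand-in for the actual extractor; input normalization is omitted for brevity.

```python
# Transfer-learning sketch: a fixed image network structures MFCC
# images into fixed-size vectors. ResNet-18 is an assumed stand-in.
import torch
import torchvision.models as models

extractor = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor.fc = torch.nn.Identity()  # drop the ImageNet classification head
extractor.eval()

def mfcc_to_vector(mfcc):
    # Replicate the single-channel MFCC image over three channels to
    # match the pretrained network's expected input, then extract a
    # 512-dimensional feature vector.
    x = torch.as_tensor(mfcc, dtype=torch.float32)
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        return extractor(x).squeeze(0)
```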
Deep Learning with Deeplomath
Deeplomath looks for distant and hidden correlations through the multi-scale analysis that deep learning allows, building a function-specific neural network. This is done based on what we know of ‘the world of the known’, and Deeplomath also identifies the boundary of this world, which permits rejecting unknown configurations (e.g. impostor rejection in speaker recognition through voice).
Deeplomath greatly simplifies the workflow. First, Deeplomath handles both semi-structured and structured data, which means it can work either directly with MFCCs or with x-vectors. In addition, Deeplomath brings the following simplifications to the workflow:
- Deeplomath’s embedded continuous learning removes the need for denoising.
- Deeplomath permits the use of a lighter x-vector generator network.
- Deeplomath does not need an additional classifier.
The second point is important, as the current recurrent networks used for structuring MFCC images are very large (several MB) and clearly unsuitable for efficient embedding. Our experience shows that a second structuring step improves the performance of Deeplomath. Whether to apply it can be chosen according to the accuracy targets (e.g. for security-oriented applications) or the available resources.
Finally, as Deeplomath’s learning is embedded and requires no data transfer, it permits both continuous learning of environment changes (e.g. noise), which stops model drift, and reinforcement learning that takes user feedback and corrections into account.
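Putting these simplifications together, the workflow collapses to a few steps. The sketch below is hypothetical: the model object and its predict() and learn() methods are illustrative names only, not the actual Deeplomath API.

```python
# Hypothetical sketch of the simplified workflow; the model object and
# its methods are illustrative names, not the actual Deeplomath API.
import librosa

def simplified_pipeline(model, path, feedback=None):
    # No denoising stage: the raw (possibly noisy) signal is turned
    # directly into a semi-structured MFCC image.
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=28)
    # No separate classifier: the model predicts directly, and its
    # embedded continuous learning absorbs environment changes and
    # optional user feedback on-device, without any data transfer.
    label = model.predict(mfcc)
    model.learn(mfcc, feedback=feedback)
    return label
```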
Heterogeneous data
Industrial demands increasingly concern situations where data is not only unstructured but also heterogeneous (image, sound, mechanical motion, client-based analytical data, etc.).
The Davinsy strategy consists of splitting a complex problem into simple ones. For example, suppose we have sensors for vibrations (IMUs), sounds (microphones) and images (cameras). These are three different unstructured inputs gathered into one large heterogeneous one. Instead of building a large neural network handling this heterogeneous unstructured input, Davinsy splits the problem into three: it builds three models (with suitable architectures automatically identified), one each for the vibrations, sounds and images, and continuously updates them with ground truth and, when available, user feedback. Each model receives a specific preprocessing that structures its dedicated information.
This has the advantage of keeping the embedded datasets small and independently updatable. It also means that a degradation in one sensor’s acquisition quality only affects the corresponding prediction. Finally, the treatments can be synchronous or asynchronous, sequential or parallel, as different acquisitions do not necessarily share the same time scale.
This is done by Davinsy’s Application Service Layer, which offers a high-level software interface to describe the resolution of the problems and the conditional sequencing or parallelism connecting virtual models. “Virtual” means these are templates describing how the evanescent models are generated by the embedded Deeplomath Augmented Learning Engine (DALE). To define how those models connect, we introduce “Application Charts”, which are executed by the Application Service Layer. An application chart describes how data flows through the models, choosing which model to execute according to the outcome of other inferences.
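To make the idea concrete, here is a hypothetical sketch of an application chart; the class names, methods and sensor interface are illustrative, not the actual Application Service Layer API.

```python
# Hypothetical application-chart sketch: VirtualModel, run_chart and
# the sensor interface are illustrative, not the actual Davinsy API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class VirtualModel:
    # A template from which the embedded engine (DALE) would generate
    # an evanescent model; reduced here to preprocessing + prediction.
    preprocess: Callable
    predict: Callable

def run_chart(models: Dict[str, VirtualModel], sensors) -> Dict[str, str]:
    # Each modality is structured by its own preprocessing and scored
    # by its own model; these inferences could run in parallel.
    vib = models["vibration"].predict(models["vibration"].preprocess(sensors.read("imu")))
    snd = models["sound"].predict(models["sound"].preprocess(sensors.read("mic")))
    results = {"vibration": vib, "sound": snd}
    # Conditional sequencing: the image model is executed only when an
    # earlier inference flags an anomaly (illustrative rule).
    if "anomaly" in (vib, snd):
        img = models["image"]
        results["image"] = img.predict(img.preprocess(sensors.read("camera")))
    return results
```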
Impact on few-shot learning
Few-shot learning requires embedding a labelled ‘support’ dataset. Bondzai’s DavinSy embeds raw data to adapt to additive noise: if the noise changes, DavinSy rebuilds its network using the noise-augmented dataset. Few-shot learning also needs a feature (or signature) extractor, usually based on a pretrained CNN; this is why competitors need powerful computational capacity (usually GPU-based). If embedding is necessary, this network must be reduced and binarized, which greatly shrinks its domain of validity. In other words, if adaptation is necessary, the transfer learning capacity must be maintained, which means some parts of the network should be left unbinarized for adaptation. Bondzai has its own feature extraction, which is not subject to this limitation and is compatible with execution on MCUs.
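A minimal sketch of the noise-augmentation step described above, assuming mono raw recordings and a noise capture at least as long as each recording; the rebuild itself is Deeplomath-specific and left out.

```python
import numpy as np

def augment_with_noise(raw_signals, noise, snr_db=10.0):
    # Mix each embedded raw recording with the newly observed noise at
    # a target signal-to-noise ratio before rebuilding the network.
    augmented = []
    for sig in raw_signals:
        n = noise[: len(sig)]
        # Scale the noise so that P_signal / P_noise = 10^(snr_db / 10).
        scale = np.sqrt(np.mean(sig**2) / (np.mean(n**2) * 10 ** (snr_db / 10)))
        augmented.append(sig + scale * n)
    return augmented
```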