Machine Learning: the Lab vs the Real World
- Stuart Feffer, CEO + Co-founder
- Jan 17, 2017
- 5 min read
"In theory there's no difference between theory and practice. In practice there is."
-- Yogi Berra

Not long ago, TechCrunch ran a story reporting on Carnegie Mellon research showing that an “Overclocked smartwatch sensor uses vibrations to sense gestures, objects and locations.” These folks at the CMU Human-Computer Interaction Institute had apparently modified a smartwatch OS to capture 4 kHz accelerometer waveforms (most wearable devices capture at rates up to 0.1 kHz), and discovered that with more data you could detect a lot more things. They could detect specific hand gestures, and could even tell what kind of thing a person was touching or holding based on vibrations communicated through the human body. (Is that an electric toothbrush, a stapler, or the steering wheel of a running automobile?)
"Duh!"
To those of us working in the field, including those at Carnegie Mellon, this was no great revelation. “Duh! Of course you can!” It was a nice-but-limited academic confirmation of what many people already know and are working on. TechCrunch, however, in typical breathless fashion, reported it as if it were news. Apparently, the reporter was unaware of the many commercially available products that perform gesture recognition (among them Myo from Thalmic Labs, using its proprietary hardware, or some 20 others offering smartwatch tools). It seems he was also completely unaware of commercially available toolkits for identifying very subtle vibrations and accelerometry to detect machine conditions in noisy, complex environments (like our own Reality AI for Machine Health), or to detect user activity and environment in wearables (Reality AI for Wearables).
But my purpose is not to air sour grapes over lazy reporting. Rather, I’d like to use this case to illustrate some key issues about using machine learning to make products for the real world: Generalization vs Overtraining, and the difference between a laboratory trial (like that study) and a real-world deployment.
Generalization and Overtraining
Generalization refers to the ability of a classifier or detector, built using machine learning, to correctly identify examples that were not included in the original training set. Overtraining refers to a classifier that has learned to identify with high accuracy the specific examples on which it was trained, but does poorly on similar examples it hasn't seen before. An overtrained classifier has learned its training set “too well” – in effect memorizing the specifics of the training examples without the ability to spot similar examples again in the wild. That’s ok in the lab when you’re trying to determine whether something is detectable at all, but an overtrained classifier will never be useful out in the real world.
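To make the symptom concrete, here is a minimal sketch (Python with scikit-learn on synthetic data; these are stand-ins, not the CMU study's data or our tooling) of what overtraining looks like: an unconstrained model scores nearly perfectly on the data it has memorized, then does far worse on examples held out from training.

```python
# Illustrative sketch only (scikit-learn on synthetic data, not Reality AI's
# tooling): an unconstrained decision tree memorizes a small, noisy training
# set and then does far worse on held-out examples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                # small, noisy dataset
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier()                              # no depth limit: free to memorize
model.fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))   # near 1.0
print("holdout accuracy: ", model.score(X_test, y_test))     # substantially lower
```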

Illustration from the CMU study using vibrations captured with an overclocked smartwatch to detect what object a person is holding.
Typically, the best guard against overtraining is to use a training set that captures as much of the expected variation in target and environment as possible. If you want to detect when a type of machine is exhibiting a particular condition, for example, include in your training data many examples of that type of machine exhibiting that condition, and exhibiting it under a range of operating conditions, loads, etc.
It also helps to be very skeptical of “perfect” results. Accuracy nearing 100% on small sample sets is a classic symptom of overtraining.
It’s impossible to be sure without looking more closely at the underlying data, model, and validation results, but this CMU study shows classic signs of overtraining. Both the training and validation sets contain a single example of each target machine collected under carefully controlled conditions. And to validate, they appear to use a group of 17 subjects holding the same single examples of each machine. In a nod to capturing variation, they have each subject stand in different rooms when holding the example machines, but it's a far cry from the full extent of real-world variability. Their result has most objects hitting 100% accuracy, with a couple of objects showing a little lower.
Small sample sizes. Reuse of training objects for validation. Limited variation. Very high accuracy. … Classic overtraining.
Detect Overtraining and Predict Generalization
It is possible to detect overtraining and estimate how well a machine learning classifier or detector will generalize. At Reality AI, our go-to diagnostic is K-fold validation, generated routinely by our tools.
K-fold validation involves repeatedly 1) holding out a randomly selected portion of the training data (say 10%), 2) training on the remainder (90%), 3) classifying the holdout data using the 90% trained model, and 4) recording the results. Generally, hold-outs do not overlap, so, for example, 10 independent trials would be completed for a 10% holdout. Holdouts may be balanced across groups and validation may be averaged over multiple runs, but the key is that in each iteration the classifier is tested on data that was not part of its training. The accuracy will almost certainly be lower than what you compute by applying the model to its training data (a stat we refer to as “class separation”, rather than accuracy), but it will be a much better predictor of how well the classifier will perform in the wild – at least to the degree that your training set resembles the real world.
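For readers who want to see the mechanics, here is a minimal K-fold sketch in Python using scikit-learn and synthetic stand-in data; our own diagnostic runs inside Reality AI Tools, so treat this only as an illustration of the procedure described above.

```python
# Minimal K-fold sketch (scikit-learn, synthetic stand-in data): train on ~90%,
# score the held-out ~10%, repeat with non-overlapping holdouts, then average.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 16))                       # stand-in feature vectors
y = (X[:, :2].sum(axis=1) > 0).astype(int)           # stand-in labels

model = SVC()
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model.fit(X[train_idx], y[train_idx])                     # train on the remainder
    scores.append(model.score(X[test_idx], y[test_idx]))      # score the held-out fold

print("per-fold accuracy:", np.round(scores, 3))
print("mean K-fold accuracy:", np.mean(scores))
```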
Counterintuitively, classifiers with weaker class separation often hold up better in K-fold. It is not uncommon for near-perfect accuracy on the training data to drop precipitously in K-fold, while a slightly weaker classifier maintains excellent generalization performance. And isn't that what you're really after? Better performance in the real world on new observations?
Getting high class separation, but low K-fold? You have a model that has been overtrained, with poor ability to generalize. Back to the drawing board. Maybe select a less aggressive machine learning model, or revisit your feature selection. Reality AI does this automatically.
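As an illustration of that trade-off, here is a hedged sketch (again scikit-learn on synthetic data; the specific models are illustrative choices, not recommendations) comparing class separation, that is, accuracy on the training data itself, against K-fold accuracy for an aggressive model and a simpler one.

```python
# Hedged comparison (scikit-learn, synthetic data; model choices are only
# illustrative): an aggressive model vs. a simpler one, scored both on the
# training data ("class separation") and under 10-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.5, size=200) > 0).astype(int)

candidates = [("deep tree (aggressive)", DecisionTreeClassifier()),
              ("logistic regression (simpler)", LogisticRegression(max_iter=1000))]

for name, model in candidates:
    model.fit(X, y)
    train_acc = model.score(X, y)                            # "class separation"
    kfold_acc = cross_val_score(model, X, y, cv=10).mean()   # generalization estimate
    print(f"{name}: train={train_acc:.2f}  k-fold={kfold_acc:.2f}")
```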
Be careful, though, because the converse is not true: A good K-fold does not guarantee a deployable classifier. The only way to know for sure what you've missed in the lab is to test in the wild. Not perfect? No problem: collect more training data capturing more examples of underrepresented variation. A good development tool (like ours) will make it easy to support rapid, iterative improvements of your classifiers.
Lab Experiments vs Real World Products
Lab experiments like this CMU study don’t need to care much about generalization – they are constructed to illustrate a very specific point, prove a concept, and move on. Real world products, on the other hand, must perform a useful function in a variety of unforeseen circumstances. For machine learning classifiers used in real world products, the ability to generalize is critical.
But it's not the only thing. Deployment considerations matter too. Can it run in the cloud, or is it destined for a processor-, memory- and/or power-constrained environment? (To the CMU guys: good luck getting acceptable battery life out of an overclocked smartwatch!) How computationally intensive is the solution, and can it be run in the target environment with the memory and processing cycles available to it? What response-time or latency is acceptable? These issues must be factored into a product design, and into the choice of machine-learning model supporting that product.
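If it helps to see what checking those constraints might look like, here is a rough, hypothetical sketch that measures two deployment-relevant numbers for a trained model, serialized size and per-prediction latency, using scikit-learn and pickle as stand-ins for whatever actually runs in the target environment.

```python
# Hypothetical sketch: rough proxies for deployment constraints of a trained model.
import pickle
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 32))                 # stand-in feature vectors
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Serialized size as a crude proxy for the model's memory footprint.
model_kib = len(pickle.dumps(model)) / 1024

# Average single-observation prediction latency.
sample = X[:1]
n_calls = 1000
start = time.perf_counter()
for _ in range(n_calls):
    model.predict(sample)
elapsed = time.perf_counter() - start

print(f"serialized model size: {model_kib:.1f} KiB")
print(f"avg latency per prediction: {1000 * elapsed / n_calls:.3f} ms")
```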
Tools like Reality AI can help. R&D engineers use Reality AI Tools to create machine learning-based signal classifiers and detectors for real world products, including wearables and machines, and can explore connections between sample rate, computational intensity and accuracy. They can train new models and run k-fold diagnostics (among others) to guard against overtraining and predict ability to generalize. And when they’re done, they can deploy to the cloud, or export code to be compiled for their specific embedded environment.
R&D engineers creating real-world products don’t have the luxury of controlled environments – overtraining leads to a failed product. Lab experiments don’t face that reality. Neither do TechCrunch reporters.