The new world of the Internet of Things (IoT) has brought us into an era where science fiction movie scenarios are quickly becoming reality. Most of us remember scenes in which Tony Stark interacted with Jarvis using natural speech, manipulated a 3D hologram with his hands, or was notified by Jarvis that key individuals had arrived. As entertaining as those scenes were to watch, we have reached the point where we can actually build such systems in real life.
Advances in sensor technology have enabled automated devices to intelligently interpret sounds, images, and other data without requiring extremely powerful processors. As a result, it is now possible to design compact, low-power systems that either enable greater autonomy in machine-to-machine (M2M) interactions or offer increasingly natural perceptual user interfaces for human-machine interaction: user interfaces that interpret and react to gestures, natural speech, facial expressions, object recognition, and other non-traditional inputs to computing systems.
Early examples of perceptual user interfaces worked, yet they often came with constraints that left us wanting more. Apple’s Siri, Google’s voice search, and Microsoft’s Kinect offered new ways to interact with our devices, but conditions had to be just right for the technology to work as designed. Interpreting voice queries required a quiet environment, and the processing was often handled in the cloud, which in turn required a reliable broadband internet connection. Even when early voice recognition systems worked as designed, the best results were still obtained by engineering the search phrase around certain keywords. Optical tracking systems like the Kinect required good lighting and a field of view free of reflective surfaces. Without special tracking devices, these systems could only capture large movements, and small gestures were all but ignored. Machine vision systems had their own limitations: motion-detecting security systems could only tell that “something” had changed, and their high trigger rate required a human to review every event and decide on the appropriate action.
Even M2M and IoT devices usually took considerable effort to set up and control. The steps needed to program and operate these devices from a smartphone or computer frequently negated the value of their “connected” status, particularly when compared to simply walking up and controlling them directly. Even the latest and seemingly greatest innovations still have limitations. Amazon’s Echo and Google’s Home have overcome the constraints of background noise in far-field scenarios, and they can understand natural speech and context, but they can still be accidentally activated when their wake words are heard in commercials. They also still rely on cloud connectivity for most of their audio processing.
Overcoming the Limitations of Yesterday’s Sensing Systems
The engineering world never stands still, however, and numerous companies have been working hard to bring advanced natural language processing and other intuitive controls to the Internet of Things. Embedded speech recognition, vision processing, and other sensing technologies have reached the point where our ideas about how perceptual user interfaces ‘should work’ are nearly matched by how they actually do work.
Natural language and advanced audio processors now enable better recognition of far-field spoken commands, with intelligent direction detection and beamforming that help distinguish spoken commands from false triggers such as advertisements. On-chip audio processing is also becoming a reality, freeing devices from cloud-based voice interpretation and enabling autonomous operation without an internet connection. Conexant and Microsemi are two of the companies leading these innovations in natural voice processing. Conexant's AudioSmart technology leverages multi-channel, high-dynamic-range audio sensors to deliver leading far-field speech recognition for voice-interactive TVs and set-top boxes, smart appliances, telepresence, and gaming consoles. Microsemi's AcuEdge sensing solutions on their Timberwolf platform bring direction-of-arrival detection, beamforming, multi-channel echo cancellation, and other technologies to their far-field audio processors. This allows IoT, security, and automotive devices to operate in noisier environments, pick out and prioritize the processing of important signals, and even distinguish user commands from commands “overheard” from TV commercials and other audio sources that are not their owners, guests, or passengers.
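To give a concrete sense of what beamforming does, the sketch below shows a minimal delay-and-sum beamformer in Python with NumPy. The microphone geometry, sample rate, and signals are placeholder assumptions for illustration only; commercial far-field processors such as AudioSmart and Timberwolf implement far more sophisticated adaptive algorithms in dedicated hardware.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz, a typical speech-capture rate (assumed)

def delay_and_sum(channels, mic_positions, look_direction):
    """Steer a simple microphone array toward `look_direction`.

    channels      : (num_mics, num_samples) array of time-aligned PCM samples
    mic_positions : (num_mics, 3) microphone coordinates in meters
    look_direction: unit vector pointing from the array toward the talker
    """
    num_mics, num_samples = channels.shape
    # Relative arrival time at each mic for a plane wave from look_direction
    delays_sec = mic_positions @ look_direction / SPEED_OF_SOUND
    delay_samples = np.round(delays_sec * SAMPLE_RATE).astype(int)
    delay_samples -= delay_samples.min()   # make all shifts non-negative

    aligned = np.zeros((num_mics, num_samples))
    for m in range(num_mics):
        d = delay_samples[m]
        aligned[m, d:] = channels[m, : num_samples - d]
    # Averaging the aligned channels reinforces sound from the look direction
    # and partially cancels off-axis noise (e.g., a TV in the corner).
    return aligned.mean(axis=0)

# Example: a 4-mic linear array spaced 3 cm apart, steered broadside
mics = np.array([[0.00, 0, 0], [0.03, 0, 0], [0.06, 0, 0], [0.09, 0, 0]])
noisy = np.random.randn(4, SAMPLE_RATE)          # placeholder for real audio
enhanced = delay_and_sum(noisy, mics, np.array([0.0, 1.0, 0.0]))
```

The idea is simply that samples arriving from the chosen direction add up coherently, while off-axis sounds partially cancel; production devices combine this with echo cancellation and adaptive noise suppression.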
New natural speech systems also include features that allow connected IoT devices to be controlled by voice. The Amazon Echo and Google Home can already use their natural language interfaces to let users interact with 2.4 GHz IoT devices via WeMo and Thread, opening up many opportunities for low-power SoC-based gadgets that can complete tasks based on simple spoken commands. Belkin’s WeMo is built on top of the 2.4 GHz 802.11 standards, making it easy to implement in IoT devices, but the higher power requirements of 802.11 make it more appropriate for appliances and other connected devices with external power sources. Examples of low-power 802.11 SoCs well suited to these applications include Cypress' CBM4390X family and Atmel’s ATWINC1510 IoT module. Thread is a newer wireless mesh IoT protocol built on top of 6LoWPAN and 802.15.4, making it ideal for ultra-low-power IoT devices. It has several distinct advantages over other 802.15.4 protocols: it is IP addressable and has built-in AES encryption for stronger device security. NXP is a founding member of Thread, and their FRDM-KW41Z Freedom Development Board is a great way to quickly prototype Thread-connected devices.
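Because Thread nodes are IP addressable, controlling one from a host application can be as simple as sending a CoAP request to the device’s IPv6 address. The sketch below uses the Python aiocoap library; the address and the “/light” resource are hypothetical placeholders, and a real product would expose whatever resources its firmware and border router define.

```python
import asyncio
from aiocoap import Context, Message, PUT

# Hypothetical mesh-local address and resource of a Thread light node
THREAD_NODE = "coap://[fd00:db8::1]/light"

async def set_light(on: bool):
    # Create a CoAP client context and send a PUT to the node's resource
    ctx = await Context.create_client_context()
    request = Message(code=PUT, uri=THREAD_NODE, payload=b"1" if on else b"0")
    response = await ctx.request(request).response
    print("Node replied:", response.code, response.payload)

if __name__ == "__main__":
    # e.g., triggered after a voice assistant recognizes "turn on the light"
    asyncio.run(set_light(True))
```

In practice the voice front end (Echo, Google Home, or an embedded speech processor) would translate the recognized command into this kind of request, either locally or through its cloud service.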
Advances in high dynamic range (HDR) imaging sensors are coming from OmniVision and ON Semiconductor’s Aptina brand, both of which offer advanced sensing solutions for automotive, industrial, security, wearable, and other applications. These sensors enable machine vision systems to capture higher-quality video that is more easily parsed by image processors, bringing better facial detection, object tracking, and other advanced processing to camera systems and dramatically increasing the reliability of their detection and processing algorithms. The benefits are manifold: automated inspection and assembly systems can operate at higher speeds with higher quality, robots and drones can navigate with greater safety and autonomy, and security and surveillance systems can minimize or eliminate the need for operator review and interact directly with other machines and systems. OmniVision’s OV106xx line of HDR video capture devices and ON Semi’s AR0230 and AP020x HDR sensors and image processors let devices capture their surroundings in enough detail to respond intelligently, while avoiding problems caused by bright light sources washing out the image and obscuring important details.
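To illustrate the kind of downstream processing that benefits from cleaner HDR frames, the sketch below runs a classical Haar-cascade face detector from OpenCV over a live camera feed. The camera index and bundled cascade file are illustrative defaults, not specifics of the sensors named above; the point is that detectors like this depend on local contrast, which is exactly what HDR capture preserves when a bright light source would otherwise wash out the image.

```python
import cv2

# OpenCV ships a stock frontal-face Haar cascade; load it from the package data
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)   # any camera that exposes frames to the host

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Blown-out highlights flatten local contrast, which is what Haar
    # features rely on; HDR capture keeps that detail available.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```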
New Types of Sensors for User Interfaces
Even with these advances, however, machine vision systems can still be less than ideal for interacting with wearables or other devices that need to discern fine details like gestures. Low-power capacitive and inductive sensing solutions can meet some of these needs, but innovative applications of low-power radar have introduced ultra-low-power, precision gesture sensing for wearables, phones, computers, cars, and other IoT devices. These advances in radar technology are being developed by Google’s Project Soli, a partnership between Infineon and Google that uses radar to enable virtual tool gestures: the operation of imaginary controls between the fingers of a hand. Buttons, dials, knobs, sliders, and other two- or three-dimensional controls can be virtualized, and the user’s own fingers provide haptic feedback that makes the interaction feel very natural. One of the most impressive early demonstrations of this technology is a smartwatch demo in which gestures mimic the motion of turning a crown to set an analog watch, without the user ever needing to touch the watch to adjust dials and other settings. This and other innovative virtual control mechanisms are highlighted in Project Soli’s introductory video.
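Project Soli’s actual signal processing is not public, so the sketch below only illustrates the general idea of a virtual control: a stream of per-frame rotation estimates from a gesture sensor, whatever its type, is accumulated and quantized into discrete steps of a “virtual dial.” Every interface and number in it is a hypothetical placeholder used purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class VirtualDial:
    """Accumulates small per-frame rotations into discrete dial steps."""
    value: int = 0              # current setting, e.g., minutes on a watch face
    degrees_per_step: float = 15.0
    _accum: float = 0.0

    def update(self, rotation_deg: float) -> int:
        # Add this frame's estimated rotation, then convert whole steps
        # into changes of the dial value, keeping the remainder for later.
        self._accum += rotation_deg
        steps, self._accum = divmod(self._accum, self.degrees_per_step)
        self.value += int(steps)
        return self.value

# Example: a slow clockwise "crown turn" spread over several sensor frames
dial = VirtualDial()
for frame_rotation in [4.0, 6.0, 5.5, 7.0, 8.5]:   # degrees per frame (made up)
    setting = dial.update(frame_rotation)
print("Virtual dial now at:", setting)
```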
These advances in sensing are bringing new capabilities to M2M and IoT devices and making users’ interactions with those devices more natural than ever before. The possibilities are truly remarkable, like science fiction brought out of the movies and into real life.