Xiaomi’s acoustic speech technology is now fully self-developed

Xiaomi recently released the Xiaoai Speaker Art. As the ninth smart speaker launched by Xiaomi, it brings a major upgrade to the underlying acoustic and voice technology. It carries the third generation of “Xiaoai schoolmate” and supports emotional voice interaction, whole-house playback and nearest-device wake-up. Xiaomi’s acoustic speech technology is now fully self-developed, and the company continues to lead in some of these self-developed areas.

With the development of artificial intelligence, major manufacturers that have already achieved human-computer dialogue are now exploring emotional voice interaction. Emotion is by nature subjective and varied, which makes emotional voice interaction a challenge for smart devices. The technical bar is high: the technology, data and quality-control teams, among others, must agree on standards such as emotional intensity and how each emotion is to be interpreted, so that inherently subjective emotional speech can be unified and standardized.

To add emotion to the machine, Xiaomi AI Laboratory, working with limited emotional data, combined different acoustic models with different vocoders and eventually launched a natural, human-like emotional TTS, becoming the first company in the industry to deploy emotional TTS at scale.

Thanks to continued development at Xiaomi AI Laboratory, the Xiaomi Xiaoai Speaker Art fully supports emotional voice interaction. Starting from limited audio data covering different emotions, such as happiness, concern, shyness and surprise, the team trained and iterated on acoustic models with different techniques. The result supports emotional TTS synthesis and gives “Xiaoai schoolmate” an emotional, human-like character.
With richer emotion, “Xiaoai schoolmate” can offer users a more varied voice interaction experience and bring richer, more three-dimensional and more realistic voice interaction to IoT devices.

Users simply say “play XXX in the whole house” to “Xiaoai schoolmate”. With no prior manual setup in the app, the Xiaomi Xiaoai Speaker Art completes the task from a single spoken sentence, which makes it more convenient to use.

To support whole-house playback by voice, the speaker needs AIoT playback technology. It also relies on a set of audio synchronization technologies that solve the problem of keeping different speaker models in sync.

In addition, the stereo function can be started by voice command or from the app. During the demonstration, the reporter learned that once a track is selected, the audio stream from the cloud is sent to speaker A. Speaker A separates the stereo signal into left and right channels, plays the left channel itself and sends the right-channel stream to speaker B, which plays it. Precise synchronization ensures that speakers A and B play the left and right stereo channels at the same time. For whole-house playback, the group can be created by voice command or in the app. The audio stream is sent to speaker C, which mixes it down to a mono signal and sends it to the other speakers in the group for simultaneous playback. This mode supports multiple devices and does not distinguish between channels.

The Xiaomi Xiaoai Speaker Art also supports wake-up with a two-microphone array. For the microphone array, Xiaomi uses a two-microphone blind source separation noise-reduction front end.
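The stereo-pair and whole-house routing described above can be illustrated with a small sketch. It is not Xiaomi's implementation: the function names are hypothetical, audio is modeled as a list of (left, right) sample pairs, the mono downmix is assumed to be a simple average of the two channels, and network transport and clock synchronization are left out entirely.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Tuple[float, float]  # one stereo sample: (left, right)


def split_stereo(frames: Sequence[Frame]) -> Tuple[List[float], List[float]]:
    """Separate stereo frames into a left stream and a right stream."""
    left = [l for l, _ in frames]
    right = [r for _, r in frames]
    return left, right


def mix_to_mono(frames: Sequence[Frame]) -> List[float]:
    """Downmix to mono. Averaging the channels is an assumption of this sketch."""
    return [(l + r) / 2.0 for l, r in frames]


def route_stereo_pair(frames: Sequence[Frame]) -> Dict[str, List[float]]:
    """Speaker A keeps the left channel and forwards the right channel to B."""
    left, right = split_stereo(frames)
    return {"A": left, "B": right}


def route_whole_house(frames: Sequence[Frame],
                      group: Sequence[str]) -> Dict[str, List[float]]:
    """The receiving speaker mixes to mono and sends the same stream to the group."""
    mono = mix_to_mono(frames)
    return {name: list(mono) for name in group}


if __name__ == "__main__":
    frames = [(1.0, 0.0), (0.5, -0.5), (0.0, 1.0)]
    print(route_stereo_pair(frames))
    print(route_whole_house(frames, ["C", "D", "E"]))
```

In stereo mode each speaker receives a different stream, so the pair is limited to two roles. In whole-house mode every speaker receives the same mono stream, which is why the group can grow without any channel assignment.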
By combining blind source separation, noise reduction, echo cancellation and other techniques with voice enhancement, Xiaomi can remove strong noise interference and obtain clean, accurate speech audio in noisy multi-source environments and while the speaker itself is playing music. For wake-up, the self-developed algorithm uses a two-stage strategy to balance low power consumption with high performance. The low-power standby wake-word detection model uses subsampling and shared hidden layers to reduce the model's resource consumption while keeping recall high. The high-performance false-wake detection model uses coarse-grained modeling units and combines local information with long-term context to suppress false wake-ups effectively. Automatically mining highly discriminative training samples from massive data, together with data augmentation, improves the robustness of the wake-up model in low-SNR and low-volume scenarios.

Building on this wake-up upgrade, the Xiaomi Xiaoai Speaker Art has become the first speaker in the industry to support turning off an alarm across devices. The reporter saw that when the alarm rings on a distant speaker, waking up a nearby speaker is enough to turn it off.

At present, 250 million smart hardware devices are connected to the Xiaomi IoT platform, and speaker shipments have reached 22 million. With a user base this large, continuously improving the basic experience and making products more innovative in their AI experience is a very important mission for the self-developed AI team. It is reported that Xiaomi will next focus on applications in complex home scenarios, intelligent acoustic sensing and multi-sensor fusion.
In homes with complex layouts, the aim is to keep the algorithms usable, let each device actively sense its own surroundings and adapt its algorithms to them, and fuse the resulting data to achieve multi-dimensional intelligent perception and a better product experience for users.
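The two-stage wake-up strategy described earlier follows a common cascade pattern, which can be sketched roughly as follows. Everything here is illustrative: the function names, the thresholds and the two toy scoring functions are assumptions that stand in for the real detection and false-wake models, whose details the article does not give.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[float]], float]  # returns a confidence in [0, 1]


def two_stage_wake(audio: Sequence[float],
                   detector: Scorer,
                   verifier: Scorer,
                   detect_threshold: float = 0.3,
                   verify_threshold: float = 0.8) -> bool:
    """Cascade: a cheap always-on detector gates a costlier verifier."""
    # Stage 1: low-power model with a permissive threshold, tuned for recall.
    if detector(audio) < detect_threshold:
        return False  # the expensive model never runs on most audio
    # Stage 2: stricter model that rejects false wake-ups among the candidates.
    return verifier(audio) >= verify_threshold


def cheap_score(audio: Sequence[float]) -> float:
    """Toy stand-in for the standby detector: peak level, capped at 1.0."""
    return min(1.0, max((abs(x) for x in audio), default=0.0))


def strict_score(audio: Sequence[float]) -> float:
    """Toy stand-in for the false-wake model: share of loud samples."""
    if not audio:
        return 0.0
    return sum(1 for x in audio if abs(x) > 0.5) / len(audio)


if __name__ == "__main__":
    quiet = [0.0, 0.1, 0.05]
    sustained = [0.9, 0.8, 0.7, 0.9]
    spike = [0.9, 0.0, 0.0, 0.0]
    for clip in (quiet, sustained, spike):
        print(two_stage_wake(clip, cheap_score, strict_score))
```

The permissive first threshold keeps recall high at low cost, while the second stage runs only on candidates and carries the burden of suppressing false wake-ups.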