In a series of articles, iVinci describes the technical challenges of developing the iVinci 1, the observation robot for capturing image, speech, action and trigger data. All data is enriched with machine learning and collected in a single data stream for analysis and reporting purposes.
In this episode we explain how we capture and enrich images of the participants in a group of people. We use front and side cameras and are able to identify, track and trace individuals throughout the room across multiple cameras.
The first challenge: how do we separate the various people in a group?
To start with, each participant introduces themselves to the system. With this 'frontal' image, that person can be ‘followed’, regardless of which camera is used and how many people are visible in the image.
In a future version we will record the vocal characteristics of each participant so that we can apply diarisation (‘which voice belongs to which person’) in a later phase. In this way we can also distinguish people on the basis of their voice.
Data flow per participant - mux-ing
Now that we have a solution for recording images and speech, we have to combine the various data flows for up to 6 different video and audio streams, preferably in such a way that we can analyze the data and trace back in time what happened when.
When tracking multiple participants, there are many data flows to be recorded. Think of speech, still pictures, moving pictures and data about what people do. Sometimes we can use these data in a combined format in order to analyze in detail what is happening (e.g. when people speak and how they look). We use the separate streams from the various recording units to analyze and enrich their data. In this way the most time-consuming stream does not hold up the other data flows.
iVinci has developed a uniform data structure that can be used for image, speech, and triggers or sensors. Where a combination is required, a multiplexer merges multiple streams into a single stream, with mux-tags identifying the source of each item. We call this 'mux-ing'. This allows the identification component to combine voice and image to find and capture an introduction moment. Mux-ing is useful for more than incoming data alone. If outgoing text and image are provided, mux-ing can be a way to present two different participants in a correctly synchronized way.
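A minimal sketch of what such a uniform structure and a mux-er could look like in Python. The names `Sample`, `Tagged` and `mux`, the field layout and the choice to merge on timestamp are assumptions made for illustration; they are not iVinci's actual format.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Sample:
    """One unit of data; the same shape for image, speech and trigger data."""
    timestamp: float  # capture time in seconds (assumes a shared clock)
    kind: str         # e.g. "image", "speech", "trigger"
    payload: bytes    # raw or enriched data


@dataclass(frozen=True)
class Tagged:
    """A sample after mux-ing: the mux-tag names the stream it came from."""
    mux_tag: str
    sample: Sample


def _tag(tag: str, stream: Iterable[Sample]) -> Iterator[Tagged]:
    # Attach the source tag to every sample of one stream.
    for sample in stream:
        yield Tagged(tag, sample)


def mux(streams: dict[str, Iterable[Sample]]) -> Iterator[Tagged]:
    """Merge several time-ordered streams into one time-ordered stream."""
    tagged = [_tag(tag, stream) for tag, stream in streams.items()]
    return heapq.merge(*tagged, key=lambda t: t.sample.timestamp)


# Hypothetical input: two image frames and one speech chunk.
camera = [Sample(0.00, "image", b"frame-0"), Sample(0.04, "image", b"frame-1")]
microphone = [Sample(0.02, "speech", b"chunk-0")]

combined = list(mux({"cam-front": camera, "mic-1": microphone}))
for item in combined:
    print(item.mux_tag, item.sample.timestamp, item.sample.kind)
```

Because every item keeps its mux-tag and timestamp, a component behind the mux-er can still tell which camera or microphone an item came from and when it was captured.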
An example of using the mux-er is to capture several video inputs, run SIFT on each of them, and combine the results into one single data stream (SIFT, the scale-invariant feature transform, is an algorithm to detect and describe local features in images).
In this way, the components behind the mux-er have access to every image, whereas the components before the mux-er only have access to the images from their own camera. This is particularly important for compute-intensive components such as the SIFT-er. In the picture above, the three SIFT-ers run in parallel, so that one video stream does not have to wait for another.
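The arrangement with one SIFT-er per camera in front of the mux-er can be sketched roughly as follows. This is an illustration under assumptions, not iVinci's implementation: `describe` is a hypothetical stand-in for a real SIFT-er, the frames are plain strings, and threads are used only to keep the sketch short (for compute-heavy work, separate processes may be a better fit).

```python
import queue
import threading


def describe(frame: str) -> str:
    """Hypothetical stand-in for a SIFT-er; a real one would return features."""
    return f"features({frame})"


def sift_worker(mux_tag: str, frames: list[str], out: queue.Queue) -> None:
    """Process one camera's frames and hand the tagged results to the mux-er."""
    for frame in frames:
        out.put((mux_tag, describe(frame)))
    out.put((mux_tag, None))  # end-of-stream marker for this camera


# Hypothetical cameras delivering different numbers of frames.
cameras = {
    "cam-front": ["f0", "f1", "f2"],
    "cam-left": ["l0", "l1"],
    "cam-right": ["r0"],
}

muxed: queue.Queue = queue.Queue()  # the single stream behind the mux-er
workers = [
    threading.Thread(target=sift_worker, args=(tag, frames, muxed))
    for tag, frames in cameras.items()
]
for worker in workers:
    worker.start()

# A component behind the mux-er sees the results from every camera.
results = []
open_streams = len(workers)
while open_streams:
    tag, features = muxed.get()
    if features is None:
        open_streams -= 1
    else:
        results.append((tag, features))

for worker in workers:
    worker.join()
print(results)
```

The order in which the cameras interleave in `results` can differ from run to run; only the order within one camera is fixed.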
Without mux, the picture would have looked like this. Not only does the SIFT process have to handle all the images one after the other, the images may also come in faster than the SIFT-er can process them. In addition, the vcap component has to read from the cameras in a non-blocking, round-robin way, which may not match the different speeds at which the cameras deliver their images.
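To make the round-robin problem concrete, here is a small sketch of non-blocking polling. It is not the vcap component itself: `poll_round_robin` and the two queues standing in for camera buffers are hypothetical, and the frames are plain strings.

```python
import queue


def poll_round_robin(cameras: dict[str, queue.Queue]) -> list[tuple[str, str]]:
    """One pass over all cameras, taking at most one waiting frame from each."""
    frames = []
    for name, buffer in cameras.items():
        try:
            frames.append((name, buffer.get_nowait()))  # never blocks
        except queue.Empty:
            pass  # nothing ready on this camera; move on to the next one
    return frames


# Hypothetical situation: one camera has delivered four frames, the other one.
fast, slow = queue.Queue(), queue.Queue()
for i in range(4):
    fast.put(f"fast-{i}")
slow.put("slow-0")

cameras = {"fast": fast, "slow": slow}
first_pass = poll_round_robin(cameras)
second_pass = poll_round_robin(cameras)
backlog = fast.qsize()  # frames of the fast camera that are still waiting
print(first_pass, second_pass, backlog)
```

Each pass takes at most one frame per camera, so the faster camera builds up a backlog while the reader keeps checking the slower one.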
In general, with multiple observation inputs, it is important to insert the mux as late as possible, preferably only at the point where the combination of data is needed. As late as possible means that more work runs in parallel and that differences in delivery speed do not let the slowest flow dictate the pace.
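A small back-of-the-envelope calculation shows why the position of the mux matters. All numbers below are assumed for illustration and are not iVinci measurements; a load above 1.0 means frames arrive faster than a SIFT-er can process them.

```python
# Assumed, illustrative numbers; none of them are iVinci measurements.
frame_rates = {"cam-front": 30.0, "cam-left": 15.0, "cam-right": 10.0}  # frames/s
sifter_capacity = 40.0  # frames/s that a single SIFT-er can process

# Early mux: all cameras share one SIFT-er, so it sees the sum of all rates.
early_load = sum(frame_rates.values()) / sifter_capacity

# Late mux: every camera has its own SIFT-er, so each sees only its own rate.
late_loads = {name: rate / sifter_capacity for name, rate in frame_rates.items()}

print(f"early mux, shared SIFT-er: load {early_load:.3f}")
for name, load in late_loads.items():
    print(f"late mux, SIFT-er for {name}: load {load:.3f}")
```

With these assumed numbers the shared SIFT-er is overloaded, while each of the three parallel SIFT-ers keeps up with its own camera.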
Now that we have given a sneak peek into the technicalities of capturing, enriching and mixing various data flows, we will focus next on order issues, pipe-lining and data streaming.
Supported by …
This innovation project is supported by the province of Utrecht, which has awarded subsidies to some of the most promising innovations. After this summer, we will share more information about the finished product, including an invitation to participate in various tests.
From mid-July we will publish a series of articles to share technical challenges we have encountered. We will pay attention to image, voice, trigger data and the correct combination of all data.
The next episode is about order issues, pipe-lining and data streaming.
Who we are…
At iVinci, we use open source components and like to share what we have discovered. At iVinci we work with the latest technology for audio and image processing. We develop techniques to be used effectively for natural human conversations.
For more information or a specific question, feel free to leave a message.