2010-02-18

Idea: parallax, 3D streaming video compression, and webcams

If your computer has two or more webcams pointing at you from opposite corners of the display, then it can do parallax calculations, just like your eyes and brain do, and by doing so construct a decently good depth-map 3D model of what it sees.  Most of the time what it would see would be your face in the near field, and whatever wall is behind you in the far field.
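
A minimal sketch of the geometry involved, assuming rectified cameras with a known baseline and focal length (the numbers and names here are purely illustrative):

```python
import numpy as np

def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Turn a per-pixel disparity map into a depth map.

    For two rectified cameras a distance baseline_m apart, a point that
    shifts disparity_px pixels between the two views lies at
    depth = focal_px * baseline_m / disparity_px.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)   # zero disparity = "at infinity"
    near = disparity > 0
    depth[near] = focal_px * baseline_m / disparity[near]
    return depth

# A face ~0.5 m away shifts far more between the views than a wall ~3 m away.
print(depth_from_disparity([[400.0, 66.7]], baseline_m=0.4, focal_px=500))
# -> approximately [[0.5, 3.0]]
```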

For sending streaming video for video conferencing, the simple and dumb thing to do would be to send a complete video stream from each camera.  That is wasteful.  Video compression works by looking for similarities between frames over time, and then sending a stream of transformations and differences.  An obvious technique for sending parallax video streams is for the video compressor to look at the multiple simultaneous frames, as well as over time, and send transformations and differences over space as well as over time.  I do not know whether any of the MPEG or other video stream compression standards can do this, but it will soon be necessary, especially now that the video entertainment industry is seriously talking about 3D again.
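
To make that concrete, here is a rough sketch (not any particular standard) of predicting one view from the other by shifting pixels according to disparity, so only a small residual needs to be coded:

```python
import numpy as np

def predict_right_from_left(left, disparity):
    """Predict the right-camera frame by sampling the left frame at a
    horizontal offset given by the per-pixel disparity."""
    h, w = left.shape
    cols = np.arange(w)
    predicted = np.empty_like(left)
    for row in range(h):
        shift = np.round(disparity[row]).astype(int)
        predicted[row] = left[row, np.clip(cols + shift, 0, w - 1)]
    return predicted

def encode_stereo_pair(left, right, disparity):
    """Send the left frame, the disparity map, and a residual that is
    mostly near zero (and therefore compresses well), instead of two
    full independent frames."""
    prediction = predict_right_from_left(left, disparity)
    residual = right.astype(np.int16) - prediction.astype(np.int16)
    return left, disparity, residual
```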

I'm sure there is a whole slew of obvious patents all claiming assorted variations of this simple idea.

A particularly simple and stupid way to do it would be to just feed the alternating frames, left, right, left, right, into the video compressor, and pretend that flicking back and forth in space is the same as motion through time.  While I'm sure that some useful level of compression would result, it would be extremely, shall we say, non-optimal.
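
A sketch of that naive interleaving, just to make the shape of the idea concrete (illustrative only, not a recommendation):

```python
def interleave(left_frames, right_frames):
    """Alternate L, R, L, R, ... so a stock temporal-only encoder treats
    the jump between viewpoints as if it were motion between frames."""
    for left, right in zip(left_frames, right_frames):
        yield left
        yield right
```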

Anyway, the math for doing the parallax calculation and the math for doing the video compression transformations and differences would be very similar.  A smart implementation could spit out a rough Z depth map as well as the compressed data stream.  In fact, it would make sense for the Z depth map to be sent as part of the compressed video data stream.

Doing all this with more than two cameras makes calculating the depth map even easier, plus each camera can have cheaper, lower-resolution sensors, which can be mathematically combined into a much higher resolution view.

Another useful trick for the webcam case would be to use the depth map to distinguish between the stuff in the near field (which probably is the face of a talking head) and the random background clutter, and either send the background video data at a much lower resolution, or maybe not send it at all.  This would save bandwidth, increase privacy, and improve the user experience.  After all, when you are on a video chat call with someone, you usually don't care at all what is behind them.  It carries no useful signal for the conversation.
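
A hedged sketch of that near-field filter, assuming a depth map is already available and picking an arbitrary one meter cutoff:

```python
import numpy as np

def suppress_background(frame, depth_map, near_limit_m=1.0):
    """Keep pixels closer than near_limit_m (presumably the talking head)
    and blank everything farther away before the frame hits the encoder."""
    near = depth_map < near_limit_m
    out = frame.copy()
    out[~near] = 0   # or downsample/blur the far field instead of zeroing it
    return out
```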

This same trick could be doable even without two cameras and parallax.  Facial recognition software is good enough that it could draw a good enough bounding curve around your face on a frame-by-frame basis, before sending the raw frames into the compressor.  This might even be a CPU time win, because the compressor doesn't have to spend any time looking at and compressing the background image.  (After talking to a friend, and doing some Google searches, it turns out this sort of thing is already an available feature in off-the-shelf consumer-grade webcams.)
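
For example, a face-box pre-filter along these lines could be built with OpenCV's stock Haar-cascade face detector (a rough sketch; the margin and detector parameters are arbitrary choices):

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_faces_only(frame_bgr, margin=40):
    """Blank out everything outside the detected face boxes (plus a margin)
    before the frame is handed to the video compressor."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for (x, y, w, h) in faces:
        cv2.rectangle(mask, (x - margin, y - margin),
                      (x + w + margin, y + h + margin), 255, thickness=-1)
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)
```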

Mixing together the idea of the depth map and facial recognition software has many other interesting implications.

Faces all have the same basic depth map.  The webcam video stream could basically say "this is a face; here are the transformations that turn a generic face into THIS face; here is the stream of transformations over time that is THIS face changing over time."  Then, interleaved with that, is a color image map of your "skinned" face, which the receiving side can then "wrap" over the 3D depth map.  And it can do such tricks as prioritizing the face's eyes, lips, and jaw, sending them at higher resolution (in both space and time), and maybe even doing some work to sync the mouth motions with the audio data.
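
One way to picture that data layout (the names and fields here are purely illustrative, not any real codec):

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class FaceFrame:
    """Small per-frame update: how THIS face is moving right now."""
    expression_params: np.ndarray                 # jaw, lips, eyes, brows, ...
    detail_patches: Optional[np.ndarray] = None   # optional hi-res eye/mouth regions
    audio_sync_offset_ms: float = 0.0             # nudge mouth motion to match audio

@dataclass
class FaceStream:
    """Sent once: the transformations that turn a generic face into THIS face,
    plus the color "skin" to wrap over it.  After that, only FaceFrames follow."""
    identity_deltas: np.ndarray                   # generic face mesh -> this face
    texture: np.ndarray                           # unwrapped color image of the face
    frames: List[FaceFrame] = field(default_factory=list)
```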

This makes other neat tricks possible as well.  By distorting the face model and the unwrapped color image in various ways, you could make yourself look thinner, fatter, a different gender, a different race, a fictional face (elf, dwarf, Na'vi, etc.), or even look like some other specific person.  This would be useful for amusement, entertainment, deception, privacy, and corporate branding.

Imagine calling up a corporate customer service rep on a video call, and no matter who you actually talk to, they all look like the same carefully branded inoffensive person.  Or possibly, based on your customer record and your call's geolocation data, the CSR face looks like the race/ethnicity of your local area.

Another useful shortcut would be that once someone's face specification is generated, it can be given a unique ID.  At some time in the future, in another video call, just the ID could be sent, and if the receiver has seen it before, the full face definition does not have to be sent.  This would work for more things than faces.  Whenever ANY definable object is seen and characterized, it gets an ID, and is opportunistically cached by the receiver.
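
A sketch of that opportunistic cache, using a content hash as the ID (the names are illustrative):

```python
import hashlib

def definition_id(definition: bytes) -> str:
    """Stable ID for a face (or any object) definition: a hash of its contents."""
    return hashlib.sha256(definition).hexdigest()

class DefinitionCache:
    """Receiver side: if we have seen this ID before, the sender can skip
    retransmitting the full definition and just reference it."""
    def __init__(self):
        self._store = {}

    def has(self, def_id: str) -> bool:
        return def_id in self._store

    def put(self, definition: bytes) -> str:
        def_id = definition_id(definition)
        self._store[def_id] = definition
        return def_id

    def get(self, def_id: str) -> bytes:
        return self._store[def_id]
```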

All this can also work with the other implications of linking facial recognition with augmented reality and lifestreaming that I've mused about in a previous post.

3 comments:

  1. Consider Pham Nuwen: bristly red hair, smoke-gray skin, singsong voice. If cues such as those were sent, like as not the display at Fleet Central would show something quite different from the human Kjet saw. "... wait a minute. That's not how evocations work. I'm sure they got a pretty clear
    view of you. See, a few high-resolution pics would get sent at the beginning of the session. Then those would be used as the base for the animation."

    Pham stared back lumpishly, almost as though he didn't buy it and was daring Kjet to think things through. Well damn it, the explanation was correct; there was no doubt that Limmende and Skrits had seen the redhead as a human. Yet there was something here that bothered Kjet ... Limmende and Skrits had both looked out of date.

    "Glimfrelle! Check the raw stream we got from Central. Did they send us any sync pictures?"

    It took Glimfrelle only seconds. He whistled a sharp tone of surprise. "No, Boss. And since it was all properly encrypted, our end just made do with old ad animation." He said something to Tirolle, and the two twittered rapidly. "Nothing seems to work down here. Maybe this is just another bug." But Glimfrelle didn't sound very confident of the assertion.

    -- "A Fire Upon the Deep", Vernor Vinge.

  2. I'm a research assistant at a lab at the University of Virginia, and one of the things that we are studying is parallax in web conferences; however, we take on the challenge with a rather unorthodox approach. By taking a few frames of footage from a brief face-calibration video, we create 3D models of individuals' faces and run these avatars in conversation in place of "actual" video feeds. What is transmitted in conversation are the mathematical motions of the individuals' faces for the avatars to mimic, and the result is a conversation that feels more real and less choppy than the standard parallaxless video conversations that are currently being used by the mainstream.

    Here's a brief video demonstration on our website: http://faculty.virginia.edu/humandynamicslab/projects/video-conferencing/

  3. ah, had not read the second half of your post, but yes, that's the gist of what we're doing.
