If your computer has two or more webcams pointing at you from opposite corners of the display, it can do parallax calculations, just like your eyes and brain do, and from them construct a decently good depth-mapped 3D model of what it sees. Most of the time that would be your face in the near field, and whatever wall is behind you in the far field.
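Roughly, the parallax step could look like this with off-the-shelf stereo matching. The file names, focal length, and baseline below are invented for illustration; a real setup would get the geometry from calibration.

```python
# Sketch: turn two webcam views into a depth map with OpenCV block matching.
import cv2
import numpy as np

FOCAL_LENGTH_PX = 700.0   # assumed focal length in pixels (hypothetical)
BASELINE_M = 0.35         # assumed distance between the two corner cameras

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching finds, for each patch in one view, the horizontal shift
# (disparity) of the best-matching patch in the other view.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Larger disparity = closer object. Depth falls out of similar triangles:
# Z = f * B / d. Guard against zero/invalid disparities.
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = FOCAL_LENGTH_PX * BASELINE_M / disparity[valid]
```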
For video conferencing, the simple and dumb thing to do would be to send a complete video stream from each camera. That would be wasteful. Video compression works by looking for similarities between frames over time and then sending a stream of transformations and differences. An obvious technique for parallax video streams is for the compressor to look across the multiple simultaneous frames as well as across time, and send transformations and differences over space as well as over time. I do not know whether any of the MPEG or other video compression standards can do this, but it will soon be necessary, especially now that the video entertainment industry is seriously talking again about 3D.
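To make the "differences over space" idea concrete, here is a toy sketch, not anything a real codec's bitstream actually looks like: encode one view fully, predict the other view from it, and send only the residual.

```python
# Toy inter-view prediction: the right view is predicted from the left view
# and only the leftover difference would need to be sent.
import numpy as np

def predict_right_from_left(left: np.ndarray, disparity_px: int) -> np.ndarray:
    """Crude prediction: shift the left view by one global disparity.
    A real encoder would do per-block disparity compensation."""
    return np.roll(left, -disparity_px, axis=1)

left_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # stand-in frame
right_frame = np.roll(left_frame, -8, axis=1)                       # fake parallax shift

prediction = predict_right_from_left(left_frame, disparity_px=8)
residual = right_frame.astype(np.int16) - prediction.astype(np.int16)

# The residual is mostly zeros, so it compresses far better than a second
# full frame -- which is the whole point of predicting across views.
print("nonzero residual fraction:", np.count_nonzero(residual) / residual.size)
```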
I'm sure there are a whole slew of obvious patents all claiming assorted variations of this simple idea.
A particularly simple and stupid way to do it would be to just feed the alternating frames, left, right, left, right, into the video compressor, and pretend that flicking back and forth in space is the same as motion through time. While I'm sure that some useful level of compression would result, it would be extremely, shall we say, non-optimal.
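In code, the simple and stupid version is just an interleaver sitting in front of whatever encoder you already have (the `encoder` object below is a made-up stand-in, not a real API):

```python
# Interleave the two views and let a plain temporal encoder treat the
# left/right flicking as if it were motion. Sketch only.
def interleave(left_frames, right_frames):
    for l, r in zip(left_frames, right_frames):
        yield l
        yield r

# for frame in interleave(left_cam, right_cam):
#     encoder.feed(frame)   # hypothetical encoder interface
```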
Anyway, the math for doing the parallax calculation and the math for doing the video compression transformations and differences would be very similar. A smart implementation could spit out a rough Z depth map as well as the compressed data stream. In fact, it would make sense for the Z depth map to be sent as part of the compressed video data stream.
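Here is a crude sketch of that sharing: a single brute-force per-block search that yields both a residual for the compressor and a coarse disparity map a depth map could be built from. Purely illustrative and slow, and the search direction depends on which camera is which.

```python
# One block search, two outputs: a disparity map (for depth) and a residual
# (for compression). Assumes the right view's content appears shifted left.
import numpy as np

def block_match(left: np.ndarray, right: np.ndarray, block: int = 16, max_disp: int = 32):
    h, w = left.shape
    disp_map = np.zeros((h // block, w // block), dtype=np.int32)
    residual = np.zeros_like(left, dtype=np.int16)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = right[by:by + block, bx:bx + block].astype(np.int16)
            best_d, best_err = 0, None
            for d in range(0, min(max_disp, w - block - bx) + 1):
                cand = left[by:by + block, bx + d:bx + d + block].astype(np.int16)
                err = np.abs(target - cand).sum()
                if best_err is None or err < best_err:
                    best_d, best_err = d, err
            disp_map[by // block, bx // block] = best_d              # feeds the depth map
            cand = left[by:by + block, bx + best_d:bx + best_d + block].astype(np.int16)
            residual[by:by + block, bx:bx + block] = target - cand   # feeds the compressor
    return disp_map, residual
```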
Doing all this with more than two cameras makes calculating the depth map even easier, plus each camera can have a cheaper, lower-resolution sensor, and the views can be mathematically combined into a much higher resolution image.
Another useful trick for the webcam case would be to use the depth map to distinguish between the stuff in the near field (which is probably the face of a talking head) and the random background clutter, and either send the background video data at a much lower resolution, or maybe not send it at all. This would save bandwidth, increase privacy, and improve the user experience. After all, when you are on a video chat call with someone, you usually don't care at all what is behind them. It carries no useful signal for the conversation.
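A sketch of that masking step, assuming you already have the per-pixel depth map from the stereo step; the 1.2 meter "talking head" cutoff is an arbitrary number:

```python
# Throw away (or degrade) everything beyond the near field before encoding.
import numpy as np

NEAR_FIELD_M = 1.2  # arbitrary "talking head" distance cutoff

def suppress_background(frame: np.ndarray, depth_m: np.ndarray) -> np.ndarray:
    """Keep pixels closer than the cutoff; blank everything else so the
    encoder spends no bits (and the viewer sees no clutter) behind you."""
    mask = depth_m < NEAR_FIELD_M
    out = np.zeros_like(frame)
    out[mask] = frame[mask]
    return out
```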
This same trick could be doable even without two cameras and parallax. Facial recognition software is good enough that it could draw a good enough bounding curve around your face on a frame-by-frame basis, before sending the raw frames into the compressor. This might even be a CPU time win, because the compressor doesn't have to spend any time looking at and compressing the background image. (After talking to a friend, and doing some Google searches, it turns out this sort of thing is already an available feature in off-the-shelf consumer-grade webcams.)
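Something like this, using OpenCV's stock face detector as the "good enough" bounding curve; the padding fraction is an arbitrary choice:

```python
# Detect the face, keep a padded box around it, zero out the rest,
# and only then hand the frame to the compressor.
import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_faces_only(frame_bgr: np.ndarray, pad: float = 0.3) -> np.ndarray:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    out = np.zeros_like(frame_bgr)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        px, py = int(w * pad), int(h * pad)
        x0, y0 = max(0, x - px), max(0, y - py)
        x1, y1 = x + w + px, y + h + py
        out[y0:y1, x0:x1] = frame_bgr[y0:y1, x0:x1]  # copy face region, leave rest black
    return out
```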
Mixing together the idea of the depth map and facial recognition software has many other interesting implications.
Faces all have the same basic depth map. The webcam video stream could basically say "this is a face; here are the transformations that turn a generic face into THIS face; here is the stream of transformations over time that are THIS face changing over time." Interleaved with that is a color image map of your "skinned" face, which the receiving side can "wrap" over the 3D depth map. It can also do such tricks as prioritizing the face's eyes, lips, and jaw, rendering them at higher resolution (in both space and time), and maybe even doing some work to sync the mouth motions with the audio data.
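In the spirit of a morphable or blendshape face model, the receiving side might look something like this; the mesh size and basis shapes are invented placeholders, not a real scanned face basis:

```python
# Rebuild the 3D face from a few dozen coefficients per frame instead of a
# full depth map: a generic base mesh plus identity and expression offsets.
import numpy as np

N_VERTS = 5000
generic_face = np.zeros((N_VERTS, 3))                      # stand-in for the shared base mesh
identity_basis = np.random.randn(40, N_VERTS, 3) * 1e-3    # "turn a generic face into THIS face"
expression_basis = np.random.randn(20, N_VERTS, 3) * 1e-3  # "THIS face changing over time"

def reconstruct(identity_coeffs, expression_coeffs):
    face = generic_face.copy()
    face += np.tensordot(identity_coeffs, identity_basis, axes=1)      # sent once per caller
    face += np.tensordot(expression_coeffs, expression_basis, axes=1)  # sent every frame
    return face

identity = np.random.randn(40)    # would be sent once, at call start
expression = np.random.randn(20)  # would be sent per frame
this_face_now = reconstruct(identity, expression)
```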
This makes other neat tricks possible as well. By distorting the face model and the unwrapped color image in various ways, you could make yourself look thinner, fatter, a different gender, a different race, a fictional face (elf, dwarf, Na'vi, etc.), or even look like some other specific person. This would be useful for amusement, entertainment, deception, privacy, and corporate branding.
Imagine calling up a corporate customer service rep on a video call, and no matter who you actually talk to, they all look like the same carefully branded, inoffensive person. Or even, based on your customer record and the call's geolocation data, the CSR's face could match the race/ethnicity of your local area.
Another useful shortcut would be that once someone's face specification is generated, it can be given a unique ID. At some time in the future, in another video call, just the ID could be sent, and if the receiver has seen it before, the full face definition does not have to be sent. This would work for more things than faces. Whenever ANY definable object is seen and characterized, it gets an ID and is opportunistically cached by the receiver.
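A minimal sketch of that opportunistic caching, with a made-up message format and the handwave that the sender knows what the receiver has cached (in practice it would track acknowledgements):

```python
# Content-addressed IDs for face/object definitions: send the full definition
# on first sighting, just the ID thereafter.
import hashlib

class DefinitionCache:
    def __init__(self):
        self._store = {}

    def put(self, definition: bytes) -> str:
        obj_id = hashlib.sha256(definition).hexdigest()
        self._store[obj_id] = definition
        return obj_id

    def has(self, obj_id: str) -> bool:
        return obj_id in self._store

def make_message(definition: bytes, receiver_cache: DefinitionCache) -> dict:
    obj_id = hashlib.sha256(definition).hexdigest()
    if receiver_cache.has(obj_id):
        return {"type": "ref", "id": obj_id}                    # receiver already has it
    return {"type": "full", "id": obj_id, "body": definition}   # first sighting: send it all
```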
All this also can work with the other implications of linking facial recognition with augmented reality and lifestreaming that I've mused about in a previous post.