If you shot a burst of images, one per flash, with each flash firing in turn, and then post-processed and composited them, edge detection would be trivial, even for objects that are the same color.
If the sensors were greyscale, and you created a single image that was darker where the N images differed and lighter where they agreed, you would end up with a very detailed and accurate line drawing of whatever the pictures were of.
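A minimal sketch of that darker-where-different compositing, in NumPy. The function name and the choice of max-minus-min range as the per-pixel disagreement measure are my own; any measure of how much the N flash images disagree at a pixel would serve.

```python
import numpy as np

def multiflash_edge_map(frames):
    """Composite N greyscale frames (one per flash position) into a
    line-drawing-style edge map.

    frames: array-like of shape (N, H, W), values in [0, 1].
    Pixels where the frames disagree (shadows shift with the flash
    position) come out dark; pixels where all frames agree come out light.
    """
    stack = np.asarray(frames, dtype=np.float64)
    # Per-pixel disagreement: range of values across the N flash images.
    disagreement = stack.max(axis=0) - stack.min(axis=0)
    # Invert so agreement -> white (1.0), disagreement -> black.
    return 1.0 - disagreement

# Tiny synthetic example: a flat bright surface whose shadow falls on a
# different side in each of two flash images.
a = np.ones((4, 4)); a[:, 0] = 0.2   # shadow on the left column
b = np.ones((4, 4)); b[:, 3] = 0.2   # shadow on the right column
edges = multiflash_edge_map([a, b])
# Columns 0 and 3 come out dark (edge), the middle stays light.
```

The shadowed columns differ between the two frames, so they register as edges; the uniformly lit interior cancels out, which is why same-colored objects still separate cleanly.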
If you shot continuously, as a video stream, while moving the camera around an object, and then post-processed the stream, either offline or in real time, you could generate a very good 3D model of the object, with good texture maps of its surfaces.
For machine vision doing facial recognition, it will be easier both to recognize faces in general and to determine whose face is in frame, especially with a video stream, which would allow the software to assemble a higher-resolution composite from the lower-resolution video frames and to build a 3D model of the face as the person moves relative to the lens.
This can be built with existing digital camera components today. It doesn't even need a really high megapixel sensor.