2022-10-14 10:29:59
Where does the Pixel camera's black technology come from? A technical deep dive
Even though multi-lens phones are now in vogue, Google keeps showing what a single lens can do on Pixel phones, using hardware + software + AI to deliver features such as HDR+, depth-of-field portrait mode, Super Res Zoom and night vision mode that rivals can only match with multiple cameras. Curious about the technology behind them? Google invited distinguished engineer Marc Levoy to Taiwan to explain them one by one. Some of this has been published on the Google AI Blog and some was covered in earlier interviews; this article combines his talk with those posts and interviews to share with you.
Marc Levoy is a professor emeritus of computer science at Stanford University (a chair endowed by the founders of VMware) and a Distinguished Engineer at Google. At Stanford he taught computer graphics, digital photography, and the science of art, and at Google he leads the team responsible for the Pixel phone's HDR+ mode, portrait mode, night vision mode, the Jump light-field camera, and the Halide image-processing language. He has done considerable research in the field of computational photography.
Marc Levoy first shared the trends in phone cameras: the sensors and lenses keep getting better, apertures keep getting larger, and multiple lenses have become mainstream. But what most people may not have noticed is that the focus of mobile photography is shifting from hardware to software.
That includes computational photography that merges bursts of frames into a single image, and bringing machine learning into the imaging pipeline, where more and better training data keeps improving the resulting quality.
Marc Levoy also mentioned that a good phone camera program should abide by several principles:
Execute quickly:
For example, the viewfinder should refresh at more than 15 fps, shutter lag should be under 150 ms, a photo should finish processing within 5 seconds of being taken, and the phone shouldn't get hot in the process.
The default mode must never fail:
Everything must be reliable: exposure, focus and auto white balance should never miss, and ghosting or afterimages must be avoided.
Pay attention to special situations users run into when shooting:
Difficult scenes are unavoidable, and the default mode may shoot poorly or make detection errors. For example, Marc Levoy said their face detection has an accuracy of about 96%, but people wearing dark sunglasses were sometimes detected incorrectly, which led to inaccurate metering; they later fixed this with machine learning and better training data.
In special modes, though, the occasional humorous accident is acceptable. For example, if a dog wanders around while you shoot a panorama, you may end up with a "dachshund" like this...
Because phone sensors are small and most phones are handheld, they face several major problems: noise in low light, hand shake, and limited dynamic range. Many of the algorithms below exist to solve exactly these problems.
Computational photography brings four features to the Pixel phone. They can be a bit brain-melting, so feel free to jump straight to each section, take a break after each one, and don't read them all in one go...
HDR+
Portrait mode
High resolution zoom
night vision mode
HDR+
Everyone should be clear on the principle of HDR by now: the typical method is to bracket exposures, shooting several frames at different brightness. Short exposures keep the highlights, long exposures keep the shadows, and the detailed parts of each are combined into the final image.
The problem is that when shot handheld, the frames can be hard to align during compositing because of hand shake or subject movement.
Moreover, if a white area of the picture is overexposed, the alignment math goes wrong and the final image can show ghosting, double images and other artifacts.
Google's solution is to shoot a burst of short exposures and keep every exposure the same as the first. These frames retain highlight detail, and because the exposures are identical, they are much easier to align when stacked.
But you may ask: with short exposures, won't the dark parts be too dark?
HDR+ uses tone mapping to brighten the dark parts and pull down the bright parts, sacrificing a little global tone and contrast to preserve local contrast, so both the bright and dark areas of the picture show detail. (Left: single shot; right: HDR+)
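As a rough illustration of the merge-then-tone-map idea (not Google's actual pipeline), here is a minimal Python sketch: it averages a burst of identical short exposures that are assumed to be already aligned, then lifts the shadows with a simple global gamma curve. The real HDR+ does tile-based alignment and uses a local tone map.

```python
import numpy as np

def merge_and_tonemap(frames, gamma=2.2):
    # frames: list of aligned HxW (or HxWx3) linear images, all shot with the
    # same short exposure
    merged = np.mean(frames, axis=0)            # averaging the burst suppresses noise
    # a simple global gamma lifts shadows far more than highlights; the real
    # HDR+ uses a local tone map that trades global contrast for local contrast
    return np.clip(merged, 0.0, 1.0) ** (1.0 / gamma)
```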
Wouldn't boosting the dark parts reveal a lot of noise?
There is a rule of thumb that the signal-to-noise ratio (SNR) is proportional to the square root of the number of frames averaged: the more frames in the burst, the higher the SNR and the lower the noise. After a burst there are enough frames to merge, and the noise in the dark areas drops.
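A quick numerical check of that rule of thumb: averaging N frames of the same scene raises the SNR by roughly the square root of N. The signal and noise levels below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = 0.2                                   # a dim, constant scene value (illustrative)
for n in (1, 4, 9, 16):
    frames = signal + rng.normal(0.0, 0.05, size=(n, 100_000))  # n noisy exposures
    merged = frames.mean(axis=0)               # burst average
    snr = signal / merged.std()
    print(f"{n:2d} frames: SNR ~ {snr:.1f}")   # grows roughly as sqrt(n): 4, 8, 12, 16
```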
Because multiple frames are stacked for noise reduction, low-light image quality also improves. (Left: single shot; right: HDR+)
Portrait mode
Portrait mode on today's phones is a computed "synthetic depth of field".
Unlike a DSLR, a phone's aperture is fixed and its lens is a wide angle, so unless you shoot a close-up it is hard to produce shallow depth of field. However, if you know the distance from the camera to every point in the scene, you have the basic ingredients to separate foreground from background; then, through computation, the background is blurred to achieve "synthetic depth of field".
So how does a phone measure the distance to every point in the scene?
The most common method is to use two lenses to capture two images from slightly different viewpoints at a similar focus, match pixels between the two images, and compute depth by triangulation; this is the so-called "stereo algorithm". Points on the focal plane are kept sharp, while points off the plane get a blur that averages the colours of neighbouring pixels, with the strength depending on how far the point is from the focal plane. The software can even control the shape of the bokeh, and although it is computed, to most people it can be made to look the same as optical shallow depth of field.
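To make the idea concrete, here is a minimal Python/NumPy/SciPy sketch of depth-dependent blur: it cheaply cross-fades between the sharp image and a single blurred copy, weighted by each pixel's distance from the focal plane. The `synthetic_bokeh` name and single-radius blur are illustrative simplifications, not how the Pixel actually renders bokeh (a real renderer varies the blur kernel per pixel and can shape the bokeh).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthetic_bokeh(img, depth, focus_depth, max_sigma=8.0):
    # img: HxWx3 float image in [0, 1]; depth: HxW map in the same units as focus_depth
    spread = depth.max() - depth.min() + 1e-6
    blur_amount = np.clip(np.abs(depth - focus_depth) / spread, 0.0, 1.0)
    # one fully blurred copy; real renderers vary the kernel per pixel
    blurred = np.stack(
        [gaussian_filter(img[..., c], max_sigma) for c in range(3)], axis=-1
    )
    w = blur_amount[..., None]          # 0 = keep sharp, 1 = fully blurred
    return (1 - w) * img + w * blurred
```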
That is the basic concept behind depth-of-field photos on phones, but Pixel phones have only a single lens. How do they measure distance/depth and compute the depth of field? Here is a closer look at how it is done on the Pixel 2 and Pixel 3. There are basically four steps.
Step 1: Take an HDR+ photo first
Portrait mode photos start from an all-in-focus, sharp HDR+ image. As mentioned earlier, HDR+ gives high dynamic range, low noise and clear detail. Once you have the HDR+ photo, you can move to the next step.
Step 2: Distinguish the foreground and background
Once you have an HDR+ photo, the next step is to decide which pixels are foreground and which are background, and this is where machine learning comes in.
Take, for example, the person in the foreground holding a coffee mug.
Google trained a convolutional neural network (CNN) written in TensorFlow to determine which pixels belong to a person and which do not. The early stages pick out simple features such as colour and edges, and later stages recognise faces and bodies. Combining the stages matters, because the network must not only detect that there is a person in the picture but also decide exactly which pixels belong to that person.
Google used more than a million photos of people and their accessories to train the model, including people wearing hats, eating ice-cream cones and so on, so the network also recognises that the cup in someone's hand is part of the foreground.
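For a sense of what such a network looks like, here is a heavily simplified sketch in tf.keras. The layer sizes, dilation rates and the `tiny_segmenter` name are invented for illustration and bear no relation to Google's actual architecture; it only shows the pattern of early low-level filters feeding wider-context layers and a per-pixel person/background output.

```python
import tensorflow as tf

def tiny_segmenter(input_shape=(256, 256, 3)):
    # hypothetical toy network: NOT Google's portrait segmentation model
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)  # colours/edges
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               dilation_rate=2)(x)                                # wider context
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               dilation_rate=4)(x)                                # face/body scale
    mask = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)                  # per-pixel person probability
    return tf.keras.Model(inputs, mask)

model = tiny_segmenter()
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(photos, person_masks, ...)  # trained on annotated photos of people
```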
This is the mask after the neural network has separated the person from the background; an edge-aware bilateral solver is then applied to make the edges of the segmentation sharper.
Step 3: Calculate the depth map
As mentioned, the Pixel uses machine learning to separate people from the background, but what about shots of objects?
The segmentation network only knows about people, so Google uses dual-pixel technology (i.e. PDAF phase-detection autofocus) to estimate a depth map. This both works around the Pixel's single lens and lets it distinguish which objects are foreground and which are background.
PDAF pixels capture, in a single shot, views through the left and right halves of the lens (or the top and bottom halves if you hold the phone upright). This gives a very small baseline of about 1 mm, enough to produce a slight stereo effect and provide clues for estimating scene depth.
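Here is a toy sketch of the underlying idea: given the two PDAF half-images (the views through the left and right halves of the lens), estimate a per-tile disparity by testing a few horizontal shifts. The tile size, search range and the `pdaf_disparity` name are illustrative; the real pipeline works at sub-pixel precision and regularises the result heavily.

```python
import numpy as np

def pdaf_disparity(left, right, tile=16, max_shift=3):
    # left, right: HxW grayscale views from the two halves of each dual pixel
    h, w = left.shape
    disp = np.zeros((h // tile, w // tile))
    for ty in range(h // tile):
        for tx in range(w // tile):
            ref = left[ty*tile:(ty+1)*tile, tx*tile:(tx+1)*tile]
            best, best_err = 0, np.inf
            for s in range(-max_shift, max_shift + 1):
                cand = np.roll(right, s, axis=1)[ty*tile:(ty+1)*tile, tx*tile:(tx+1)*tile]
                err = np.mean((ref - cand) ** 2)     # how well the shifted views match
                if err < best_err:
                    best, best_err = s, err
            disp[ty, tx] = best                      # larger |shift| => closer to the camera
    return disp
```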
Step 4: Combine the above segmentation map and depth map to render the finished product
Finally, the PDAF images are fed into the trained convolutional neural network and combined with the machine-learning segmentation, so the subject stays sharp while the background is blurred in proportion to its distance.
A professional camera produces progressive bokeh according to distance from the focal plane; in reality, the nose may be slightly blurred when the eyes are in focus. The machine-learned segmentation keeps the whole subject uniformly sharp, which may not match physical reality, but it makes it easier for ordinary people to shoot photos with natural-looking bokeh.
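Putting the pieces together, here is a short sketch of the final compositing step, reusing the hypothetical `synthetic_bokeh` helper from the earlier sketch: pixels the segmentation network marks as the subject stay sharp, and everything else gets the depth-dependent blur. This is a simplification of the rendering step, not Google's actual renderer.

```python
def render_portrait(img, depth, person_mask, focus_depth):
    # person_mask: HxW in [0, 1] from the segmentation network; depth: HxW from PDAF
    blurred = synthetic_bokeh(img, depth, focus_depth)    # helper from the earlier sketch
    m = person_mask[..., None]
    return m * img + (1 - m) * blurred                    # subject stays uniformly sharp
```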
Training a neural network to interpret PDAF images sounds simple when stated like that, but the training process is not, and Google further optimised the model for the Pixel 3.
To teach the model depth from PDAF images, a large number of them are needed. Google built a rig of five Pixel 3 phones ("five eyes"?), controlled over Wi-Fi so they could shoot in sync, and captured thousands of pictures. The synchronised photos provide the ground truth used to train the dual-pixel model to produce better-quality depth maps.
For example, horizontal lines parallel to the baseline used to be skipped when computing depth, because the disparity along them is hard to detect, so they were left unblurred; the new model and algorithm on the Pixel 3 improve this.
So that users don't have to wait too long after shooting, Google runs the model with TensorFlow Lite, a framework for mobile and embedded devices, together with the Pixel 3's more powerful GPU to keep it fast.
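For reference, this is roughly what running a converted model with the TensorFlow Lite Python interpreter looks like; the model file name, input shape and dtype here are assumptions for illustration, not Google's shipped portrait model.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="segmenter.tflite")  # hypothetical model file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

frame = np.zeros(inp["shape"], dtype=np.float32)   # stand-in for a preprocessed camera frame
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
mask = interpreter.get_tensor(out["index"])        # e.g. a per-pixel foreground probability
```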
But the front camera doesn't have dual pixels, so how can it still take depth-of-field photos?
The front camera doesn't compute a depth map from dual pixels; it relies purely on machine learning to separate people from the background. (You can verify this: shooting a non-human subject with the front camera in portrait mode produces no background blur.)
High resolution zoom
Google introduced the "Super Res Zoom" high-resolution zoom feature on the Pixel 3. While everyone else adds a telephoto lens for clear zoom, Google wanted to try doing the same thing with a single lens and computational photography.
The result is the high-resolution zoom on the Pixel 3. There is no artificial intelligence or machine learning behind it; instead, multiple frames are merged into a higher-resolution photo to improve detail, and the result is comparable to many phones equipped with a 2x optical zoom lens.
Digital zoom that simply enlarges part of an image inevitably loses detail, so the key challenge for high-resolution zoom is reconstructing good, clear detail.
You may have heard that today's sensors use a Bayer colour filter array: each pixel does not record all three primary colours; instead, pixels are arranged in 2x2 groups with two green, one red and one blue.
The missing colour values (the question marks in the picture below) therefore have to be reconstructed, a process called "demosaicing". Marc Levoy told us that as much as two thirds of the colour information in an image may be reconstructed through interpolation.
So when the per-pixel information is incomplete, the reconstruction quality suffers. A phone's sensor is small and records less detail per pixel than a DSLR, and if already-interpolated pixels are interpolated again during digital zoom, the quality only gets worse and the details get muddier.
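To make "demosaicing" concrete, here is a minimal bilinear demosaicing sketch for an RGGB Bayer mosaic, just to show how much of the final colour is interpolated rather than measured; real ISPs use far more sophisticated filters than this.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_rggb(raw):
    # raw: HxW Bayer mosaic with an RGGB layout
    h, w = raw.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1     # red sample sites
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1     # blue sample sites
    g_mask = 1 - r_mask - b_mask                          # green sample sites (half of all pixels)

    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0   # bilinear kernel for sparse R/B
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0   # bilinear kernel for denser G

    r = convolve(raw * r_mask, k_rb)   # interpolate the 3/4 of red values never measured
    g = convolve(raw * g_mask, k_g)    # interpolate the 1/2 of green values never measured
    b = convolve(raw * b_mask, k_rb)   # interpolate the 3/4 of blue values never measured
    return np.dstack([r, g, b])
```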
This is where multi-frame merging comes to the rescue.
As mentioned earlier, HDR+ uses multiple frames to produce a photo with good detail in both highlights and shadows; high-resolution zoom uses the same principle.
When we shoot handheld, even if we don't notice it, there is always slight shake, so each frame in the burst lands in a slightly different position. These positional differences can fill in the missing pixel values, so there is no need to demosaic, i.e. to reconstruct the missing colours by guesswork.
The ideal case looks like this: four frames offset from each other by exactly one pixel horizontally and vertically fill in all the missing information once merged.
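Here is a toy sketch of that ideal case: if the integer offsets of each raw frame are known, every sensor site eventually receives a real red, green and blue measurement, and no demosaicing guesswork is needed. The wrap-around shifts and the `merge_bayer_frames` name are simplifications for illustration; the real algorithm has to estimate sub-pixel motion from hand shake.

```python
import numpy as np

def merge_bayer_frames(frames, shifts):
    # frames: list of HxW raw RGGB mosaics; shifts: known integer (dy, dx)
    # offsets of each frame relative to the first
    h, w = frames[0].shape
    acc = np.zeros((h, w, 3))
    cnt = np.zeros((h, w, 3))
    chan = np.ones((h, w), dtype=int)          # colour measured at each site: default green
    chan[0::2, 0::2] = 0                       # red sites
    chan[1::2, 1::2] = 2                       # blue sites
    ys0, xs0 = np.mgrid[0:h, 0:w]
    for raw, (dy, dx) in zip(frames, shifts):
        ys = (ys0 + dy) % h                    # where this frame's samples land
        xs = (xs0 + dx) % w                    # (wrap-around keeps the toy simple)
        acc[ys, xs, chan] += raw
        cnt[ys, xs, chan] += 1
    measured = cnt > 0
    acc[measured] /= cnt[measured]
    return acc, measured                       # measured marks values that needed no interpolation
```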
In reality, of course, phones are handheld. When shooting with high-resolution zoom, small hand shake can be turned into an advantage for filling in missing pixels (large shake beyond what OIS can compensate for cannot be used). The burst frames are aligned against one reference frame, and after alignment a rough image is formed.
However, hand shake rarely produces such tidy offsets, so making single-lens high-resolution zoom work on a phone is still a challenge and needs purpose-built algorithms.
For example, each frame in the burst still contains noise, and the algorithm must identify it and denoise correctly. Moving objects in the scene (leaves blowing in the wind, people walking by, etc.) also hurt the quality of the alignment.
To merge the burst effectively, Google developed a burst-resolution enhancement mode on the Pixel 3 that detects which parts of the picture are edges and merges pixels along the direction of each edge, so the noise reduction doesn't rob the photo of sharpness; it is a trade-off between denoising and gaining detail and resolution.
night vision mode
Night vision mode (Night Sight) was originally designed to improve handheld, no-flash photos at 0.3 to 3 lux. 3 lux is roughly a sidewalk lit by street lamps at night; 0.3 lux is a room with the lights off, too dark to find your keys on the floor.
We know that when shooting handheld in low light, a longer shutter captures more light but suffers more hand shake; the Pixel again uses multi-frame merging to resolve this contradiction.
Handheld night vision mode uses the HDR+ techniques described earlier, and on the Pixel 3 it adds the high-resolution zoom technology to align and merge pixels and improve sharpness.
To cope with shake, the Pixel also uses motion metering to tell whether it is handheld or on a tripod. If the phone is detected to be stable, it takes fewer shots with a longer exposure each; if something in the scene is moving, it takes more shots with shorter exposures. According to the Google AI Blog, the number of shots and the exposure times depend on the Pixel model and on the amount of hand and scene motion, ranging between 15 shots at 1/15 second (or shorter) and a burst of 6 shots at 1 second each.
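A hedged sketch of that motion-metering logic: choose the frame count and per-frame exposure from how steady the phone is and whether the scene is moving. The two extreme settings come from the figures quoted above; the middle setting, the function name and the inputs are invented for illustration, and the blog notes the real values vary by Pixel model.

```python
def plan_night_sight_burst(handheld: bool, scene_motion: bool):
    """Return (number_of_frames, per_frame_exposure_seconds) -- illustrative only."""
    if not handheld and not scene_motion:
        return 6, 1.0          # tripod, static scene: few long exposures (~1 s each)
    if scene_motion:
        return 15, 1.0 / 15    # motion in the scene: many short exposures (<= 1/15 s)
    return 12, 1.0 / 8         # handheld but fairly steady: a made-up middle ground
```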
However, night vision mode also runs into white balance and colour distortion problems in low light.
Improve white balance in low light
People can still perceive colours correctly when looking at things under coloured lighting or through sunglasses, but when we take a picture under one kind of light and view it under another, the picture often looks tinted and it is hard to judge the object's actual colour. To correct this, cameras use automatic white balance, a partial or global colour-temperature correction that makes colours look as they would under natural (usually white) light.
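As a point of reference, here is the classic "gray world" auto white balance in a few lines of Python: assume the scene's average colour should be neutral and scale each channel accordingly. This is a textbook baseline, not Google's learned white balance, which replaces this kind of heuristic with a trained model.

```python
import numpy as np

def gray_world_awb(img):
    # img: HxWx3 linear RGB in [0, 1]
    means = img.reshape(-1, 3).mean(axis=0)          # average colour of the scene
    gains = means.mean() / np.maximum(means, 1e-6)   # push the average toward neutral gray
    return np.clip(img * gains, 0.0, 1.0)
```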
While auto white balance does a decent job outside night vision mode, it struggles to work out the colour of the light source in very dim or strongly coloured lighting.
To solve this, Google developed a learned white balance: a model trained to distinguish good white balance from bad. When a shot comes out with bad white balance, the algorithm suggests how to shift the hues toward something more natural.
To train the algorithm, Google shot a wide range of scenes with Pixel phones and then manually corrected the white balance while viewing the photos on colour-calibrated screens. The difference learned white balance makes in low light can be seen in the following example:
The left was taken with the Pixel 3 in default mode, the right with the Pixel 3 in night vision mode. In default mode the camera doesn't know how yellow the light in this seaside cabin is; on the right, the learned white balance restores a more natural result.
Rendering night-time tones correctly
As mentioned earlier, night vision mode is designed to improve handheld, no-flash photos at 0.3 to 3 lux.
But in very low light the human eye can't see the colours in a scene, so how should the camera capture it and present it attractively?
When shooting low-light night scenes, whether with a DSLR or a phone, long exposures or multi-frame merging can produce photos like this, with accurate colours, detail, moonlight shadows and even stars. The effect is impressive, but it can look too much like daytime, confusing viewers who don't realise it is meant to be a night scene, and it may not be what the photographer wanted.
Google took inspiration from classical painting: artists convey low-light scenes by enhancing contrast, surrounding the scene with darkness, and pushing shadows toward black. The night vision algorithm uses the same techniques to express low light. It is genuinely hard to capture brightness and detail while still letting viewers know the photo is a night scene, and this is a pretty good result.
Will something as good as night vision mode be opened up so that other phones can install it too?
On this question, Marc Levoy said the priority is currently on Pixel phones. Beyond the technology, there are many complicated factors in whether to open it up, and he could not comment on any future plans.
Some Q&A on computational photography
The computational photography techniques above apply to still photos. Marc Levoy said that using these software technologies for video would need better hardware acceleration, and it could be considered in the future. (The anti-shake algorithm, however, is already useful in the Pixel 3's video recording.)
Computational photography still faces limits. On the hardware side, every manufacturer faces the same ones: a longer focal length means a thicker body, and however innovative the engineering, these physical constraints can't be avoided. On the software side there are computing power and memory ceilings to break through; more advanced machine-learning models need more memory, and so on.
Marc Levoy also mentioned that although the computation keeps getting more complex, it means they can deliver better shooting features (and experience) through successive updates, and they hope each update can also reach earlier Pixel phones. That is an ongoing effort for the team, but the advantage is that more advanced features can roll out without waiting for new hardware.
Even a DSLR does some compositing and software processing internally, just not as comprehensively and deeply as modern phones do.
On all this algorithmic intervention during shooting, some people feel that real photography (recording an image) is the moment the lens and sensor capture it, and it shouldn't be modified too much;
others think software should restore what the eye saw but the lens couldn't record well (after all, many photographers still retouch before delivering the finished image);
and others are fine with using AI computation to make pictures more beautiful, stable and pleasing, and to work around the limits of phone hardware.
Not just you and me, every manufacturer has its own preferences and taste for computational photography and for how much the algorithms should intervene in the image. I don't think there is a single answer. What do you think?