Jon Barron
I'm a research scientist at Google DeepMind in San Francisco, where I lead a small team that mostly works on NeRF. At Google I've worked on Glass, Lens Blur, HDR+, VR, Portrait Mode, Portrait Light, Maps, and Shopping. I did my PhD at UC Berkeley, where I was advised by Jitendra Malik. I've received the PAMI Young Researcher Award.
Email / CV / Bio / Scholar / Twitter / Github
I'm interested in computer vision, deep learning, generative AI, and image processing. Most of my research is about inferring the physical world (shape, motion, color, light, etc.) from images, usually with radiance fields. Some papers are highlighted.
By training a latent diffusion model to directly output 3D Gaussians we enable fast (~6 seconds on a single GPU) feed-forward 3D scene generation.
Raytracing constant-density ellipsoids yields more accurate and flexible radiance fields than splatting Gaussians, and still runs in real-time.
An approach for turning a video into a 4D radiance field that can be rendered in real-time. When combined with a text-to-video model, this enables text-to-4D.
Images taken under extreme illumination variation can be made consistent with diffusion, and this enables high-quality 3D reconstruction.
Simulating the world with video models lets you make inconsistent captures consistent.
A slight tweak to the Box-Cox power transform generalizes a variety of curves, losses, kernel functions, probability distributions, bump functions, and neural network activation functions.
A single model built around diffusion and NeRF that does text-to-3D, image-to-3D, and few-view reconstruction, trains in 1 minute, and renders at 60FPS in a browser.
Carefully casting reflection rays lets us synthesize photorealistic specularities in real-world scenes.
A more physically-accurate inverse rendering system based on radiance caching for recovering geometry, materials, and lighting from RGB images of an object or scene.
Neural fields let you recover editable UV mappings for the challenging geometries produced by NeRF-like models.
Applying anti-aliasing to a discrete opacity grid lets you render a hard representation into a soft image, and this enables highly-detailed mesh recovery.
Distilling a Zip-NeRF into a tiled set of MERFs lets you fly through radiance fields on laptops and smartphones at 60 FPS.
Shadows cast by unobserved occluders provide a high-frequency cue for recovering illumination and materials.
Using a multi-image diffusion model as a regularizer lets you recover high-quality radiance fields from just a handful of images.
A class-agnostic inverse rendering solution for turning in-the-wild images of an object into a relightable 3D asset.
Parameter interpolation enables high-quality large-scale scene reconstruction and out-of-core training and rendering.
A survey of recent progress in diffusion models for images, videos, and 3D.
Preconditioning based on camera parameterization helps NeRF and camera extrinsics/intrinsics optimize better together.
Combining mip-NeRF 360 and grid-based models like Instant NGP lets us reduce error rates by 8%–77% and accelerate training by 24x.
Combining DreamBooth (personalized text-to-image) and DreamFusion (text-to-3D) yields high-quality, subject-specific 3D assets with text-driven modifications.
We use SDFs to bake a NeRF-like model into a high quality mesh and do real-time view synthesis.
We use volumetric rendering with a sparse 3D feature grid and 2D feature planes to do real-time view synthesis.
Accounting for misalignment due to scene motion or calibration errors improves NeRF reconstruction quality.
We optimize a NeRF from scratch using a pretrained text-to-image diffusion model to do text-to-3D generative modeling.
Training a diffusion model on grid-based NeRFs lets you (conditionally) sample NeRFs.
NeRF lets us synthesize novel orthographic views that work well with pixel-wise algorithms for robotic manipulation.
A joint optimization framework for estimating shape, BRDF, camera pose, and illumination from in-the-wild image collections.
Representing neural fields as a composition of manipulable and interpretable components lets you do things like reason about frequencies and scale.
We denoise images efficiently by predicting spatially-varying kernels at low resolution and using a fast fused op to jointly upsample and apply these kernels at full resolution.
NeRF works better than RGB-D cameras or multi-view stereo when learning object descriptors.
Explicitly modeling reflections in NeRF produces realistic shiny surfaces and accurate surface normals, and lets you edit materials.
mip-NeRF can be extended to produce realistic results on unbounded scenes.
Properly training NeRF on raw camera data enables HDR view synthesis and bokeh, and outperforms multi-image denoising.
Regularizing unseen views during optimization enables view synthesis from as few as 3 input images.
We can do city-scale reconstruction by training multiple NeRFs with millions of images.
Combining NeRF with pose estimation lets you use a monocular video to do free-viewpoint rendering of a human.
Incorporating lidar and explicitly modeling the sky lets you reconstruct urban environments.
Dense depth completion techniques applied to freely-available sparse stereo data can improve NeRF reconstructions in low-data regimes.
Supervising the CLIP embeddings of NeRF renderings lets you generate 3D objects from text prompts.
A survey of recent progress in neural rendering.
Replacing a costly illumination integral with a simple network query enables more accurate novel view synthesis and relighting compared to NeRD.
Applying ideas from level set methods to NeRF lets you represent scenes that deform and change shape.
By placing priors on illumination and materials, we can recover NeRF-like models of the intrinsics of a scene from a single multi-image capture.
VAEs can be used to disentangle a font's style from its content, and to generalize to characters that were never observed during training.
NeRF is aliased, but we can anti-alias it by casting cones and prefiltering the positional encoding function.
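One way to picture the prefiltering: instead of encoding a single point, encode a Gaussian (a mean and variance per coordinate, e.g. fit to a conical frustum) and use the expected value of each sine and cosine, so frequencies that vary quickly relative to the Gaussian's extent are attenuated. A minimal NumPy sketch of that idea, not the paper's exact parameterization:

```python
import numpy as np

def integrated_pos_enc(mean, var, num_freqs=10):
    """Encode per-coordinate Gaussians (mean, var of shape (..., 3)) into (..., 3 * 2 * num_freqs) features."""
    scales = 2.0 ** np.arange(num_freqs)            # frequencies 2^0 ... 2^(num_freqs - 1)
    m = mean[..., None, :] * scales[:, None]        # (..., num_freqs, 3) scaled means
    v = var[..., None, :] * scales[:, None] ** 2    # (..., num_freqs, 3) scaled variances
    damp = np.exp(-0.5 * v)                         # expected sin/cos under a Gaussian: high variance -> low weight
    feats = np.concatenate([damp * np.sin(m), damp * np.cos(m)], axis=-1)
    return feats.reshape(*mean.shape[:-1], -1)
```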
Baking a trained NeRF into a sparse voxel grid of colors and features lets you render it in real-time in your browser.
Building deformation fields into NeRF lets you capture non-rigid subjects, like people.
With some extra (unlabeled) test-set images, you can build a hypernetwork that calibrates itself at test time to previously-unseen cameras.
Multiplane images can be used to simultaneously deblur dual-pixel images, despite variable defocus due to depth variation in the scene.
A NeRF-like model that can decompose (and mesh) objects with non-Lambertian reflectances, complex geometry, and unknown illumination.
Simulating the optics of a camera's lens lets you train a model that removes lens flare from a single image.
Given an image of an object and a NeRF of that object, you can estimate that object's pose.
By learning how to pay attention to input images at render time, we can amortize inference for view synthesis and reduce error rates by 15%.
Using neural approximations of expensive visibility integrals lets you recover relightable NeRF-like models.
Using meta-learning to find weight initializations for coordinate-based MLPs allows them to converge faster and generalize better.
Letting NeRF reason about occluders and appearance variation produces photorealistic view synthesis using only unstructured internet photos.
Reflections and the things behind them often exhibit parallax, and this lets you remove reflections from stereo pairs.
Embedding a convnet within a predefined texture atlas enables simultaneous view synthesis and relighting.
Light stage scans are inherently aliased, but we can use learning to super-resolve them.
Composing neural networks with a simple Fourier feature mapping allows them to learn detailed high-frequency functions.
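To make the idea concrete, here is a minimal NumPy sketch of such a mapping: low-dimensional coordinates are projected through a fixed random Gaussian matrix and wrapped in sines and cosines, and the MLP is trained on these features instead of the raw coordinates. The dimensions and scale below are placeholder values, not settings from the paper:

```python
import numpy as np

d, num_feats, sigma = 2, 256, 10.0  # placeholder sizes; the feature scale is tuned per task in practice
B = np.random.default_rng(0).normal(0.0, sigma, size=(d, num_feats))  # fixed random projection

def fourier_features(v):
    """Map coordinates v of shape (..., d) to features of shape (..., 2 * num_feats)."""
    proj = 2.0 * np.pi * (v @ B)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)
```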
A simple and fast Bayesian algorithm that can be written in ~10 lines of code outperforms or matches giant CNNs on image binarization, and unifies three classic thresholding algorithms.
Extensive experimentation yields a simple optical flow technique that is trained on only unlabeled videos, but still works as well as supervised techniques.
Training a tiny non-convolutional neural network to reproduce a scene using volume rendering achieves photorealistic view synthesis.
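The volume rendering here composites the network's predicted densities and colors along each camera ray. A minimal NumPy sketch of that compositing step (the MLP that produces the densities and colors is omitted):

```python
import numpy as np

def composite_ray(sigmas, rgbs, t_vals):
    """sigmas: (N,) densities, rgbs: (N, 3) colors, t_vals: (N + 1,) sample boundaries along a ray."""
    deltas = np.diff(t_vals)                      # length of each sample interval
    alphas = 1.0 - np.exp(-sigmas * deltas)       # opacity of each interval
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])  # light that reaches each interval
    weights = alphas * trans                      # contribution of each sample to the pixel
    return (weights[:, None] * rgbs).sum(axis=0)  # composited ray color
```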
Networks can be trained to remove shadows cast on human faces and to soften harsh lighting.
Machine learning can be used to train cameras to autofocus (which is not the same problem as "depth from defocus").
We predict a volume from an input stereo pair that can be used to calculate incident lighting at any 3D point within a scene.
If you want to photograph the sky, it helps to know where the sky is.
By rethinking metering, white balance, and tone mapping, we can take pictures in places too dark for humans to see clearly.
Variational auto-encoders can be used to disentangle a character's style from its content.
Considering the optics of dual-pixel image sensors improves monocular depth estimation techniques.
Training a neural network on light stage scans and environment maps produces an effective relighting method.
A single robust loss function is a superset of many other common robust loss functions, and allows training to automatically adapt the robustness of its own loss.
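For reference, a minimal NumPy sketch of a loss with this structure, written from the commonly cited closed form with shape parameter alpha and scale c; treat it as an illustration rather than the paper's reference implementation. The explicit branches sidestep removable singularities of the general form at alpha = 0 and alpha = 2, which a practical implementation would handle smoothly if alpha is being optimized:

```python
import numpy as np

def robust_loss(x, alpha, c):
    """alpha = 2 -> L2, alpha = 1 -> Charbonnier, alpha = 0 -> Cauchy,
    alpha = -2 -> Geman-McClure, alpha -> -inf -> Welsch."""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z                 # limit of the general form as alpha -> 2
    if alpha == 0.0:
        return np.log1p(0.5 * z)       # limit as alpha -> 0
    if alpha == -np.inf:
        return 1.0 - np.exp(-0.5 * z)  # limit as alpha -> -inf
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)
```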
View extrapolation with multiplane images works better if you reason about disocclusions and disparity sampling frequencies.
We can learn a better denoising model by processing and unprocessing images the same way a camera does.
Frame interpolation techniques can be used to train a network that directly synthesizes linear blur kernels.
By making one camera in a stereo pair hyperspectral we can multiplex dark flash pairs in space instead of time.
Depth cues from camera motion allow for real-time occlusion effects in augmented reality applications.
Dual pixel cameras and semantic segmentation algorithms can be used for shallow depth of field effects.
This system is the basis for "Portrait Mode" on Google Pixel 2 smartphones.
Varying a camera's aperture provides a supervisory signal that can teach a neural network to do monocular depth estimation.
We train a network to predict linear kernels that denoise noisy bursts from cellphone cameras.
A reformulation of the bilateral solver can be implemented efficiently on GPUs and FPGAs.
By training a deep network in bilateral space we can learn a model for high-resolution and real-time image enhancement.
Color space can be aliased, allowing white balance models to be learned and evaluated in the frequency domain. This improves accuracy by 13–20% and speed by 250–3000×.
This technology is used by Google Pixel, Google Photos, and Google Maps.
Using computer vision and a ring of cameras, we can make video for virtual reality headsets that is both stereo and 360°.
This technology is used by Jump.
Mobile phones can take beautiful photographs in low-light or high dynamic range environments by aligning and merging a burst of images.
This technology is used by the Nexus HDR+ feature.
Our solver smooths things better than other filters and faster than other optimization algorithms, and you can backprop through it.
Standard techniques for stereo calibration don't work for cheap mobile cameras.
By integrating an edge-aware filter into a convolutional neural network we can learn an edge-detector while improving semantic segmentation.
By framing white balance as a chroma localization task we can discriminatively learn a color constancy model that beats the state-of-the-art by 40%.
The monocular depth estimates produced by fully convolutional networks can be used to inform intrinsic image estimation.
By embedding a stereo optimization problem in "bilateral-space" we can very quickly solve for an edge-aware depth map, letting us render beautiful depth-of-field effects.
This technology is used by the Google Camera "Lens Blur" feature.
We produce state-of-the-art contours, regions, and object candidates, and we compute normalized-cuts eigenvectors 20× faster.
This paper subsumes our CVPR 2014 paper.
Shape, Illumination, and Reflectance from Shading
Jonathan T. Barron, Jitendra Malik
TPAMI, 2015
bibtex / keynote (or powerpoint, PDF) / video / code & data / kudos
We present SIRFS, which can estimate shape, chromatic illumination, reflectance, and shading from a single image of a masked object.
This paper subsumes our CVPR 2011, CVPR 2012, and ECCV 2012 papers.
This paper is subsumed by our journal paper.
We present a technique for efficient per-voxel linear classification, which enables accurate and fast semantic segmentation of volumetric Drosophila imagery.
Our system allows users to create textured 3D models of themselves in arbitrary poses using only a single 3D sensor.
By embedding mixtures of shapes & lights into a soft segmentation of an image, and by leveraging the output of the Kinect, we can extend SIRFS to scenes.
TPAMI Journal version: version / bibtex
Boundary cues (like occlusions and folds) can be used for shape reconstruction, which improves object recognition for humans and computers.
This paper is subsumed by SIRFS.
This paper is subsumed by SIRFS.
We present a large RGB-D dataset of indoor scenes and investigate ways to improve object detection using depth information.
This paper is subsumed by SIRFS.
A model and feature representation that allows for sub-linear coarse-to-fine semantic segmentation.
Markov Decision Problems which lie in a low-dimensional latent space can be decomposed, allowing modified RL algorithms to run orders of magnitude faster in parallel.
Using the relative motions of stars we can accurately estimate the date of origin of historical astronomical images.
We use computer vision techniques to identify and remove diffraction spikes and reflection halos in the USNO-B Catalog.
In use at Astrometry.net
Feel free to steal this website's source code. Do not scrape the HTML from this page itself, as it includes analytics tags that you do not want on your own website — use the github code instead. Also, consider using Leonid Keselman's Jekyll fork of this page.