Imagine being able to reach into a photograph and change it to your liking.
That’s the thinking that led Brown researcher Rahul Sajnani GS, a PhD candidate in the Department of Computer Science, to build GeoDiffuser. The geometry-based image editing model optimizes how objects within photos are rotated, translated or removed.
For GeoDiffuser, developed in partnership with Amazon Robotics, Sajnani and his co-authors won the Best Student Paper award at the Institute of Electrical and Electronics Engineers/Computer Vision Foundation Winter Conference on Applications of Computer Vision, or WACV.
“Initially, our research was focused on novel view synthesis, taking an image and trying to generate what an object in it would look like from a new angle,” Sajnani said in an interview with The Herald. “But we realized that the idea of applying geometric transformations could extend far beyond that.”
GeoDiffuser works differently from traditional image editing models, which often require fine-tuning on large visual datasets or retraining for each new task. Instead, Sajnani’s method frames the edit as an optimization problem: A geometric transformation tells the model how to rotate or move an object in 3D space, and that change is injected into the generative model’s attention layers, the components that determine which parts of the input the model focuses on. The result is a training-free technique and an image that stays faithful to the qualities of the object being transformed.
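To give a sense of what a geometric transformation is in this context, here is a minimal sketch, not GeoDiffuser’s actual code, that builds a rigid-body transformation matrix in Python with NumPy and applies it to a 3D point:

```python
import numpy as np

def rigid_transform(yaw_deg: float, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 matrix that rotates about the vertical axis and translates."""
    theta = np.deg2rad(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = np.array([
        [np.cos(theta),  0.0, np.sin(theta)],
        [0.0,            1.0, 0.0],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    T[:3, 3] = translation
    return T

# Rotate an object 30 degrees and shift it 0.5 units to the right.
point = np.array([[0.0, 0.0, 2.0, 1.0]])        # homogeneous 3D point
T = rigid_transform(30.0, np.array([0.5, 0.0, 0.0]))
moved = (T @ point.T).T
```

A single matrix like this can encode a rotation, a translation or both at once, which is what makes it a compact way to describe an edit.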
“It’s like being a camera operator on a movie set,” Sajnani explained. “You don’t change the scene itself, you’re just choosing where to look.”
GeoDiffuser builds on existing diffusion models, such as the ones powering DALL-E or Stable Diffusion.
The model runs two versions of the image generation in parallel: One recreates the original image, and the other makes the desired edit. A “shared attention” mechanism links the two, so the model can keep the background and other unedited parts consistent between the original and edited images.
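The paper does not expose a drop-in API for this step, but the core idea of shared attention can be sketched in PyTorch: The edit branch’s queries attend to the reconstruction branch’s keys and values, which is what ties the two generations together. All tensor names here are illustrative:

```python
import torch
import torch.nn.functional as F

def shared_attention(q_edit, k_ref, v_ref):
    """Attend from the edit branch's queries to the reconstruction branch's
    keys/values, so unedited content stays consistent across branches."""
    scale = q_edit.shape[-1] ** -0.5
    attn = F.softmax(q_edit @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref

# Toy tensors: batch of 1, 16 image tokens, 64-dim features.
q = torch.randn(1, 16, 64)   # queries from the edited image's pass
k = torch.randn(1, 16, 64)   # keys from the reconstruction pass
v = torch.randn(1, 16, 64)   # values from the reconstruction pass
out = shared_attention(q, k, v)
```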
This link comes from a method Sajnani described as “injecting geometry.” GeoDiffuser uses depth maps, images that encode the distance of each pixel in a scene from the camera, as well as transformation matrices to tell the model where to send each pixel, Sajnani explained. These transformations are slipped into the model’s attention layers and guide edits without changing the model’s weights, the parameters learned during training.
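In spirit, this resembles classic depth-based image warping: Each pixel is lifted into 3D using its depth value, moved by the transformation matrix and projected back into the image. A simplified NumPy sketch, assuming a pinhole camera with made-up intrinsics, looks like this:

```python
import numpy as np

def warp_pixels(depth, K, T):
    """Map each pixel to its new location under a 3D transform T,
    using a depth map and pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Unproject: pixel -> 3D point at its measured depth.
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    pts = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    # Transform in 3D, then reproject onto the image plane.
    moved = (T @ pts.T).T[:, :3]
    proj = (K @ moved.T).T
    return proj[:, :2] / proj[:, 2:3]   # new (u, v) for every pixel

# Toy inputs: a flat 4x4 depth map and a transform shifted 0.1 units right.
K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
T = np.eye(4); T[0, 3] = 0.1
new_uv = warp_pixels(np.full((4, 4), 2.0), K, T)
```

The resulting pixel correspondences are what get “slipped into” the attention layers, rather than applied directly to the image.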
First, the model is given an instruction, such as “move this car to the right” or “rotate this dog to face left,” which leads it to compute the corresponding geometric transformation and feed that information into the diffusion process. The model then iteratively refines the image, guided by loss functions that steer the denoising toward the desired edit.
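The paper’s actual objectives are more involved, but the flavor of that refinement loop can be sketched with placeholder losses: One term keeps unedited regions pinned to the reconstruction branch, while another regularizes the edit region. Everything below, including the loss terms and weights, is illustrative:

```python
import torch

# Hypothetical stand-ins: a latent being refined and a simple edit mask.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
target_background = torch.randn(1, 4, 64, 64)   # from the reconstruction pass
mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:40, 20:40] = 1.0                   # region being edited

optimizer = torch.optim.Adam([latent], lr=1e-2)
for step in range(50):
    optimizer.zero_grad()
    # Keep unedited regions close to the original reconstruction...
    background_loss = (((latent - target_background) * (1 - mask)) ** 2).mean()
    # ...and regularize the edited region (a placeholder smoothness term).
    edit_loss = (latent * mask).abs().mean()
    loss = background_loss + 0.1 * edit_loss
    loss.backward()
    optimizer.step()
```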
GeoDiffuser’s design enables the model to efficiently preserve the identity of an object. In one experiment, GeoDiffuser removed a boat sitting on a lake and erased its shadow and reflection on the water automatically.
For Sajnani, the most rewarding part of the project wasn’t just optimizing a tool: It was also understanding systems and how they operate.
“When you can interpret what a model is doing and then manipulate it to do even more than it was trained for, that’s what excites me,” he said. “That’s the curiosity I want to keep following.”
Due to the COVID-19 pandemic, WACV 2025 in Tucson, Arizona, was Sajnani’s first in-person conference.
“You read all the papers by people you admire and then suddenly you’re standing in front of them, giving a talk about your own work,” he said. “It was surreal.”
Sajnani hopes to develop new models in the future that could aid perception and training-data generation for robotic movement. But, he said, the real challenge isn’t moving pixels around; it’s figuring out where and how to apply transformations within the labyrinthine layers of a generative model.
“Most of these models don’t have geometry baked in,” he said. “And we don’t fully understand what each layer is doing.”
According to Sajnani, the model can still be improved. GeoDiffuser works best with modest changes to objects, like rotations of 45 to 60 degrees, translations and removals. Extreme rotations, like turning a human 180 degrees, remain an open challenge, he explained.
Sajnani added that while today’s image editing software might let you nudge a lamp or spin a parked car, tomorrow’s tools could edit entire 3D environments on the fly.