A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent
(i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout
nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise the following questions: how effective can such simple visual sensors be, and how important is their design? We explore simple sensors with resolutions as low as one-by-one pixel, each representing a single photoreceptor. First, we demonstrate that just a
few photoreceptors can be enough to solve many tasks, such as visual navigation and continuous control, reasonably well, with performance comparable
to that of a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide
useful information and successfully solve these tasks. To find a well-performing design, we present a computational design optimization algorithm and
evaluate its effectiveness across different tasks and domains, showing promising results. Finally, we perform a human survey to evaluate the
effectiveness of intuitive designs devised manually by humans, showing that the computationally found design is among the best designs in most cases.
In nature, visual sensors are represented by a wide spectrum of different eyes, including different types of lens structure, pupil shape, resolution, size, placement and more. Source: Nilsson 2021.
Some animals have extremely simple eyes (compared to, e.g., human eyes). See, for example, the qualitative comparison of the resolution of different animals' eyes on the right. Despite the extremely low-resolution visual signal from their eyes, rats, flies, and even mosquitos exhibit rather complex behaviors. Inspired by these examples, we explore how effective simple visual sensors can be in solving computer vision tasks. Specifically, we replace a standard high-resolution camera with a few extremely simple photoreceptor sensors with resolution as low as 1x1 (one can also think of each as a camera with resolution 1x1).
In the figure below, we show how the performance of a navigating agent depends on the resolution of its visual sensor. We find that even with an extremely low-resolution sensor, the agent can solve the task reasonably well.
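As a concrete illustration of what such a low-resolution signal looks like, the sketch below emulates an N x N photoreceptor grid by average-pooling a rendered RGB frame. This is an approximation for exposition only; the function and pooling scheme are ours and not necessarily how the simulator produces photoreceptor readings.

```python
import numpy as np

def photoreceptor_readout(frame: np.ndarray, grid: int = 1) -> np.ndarray:
    """Approximate a grid x grid photoreceptor sensor by average-pooling
    a high-resolution RGB frame of shape (H, W, 3) into grid x grid cells.

    With grid=1 the whole view collapses into a single RGB value,
    i.e. a 1x1 "camera" / single photoreceptor.
    """
    h, w, c = frame.shape
    # Crop so the frame divides evenly into grid x grid cells.
    h_crop, w_crop = (h // grid) * grid, (w // grid) * grid
    frame = frame[:h_crop, :w_crop].astype(np.float32)
    # Split into (grid, cell_h, grid, cell_w, 3) and average each cell.
    cells = frame.reshape(grid, h_crop // grid, grid, w_crop // grid, c)
    return cells.mean(axis=(1, 3))  # shape: (grid, grid, 3)

# Example: a 128x128 rendered frame reduced to a 4x4 photoreceptor grid.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
print(photoreceptor_readout(frame, grid=4).shape)  # (4, 4, 3)
```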
A noteworthy point about biological agents is that they are almost never unimodal; they solve their problems using multiple modalities. Utilizing multimodality (see, e.g., here) is an interesting direction parallel to the discussion in this project.
This rich variety of animal eye designs (position and orientation, in the context of this work) is believed to emerge as an evolutionary adaptation to an animal's specific morphology and the ecology in which it lives, and to be important to its effectiveness (see The Ecological Approach to Visual Perception by J.J. Gibson). Strategic placement of even the simplest sensors can enable complex behaviors such as obstacle avoidance, detection of coarse landmarks, and even some forms of predator avoidance. For example, in dragonflies, an acute zone facing upward is hypothesized to allow more efficient prey detection by positioning prey against the sky instead of a cluttered foliage background (see video by Real Science). Similarly, some species of surface-feeding fish have eyes with horizontal acute zones, positioned to take advantage of the refractive index of water, that allow them to see prey both above and below the surface even while entirely submerged. For a more unusual example, see Anableps anableps, the four-eyed fish on the left (see video by California Academy of Sciences).
To evaluate the effectiveness of simple photoreceptor (PR) sensors, we study agents with (well-designed) PR sensors on navigation (AI Habitat) and dynamic control (DeepMind Control) tasks. We compare the performance of these agents to that of a camera-equipped agent and of a blind agent without access to any visual signal.
As seen in the figure on the right, we find that even a handful of well-placed photoreceptors can provide sufficient information to solve some vision tasks with reasonably good performance: significantly higher than that of a blind agent (scaled to 0) and similar to that of a more complex camera sensor (scaled to 1).
We consider two navigation tasks in the AI Habitat simulator using scans of real-world scenes from the Matterport3D dataset. In PointGoalNav, an agent has to navigate to a given target coordinate in a scene. In TargetNav, the agent does not have access to the target coordinate and has to find and navigate to the target sphere, which can only be identified visually. As per standard practice, in addition to the visual signal, the agent has access to a GPS+Compass sensor providing its position and orientation w.r.t. its start position.
We study two baselines in comparison to photoreceptor-based agents: an intelligent blind agent and a camera agent. The intelligent blind agent only has access to the GPS+Compass sensor, while the camera agent additionally has access to visual input of resolution 128x128.
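For illustration only, the gym-style observation spaces below sketch what each of the three agent variants can observe; the sensor keys and shapes are assumptions for exposition, not the exact simulator configuration.

```python
import numpy as np
from gymnasium import spaces

# Illustrative observation spaces for the three agent variants; sensor keys
# and shapes are assumptions, not the exact simulator configuration.
gps_compass = spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)

blind_agent_obs = spaces.Dict({"gps_compass": gps_compass})

camera_agent_obs = spaces.Dict({
    "gps_compass": gps_compass,
    "rgb": spaces.Box(low=0, high=255, shape=(128, 128, 3), dtype=np.uint8),
})

# 32 photoreceptors (two 4x4 grids), each reading a single RGB value:
# a tiny fraction of the camera signal.
pr_agent_obs = spaces.Dict({
    "gps_compass": gps_compass,
    "photoreceptors": spaces.Box(low=0, high=255, shape=(32, 3), dtype=np.uint8),
})
```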
PR agents have few obstacle collisions, comparable to a camera agent and significantly fewer than an intelligent blind agent. In the video below, it can be seen that the blind agent encounters many collisions (red dots), relies heavily on wall-following, and has low efficiency, i.e., obtains a low SPL. The PR agent, similar to the camera agent, does better long-term path planning, avoids obstacles, and has high efficiency, i.e., achieves a high SPL.
PR agents efficiently explore the scene and find the target in TargetNav. The location of the target sphere is random and unknown in the TargetNav task; therefore, an agent must explore the scene in order to find it.
For each agent, we plot 50 trajectories from two episodes of unseen test scenes. The PR agent explores the scene as efficiently as the camera agent and achieves comparable performance.
We evaluate the performance of agents on navigation tasks (PointGoalNav and TargetNav) in the AI Habitat Simulator, where we compare the performance of the PR-based agent to the performance of a camera-based agent (both egocentric) and an intelligent blind agent.
Even a simple visual sensor consisting of 32 PRs (two 4x4 grids) provides useful information, allowing the corresponding agent to significantly outperform an intelligent blind agent without a visual signal.
The PR agents match the performance of the camera agent using the same shallow 3-layer Transformer encoder as for PRs (for a fair comparison), while having a visual signal bandwidth of only ≈ 1% of that of the 128x128 camera sensor. When compared to the camera agent using the ResNet-50 encoder, the default ("gold standard") choice in the literature, PR agents still perform reasonably well.
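As a rough sketch of such an encoder, the PyTorch snippet below treats each photoreceptor reading as one token and passes the token sequence through a 3-layer Transformer encoder; the embedding size, number of heads, and pooling are illustrative assumptions, not the exact architecture used.

```python
import torch
import torch.nn as nn

class PREncoder(nn.Module):
    """Encode N photoreceptor readings (one RGB value each) with a shallow
    Transformer. Each photoreceptor becomes one token; a learned embedding of
    its design parameters could also be added to each token (omitted here)."""

    def __init__(self, num_prs: int = 32, d_model: int = 64, num_layers: int = 3):
        super().__init__()
        self.proj = nn.Linear(3, d_model)                        # RGB -> token
        self.pos = nn.Parameter(torch.zeros(num_prs, d_model))   # per-PR embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, prs: torch.Tensor) -> torch.Tensor:
        # prs: (batch, num_prs, 3), values normalized to [0, 1]
        tokens = self.proj(prs) + self.pos
        encoded = self.encoder(tokens)       # (batch, num_prs, d_model)
        return encoded.mean(dim=1)           # pooled feature fed to the policy

enc = PREncoder()
feat = enc(torch.rand(2, 32, 3))
print(feat.shape)  # torch.Size([2, 64])
```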
The DeepMind Control (DMC) suite provides a variety of continuous control tasks, including reaching, manipulation, locomotion, etc. We use the following six tasks: Reacher: Hard, Walker: Stand, Walker: Walk, Walker: Run, Finger: Spin, and Finger: Turn Easy.
In these tasks, visual information from photoreceptors is used to sense the agent's state, a role that can also be fulfilled by a camera or nonvisual joint sensors. In our experiments, however, the agents rely solely on visual information.
To evaluate generalization and ensure that the strong performance of photoreceptors is not confined to simulators, we conducted the target navigation experiment (without access to GPS+Compass sensor) in a real-world setting.
We deployed a control policy using 64 PRs (less than 1% of the camera resolution) on a real robot. It demonstrates impressive performance, successfully navigating to the target ball in an unknown room with no real-world training, relying solely on the low-resolution visual signal.
The design of a photoreceptor sensor (or a B x B grid of them) is varied by changing its extrinsic (position and orientation) and intrinsic (field of view) parameters. The position of a sensor is defined by a 3-dimensional vector in space, constrained to lie on the agent's body. The orientation of a sensor is defined by its pitch, yaw, and roll, while the field of view can be between 0 and 180 degrees. Therefore, each sensor has an associated 7-dimensional design parameter vector θ_i = (x_i, y_i, z_i, pitch_i, yaw_i, roll_i, fov_i).
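A minimal representation of this design vector, with the bounds described above, could look like the following (field names are for exposition only):

```python
from dataclasses import dataclass

@dataclass
class SensorDesign:
    """Design parameters theta_i of one photoreceptor sensor (or B x B grid)."""
    x: float       # position (meters), constrained to the agent's body
    y: float
    z: float
    pitch: float   # orientation (degrees)
    yaw: float
    roll: float
    fov: float     # field of view (degrees), 0 < fov <= 180

    def as_vector(self) -> list:
        return [self.x, self.y, self.z, self.pitch, self.yaw, self.roll, self.fov]

# Example: a forward-facing sensor with a wide field of view.
theta = SensorDesign(x=0.0, y=0.1, z=0.0, pitch=0.0, yaw=0.0, roll=0.0, fov=120.0)
```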
The design of photoreceptors plays a crucial role in achieving good performance. The performance of an agent with a poor design can drop drastically compared to the best design for the corresponding task as seen below. For each setting (task and number of sensors), the variance in performance of different designs is large.
To find a well-performing design automatically, we develop a computational design optimization method, which tailors designs to the specific agent's morphology, environment, and task. A naive approach would be to use a black-box optimization method, training a design-specific control policy for each new design proposal and selecting the best one. This approach is prohibitively expensive, as training a control policy for each new design is costly. Instead, we cast the problem as a joint optimization of the design and control policies. To have a control policy that can adapt to different designs, we train a design-conditional "generalist" policy that implements control for any given design. This results in an efficient optimization method which allows us to find well-performing designs leading to close-to-camera performance in most cases.
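A high-level sketch of this idea is shown below: a cross-entropy-style search over design vectors ranks candidate designs with a single design-conditional policy, instead of retraining a policy per design. The search strategy, interfaces (train_step, evaluate, env_fn), and hyperparameters are placeholder assumptions, not the exact method used in this work.

```python
import numpy as np

def optimize_design(policy, env_fn, num_iters=100, pop_size=16, elite_frac=0.25):
    """Sketch of joint design/policy optimization.

    `policy` is a design-conditional ("generalist") policy assumed to take the
    design vector as part of its input; `env_fn(theta)` builds an environment
    with sensors placed according to `theta`. Both are placeholders.
    """
    dim = 7                                  # (x, y, z, pitch, yaw, roll, fov)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(num_iters):
        designs = mean + std * np.random.randn(pop_size, dim)
        # The generalist policy is trained on rollouts from many sampled designs...
        for theta in designs:
            policy.train_step(env_fn(theta), design=theta)
        # ...and its returns rank the designs without per-design retraining.
        returns = np.array([policy.evaluate(env_fn(theta), design=theta)
                            for theta in designs])
        elite = designs[np.argsort(returns)[-int(pop_size * elite_frac):]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean
```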
The design optimization method enhances the performance of a camera agent as well, finding a more effective design than the intuitive one in both navigation tasks. In PointGoalNav, computational design improves the efficiency metric SPL from 0.447 to 0.518. In TargetNav, it boosts the success rate from 0.363 to 0.405.
Since there is no single obvious way to design simple visual sensors, we conducted a human survey to collect photoreceptor designs based on intuition.
We ask participants to choose the photoreceptors' design parameters within our defined design space for a given task and agent morphology. We collect eight designs for the TargetNav visual navigation task and six designs for the Walker agent in DMC, evaluating the latter on the Walk and Stand tasks.
We find that the best human intuition can provide well-performing designs, and computational design is among the best designs (or the best one) in most cases. We also find a high variance in the performance of different intuitive designs in all settings, signifying the importance of a computational approach to visual design.
While human design is effective in anthropomorphic and intuitive domains, such as placing a camera on a robot, many design areas are not native to humans. Examples include distributed low-resolution eyes and environments unlike typical living spaces, such as medical imaging. These scenarios highlight the need for a computational design framework to overcome anthropomorphic biases.
We designed an experiment to verify that a PR-equipped agent can detect and move towards a target in the TargetNav task. By replacing the green target sphere with a transparent one, we observed that, while initially following the same path, the PR agent moves towards the green sphere upon recognizing it but continues searching when the target is transparent.
Quantitatively, the success rate of the PR agent drops significantly from 0.314 to 0.132 when the green sphere is replaced with a transparent one. This confirms the photoreceptors' target recognition ability.
In this work, we demonstrate that extremely simple visual sensors like photoreceptors can effectively solve vision tasks requiring an understanding of the surrounding world and proprioception. This mirrors examples in nature, where simplicity can lead to intelligent and complex behaviors, suggesting an interesting research direction beyond the trend of training larger models on more data and complex sensors.
We show that design optimization of simple visual sensors is crucial for achieving performance comparable to more complex camera sensors. We propose a computational design optimization method to improve initial designs and find well-performing ones. Inspired by the diversity, simplicity, and specialization of natural vision, this direction prompts us to rethink how computer vision systems can be made more effective yet simplified, especially for unusual and unintuitive application domains (e.g. the Pillcam and the Xenobot). Below, we discuss some limitations of our work.
We would like to thank Jason Toskov for his expertise and efforts in running the real-world demonstration of PR agents. We also thank Roman Bachmann for his help with some visualizations.