A new test developed by researchers at Brown and Johns Hopkins University could drive significant improvements in the ability of computer vision systems to recognize objects. The research team constructed a “visual Turing test” to evaluate how well computer vision systems understand images compared to humans, according to the paper.
The study — published ahead of print March 9 in the journal Proceedings of the National Academy of Sciences — was led by Donald Geman, professor of applied mathematics and statistics at Johns Hopkins, and his brother Stuart Geman, professor of applied mathematics at the University.
The test builds upon the work of Alan Turing, the British computer scientist widely regarded as the father of the modern computer. The Turing test proposes a method for assessing a machine’s thinking capacity: if a human is unable to distinguish the machine from another human in a “natural language conversation,” then the machine exhibits at least a human level of thinking, according to the study.
The visual Turing test constructed in the study uses a device that generates unpredictable, binary yes-or-no questions about an image. The test is unique in that it focuses on a system’s ability to understand a scene, not just to detect the objects in it.
“Automated systems are not asked to do much compared to the depth of description human beings can give based on a single image,” Donald Geman said. The test is a “way to scale up evaluation,” he added.
The test poses “high-level questions which likely require a computer vision system to know other things like where the objects are, and a rough layout of the scene,” said James Hays, assistant professor of computer science, who was not involved in the study. “These are very hard questions for a computer vision system” to answer, he added.
“The test questions are about the existence of objects, relationships and attributes of the objects,” Donald Geman said. For example, the test might ask the system “Is there a red car?” or “Is there a person in the designated region?”
The test questions are first administered to a human operator, who either provides the correct answer or rejects the question as ambiguous. The test is then given to the computer vision system. The correct answer to a question is provided before the next question is presented. Systems are evaluated by comparing the computer’s answers to those provided by the human operator.
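That administration loop can be pictured roughly as follows. This is a minimal sketch, assuming hypothetical QueryGenerator, VisionSystem and human-operator interfaces that stand in for the machinery described in the paper, not the authors’ actual implementation.

```python
# Sketch of the visual Turing test protocol described above.
# query_generator, vision_system and human_operator are hypothetical
# stand-ins, not the paper's actual code.

AMBIGUOUS = "ambiguous"

def run_visual_turing_test(image, query_generator, vision_system,
                           human_operator, num_questions=20):
    """Administer unpredictable yes/no questions about one image and
    return the fraction on which the machine agrees with the human."""
    history = []      # (question, correct_answer) pairs revealed so far
    agreements = 0
    asked = 0

    while asked < num_questions:
        question = query_generator.next_question(image, history)
        if question is None:   # no further unpredictable questions available
            break

        # The human operator answers first, or rejects the question as ambiguous.
        truth = human_operator.answer(image, question)
        if truth == AMBIGUOUS:
            continue

        # The computer vision system answers the same yes/no question.
        machine_answer = vision_system.answer(image, question)
        agreements += int(machine_answer == truth)

        # The correct answer is revealed before the next question is generated.
        history.append((question, truth))
        asked += 1

    return agreements / asked if asked else 0.0
```

The key design point the sketch tries to capture is that each new question is generated conditioned on the history of questions and answers already revealed, which is what keeps the questions unpredictable.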
“The key property of the questions that are asked is that they are unpredictable,” said Donald Geman, adding that it is vital that the correct answers to the questions do not provide clues to the answers to subsequent questions.
The query generator — the core of the system — is learned from a database containing thousands of images annotated by humans, Donald Geman said. These annotations include the identities, attributes and relationships of the objects in the images.
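Those annotations can be pictured as structured records attached to each image. The sketch below uses illustrative field names that are assumptions for the purpose of the example, not the dataset’s actual schema.

```python
# Minimal sketch of the kind of human annotation the query generator is
# learned from; field names are illustrative, not the actual schema.

from dataclasses import dataclass, field

@dataclass
class AnnotatedObject:
    object_id: int
    category: str                                      # identity, e.g. "car", "person"
    attributes: list = field(default_factory=list)     # e.g. ["red", "parked"]
    region: tuple = (0, 0, 0, 0)                       # bounding box (x, y, width, height)

@dataclass
class AnnotatedImage:
    image_id: str
    objects: list = field(default_factory=list)        # AnnotatedObject entries
    relationships: list = field(default_factory=list)  # e.g. (subject_id, "next to", object_id)

# Questions of the kind quoted above could be instantiated from such records:
#   "Is there a red car?"             -> existence plus an attribute
#   "Is there a person in region R?"  -> existence within a designated region
#   "Is the person next to the car?"  -> relationship between two objects
```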
Generating such unpredictable queries from these images posed a challenge to the researchers. Due to similarities in the sample images, the researchers said they found it difficult to generate a large number of questions whose answers could distinguish one image from the next. “So after you have asked a certain number of questions, you have very few samples by which to estimate a new question,” Donald Geman said.
The paper demonstrates that it is feasible to generate a Turing test for computer vision systems. “This test raises all sorts of questions,” Hays said, adding that the most immediate question is how well computer vision systems would work when presented with this new benchmark.
Donald Geman said he hopes members of the computer vision community will rise to the challenge that the visual Turing test poses, “even though they may fail miserably.”