Researchers from North Carolina State University have developed a methodology that aims to improve vision transformers’ ability to identify objects in images.
A vision transformer (ViT) is an AI technology that can identify and categorize objects in images. However, ViTs face significant challenges related to decision-making transparency and computing requirements. Now, researchers have developed a new methodology that addresses both of these challenges.
The method also improves the ability of a vision transformer to identify, classify, and segment objects in images.
What are vision transformers?
Transformers are among the most powerful existing AI technologies.
ChatGPT, for example, is an AI technology that uses transformer architecture. In this case, the inputs used to train it are language.
Vision transformers are transformer-based AI models that are trained using visual inputs.
This AI technology could be used to detect and categorize objects in an image, such as identifying all of the cars in an image.
Vision transformers face two challenges
The first challenge faced by ViTs is that transformer models are extremely complex. Transformer models require a significant amount of computational power and use a large amount of memory – relative to the amount of data being plugged into the AI.
This is problematic for a vision transformer because images contain a lot of data.
The second challenge is that it is difficult for users to understand how ViTs make decisions.
For example, a vision transformer could be trained to identify dogs in an image. It is not entirely clear, however, how the vision transformer determines what is and is not a dog.
Depending on the application, it can be important to understand the vision transformer’s decision-making process, a property known as model interpretability.
The new vision transformer methodology addresses both challenges
The new vision transformer methodology developed by the team is called ‘Patch-to-Cluster attention’ (PaCa) and improves the efficiency of the AI technology.
“We address the challenge related to computational and memory demands by using clustering techniques, which allow the transformer architecture to better identify and focus on objects in an image,” said Tianfu Wu, corresponding author of a paper on the work and an associate professor of electrical and computer engineering at North Carolina State University.
“Clustering is when the AI lumps sections of the image together, based on similarities it finds in the image data. This significantly reduces computational demands on the system. Before clustering, computational demands for a ViT are quadratic. For example, if the system breaks an image down into 100 smaller units, it would need to compare all 100 units to each other – which would be 10,000 complex functions.
“By clustering, we’re able to make this a linear process, where each smaller unit only needs to be compared to a predetermined number of clusters. Let’s say you tell the system to establish ten clusters; that would only be 1,000 complex functions,” Wu said.
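The arithmetic in Wu's example can be checked with a small sketch. The Python below is an illustration only, not the team's implementation: it runs plain scaled dot-product attention twice, once comparing 100 patches with each other and once comparing the same 100 patches with 10 clusters. The helper names (`attend`, `mean_pool_clusters`) are hypothetical, and the clustering step is a stand-in assumption that simply averages neighboring patches, since the article does not describe how PaCa forms its clusters.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention.

    Returns the outputs and the weight matrix, which has one entry
    per (query, key) comparison.
    """
    scale = math.sqrt(len(queries[0]))
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    dim = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def mean_pool_clusters(patches, num_clusters):
    """ASSUMPTION: stand-in clustering that averages contiguous groups of
    patches. This is not PaCa's clustering, which the article does not detail."""
    n, dim = len(patches), len(patches[0])
    clusters = []
    for c in range(num_clusters):
        group = patches[c * n // num_clusters:(c + 1) * n // num_clusters]
        clusters.append([sum(p[d] for p in group) / len(group) for d in range(dim)])
    return clusters


random.seed(0)
num_patches, num_clusters, dim = 100, 10, 8  # 100 and 10 follow Wu's example
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

# Patch-to-patch: every patch is compared with every other patch.
_, full_weights = attend(patches, patches, patches)

# Patch-to-cluster: every patch is compared with a fixed number of clusters.
clusters = mean_pool_clusters(patches, num_clusters)
clustered_out, cluster_weights = attend(patches, clusters, clusters)

full_comparisons = len(full_weights) * len(full_weights[0])
clustered_comparisons = len(cluster_weights) * len(cluster_weights[0])
print(full_comparisons, clustered_comparisons)  # prints "10000 1000"
```

Doubling the number of patches in this sketch quadruples the first count but only doubles the second, which is the quadratic-versus-linear difference Wu describes. Each patch still receives an output of the same size in both cases; only the number of comparisons changes.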
The team completed testing of PaCa
The researchers extensively tested the new vision transformer methodology by comparing it to two state-of-the-art ViTs called Swin and PVT.
“We found that PaCa outperformed Swin and PVT in every way,” Wu said.
“PaCa was better at classifying objects in images, better at identifying objects in images, and better at segmentation – essentially outlining the boundaries of objects in images. It was also more efficient, meaning that it was able to perform those tasks more quickly than the other ViTs.
“The next step for us is to scale up PaCa by training on larger, foundational data sets.”