Abstract:
Crossmodal image-text retrieval involves retrieving relevant images or texts given a query from the opposite modality. Its primary challenge lies in precisely quantifying the similarity used for feature matching between the two modalities, which is essential for bridging the visual-semantic gap between the visual and linguistic domains. It has extensive applications in areas such as e-commerce product search and medical image retrieval. Traditional retrieval paradigms rely on deep learning to extract feature representations from images and texts: crossmodal image-text retrieval learns semantic representations of each modality by exploiting the strong feature-extraction ability of deep networks and then maps them into a shared semantic space for semantic alignment. However, this approach primarily captures superficial data correlations and cannot reveal the latent causal relationships underlying the data. Moreover, owing to the inherent "black-box" nature of deep learning, model predictions are often difficult for humans to interpret. In addition, an undue reliance on the training data distribution impairs the generalization performance of the model. Consequently, existing methods struggle to represent high-level semantics while remaining interpretable. Causal inference, which estimates the causal effect of a phenomenon by isolating confounding factors through intervention, offers a new avenue for enhancing the generalization capability and interpretability of deep models, and researchers have recently begun to combine vision-and-language tasks with causal inference principles. Accordingly, we introduce causal inference and embed consensus knowledge into the deep learning framework, proposing a novel causal image-text retrieval method with embedded consensus knowledge. Specifically, causal intervention is introduced into the visual feature extraction module, replacing correlational relationships with causal ones to cultivate common causal visual features. These features are then fused with the original visual features obtained through bottom-up attention, yielding the final visual feature representation. To address the shortfall in textual feature representation, we adopt bidirectional encoder representations from transformers (BERT) for its strong textual feature extraction ability. Consensus knowledge shared between the two modalities is then embedded, enabling consensus-level representation learning of image and text features. Experiments on the MS-COCO dataset and cross-dataset experiments on Flickr30k show that the proposed method consistently improves recall and mean recall in bidirectional image-text retrieval tasks. In summary, the proposed approach bridges the gap between visual and textual representations by combining causal inference principles with shared consensus knowledge within a deep learning framework, thereby promising enhanced generalization and interpretability.
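To make the described pipeline concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of how causal-branch visual features might be fused with bottom-up-attention region features and matched against BERT-style text features via cosine similarity in a shared space. The class name `FusionRetrievalSketch`, the two projection branches, and all dimensions are assumptions for illustration only; the actual causal intervention and consensus-embedding modules are defined in the paper itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRetrievalSketch(nn.Module):
    """Toy stand-in for the fused visual/textual matching described in the abstract."""

    def __init__(self, region_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        # projects bottom-up-attention region features into the joint space
        self.bottom_up_proj = nn.Linear(region_dim, joint_dim)
        # hypothetical stand-in for the causal-intervention branch producing causal visual features
        self.causal_proj = nn.Linear(region_dim, joint_dim)
        # projects BERT-style token features into the joint space
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def encode_image(self, regions):
        # regions: (batch, num_regions, region_dim)
        bottom_up = self.bottom_up_proj(regions).mean(dim=1)  # original visual features
        causal = self.causal_proj(regions).mean(dim=1)        # "causal" visual features
        fused = bottom_up + causal                            # fuse into the final representation
        return F.normalize(fused, dim=-1)

    def encode_text(self, tokens):
        # tokens: (batch, seq_len, text_dim), e.g. BERT hidden states
        text = self.text_proj(tokens).mean(dim=1)
        return F.normalize(text, dim=-1)

    def similarity(self, regions, tokens):
        # cosine similarity matrix used to rank images against texts (and vice versa)
        return self.encode_image(regions) @ self.encode_text(tokens).t()

# toy usage with random stand-in features
model = FusionRetrievalSketch()
regions = torch.randn(4, 36, 2048)  # e.g. 36 detected regions per image
tokens = torch.randn(4, 20, 768)    # e.g. 20 token embeddings per caption
print(model.similarity(regions, tokens).shape)  # torch.Size([4, 4])
```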