Privacy-Preserving Semantic Document Retrieval: A Survey on SBERT, Federated Learning, and Homomorphic Encryption
Keywords:
Privacy-preserving retrieval, Semantic embeddings, SBERT, Federated learning, Homomorphic encryption, Searchable encryption
Abstract
The rapid growth of digital information has intensified the demand for secure and efficient document retrieval systems, particularly in domains such as healthcare, law, and finance, where data sensitivity is paramount. Traditional keyword-based methods such as TF-IDF and BM25 provide effective retrieval baselines but fail to capture semantic meaning. Advances in deep learning, particularly BERT and its sentence-level extension SBERT (Sentence-BERT), have enabled semantic embeddings that significantly improve contextual relevance. However, deploying such models in privacy-sensitive environments introduces new challenges. This survey provides a comprehensive overview of privacy-preserving semantic document retrieval, focusing on the intersection of SBERT-based embeddings, Federated Learning (FL), and Homomorphic Encryption (HE). We review the foundations of retrieval methods, explore encryption-based and federated frameworks for securing retrieval pipelines, and highlight the role of metaheuristic optimisation techniques (PSO, GA, ACO, BOA) in balancing accuracy, efficiency, and security. Finally, we identify key research gaps, including the trade-off between privacy and accuracy, clustering in encrypted space, and the integration of federated and encrypted approaches, and we discuss future directions for designing scalable, secure, and intelligent retrieval systems.