A semantic search engine for academic papers that combines Latent Semantic Indexing (LSI) with BERT-based neural language models to address the synonymy and polysemy challenges of scholarly literature retrieval.

Latent Semantic Indexing Search Engine

This search engine implements a hybrid two-stage architecture that combines field-weighted Latent Semantic Indexing (LSI) with BERT-based neural language models.

  • Stage 1 (document processing) uses field-weighted LSI enhanced with KeyBERT keyword extraction for efficient candidate selection, applying higher weights to:

    • keywords (3.0x),

    • document titles (3.0x), and

    • abstracts (1.5x).
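The Stage 1 pipeline can be sketched as follows. This is a minimal illustration assuming scikit-learn, with field weighting implemented by summing per-field tf-idf vectors scaled by the weights above (the project's actual weighting scheme may differ), followed by a rank-k truncated SVD, which is the LSI step. The toy documents and field names are assumptions for the example.

```python
# Sketch of Stage 1: field-weighted tf-idf + truncated SVD (LSI).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import scipy.sparse as sp

FIELD_WEIGHTS = {"keywords": 3.0, "title": 3.0, "abstract": 1.5, "body": 1.0}

docs = [
    {"keywords": "lsi retrieval", "title": "Latent semantic indexing",
     "abstract": "Indexing by latent semantic analysis.", "body": "SVD maps terms to concepts."},
    {"keywords": "bert ranking", "title": "Neural re-ranking",
     "abstract": "BERT models for relevance.", "body": "Transformers score query-document pairs."},
    {"keywords": "bm25 baseline", "title": "Probabilistic ranking",
     "abstract": "Term frequency and document length.", "body": "Classic lexical matching."},
]

vectorizer = TfidfVectorizer()
# Fit the vocabulary on all fields so every field shares one term space.
vectorizer.fit(" ".join(d.values()) for d in docs)

def weighted_vector(doc):
    # Sum the per-field tf-idf vectors, each scaled by its field weight.
    return sum(w * vectorizer.transform([doc[f]]) for f, w in FIELD_WEIGHTS.items())

X = sp.vstack([weighted_vector(d) for d in docs])
svd = TruncatedSVD(n_components=2, random_state=0)  # LSI: rank-k SVD of the term matrix
lsi_vectors = svd.fit_transform(X)
print(lsi_vectors.shape)  # (3, 2)
```

Queries are projected into the same low-rank space with `svd.transform`, so candidate selection becomes a similarity search over `lsi_vectors`.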

  • Stage 2 (query processing) provides

    • optional BERT-based semantic re-ranking using Sentence-BERT embeddings with FAISS (Facebook AI Similarity Search) for approximate nearest-neighbor search,

    • followed by a fine-tuned BERT classifier that scores document relevance.
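A minimal sketch of the Stage 2 candidate search, kept dependency-free: random unit vectors stand in for Sentence-BERT embeddings, and a brute-force inner-product search stands in for FAISS (FAISS's `IndexFlatIP` performs exactly this search over normalized vectors). The embedding dimension of 384 matches common Sentence-BERT models but is otherwise an assumption.

```python
# Sketch of Stage 2: embed, search by inner product, take top-k candidates.
import numpy as np

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(100, 384))            # stand-in for SBERT doc embeddings
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

query = doc_emb[42] + 0.01 * rng.normal(size=384)  # a query close to document 42
query /= np.linalg.norm(query)

# On normalized vectors, inner product == cosine similarity.
scores = doc_emb @ query
top10 = np.argsort(-scores)[:10]                 # candidate set for the BERT classifier
print(top10[0])  # 42
```

In the full system these top-k candidates would then be passed, paired with the query text, to the fine-tuned BERT classifier for final relevance scoring.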

Evaluation Results

The following indexes were built for comparison:

  • BM25: Traditional probabilistic ranking baseline

  • Basic LSI: Standard Latent Semantic Indexing with uniform weighting

  • Field-Weighted LSI: Enhanced LSI with title/abstract/body field weighting

  • BERT-Enhanced LSI: Field-weighted LSI with KeyBERT keyword enhancement

  • Complete Hybrid System: Full system with BERT-based semantic re-ranking
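For reference, the BM25 baseline reduces to a short formula. Below is a self-contained scorer using the common defaults k1 = 1.5, b = 0.75 and Lucene-style idf smoothing; the tokenized documents are toy data, not the project's corpus.

```python
# Minimal BM25 scorer (the probabilistic baseline above).
import math
from collections import Counter

docs = [["lsi", "semantic", "indexing"],
        ["bert", "neural", "ranking"],
        ["semantic", "search", "ranking"]]
k1, b = 1.5, 0.75
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))   # document frequency per term

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25(["semantic", "ranking"], d) for d in docs]
best = max(range(N), key=scores.__getitem__)
print(best)  # 2 — the only document matching both query terms
```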

Our complete hybrid system was evaluated on 30 academic queries using NDCG@10 with JudgeBlender relevance assessments. The field-weighted LSI configuration with BERT-enhanced indexing achieved the highest performance, with a mean NDCG@10 of 0.9716, outperforming traditional baselines including BM25 (0.9581) and the basic LSI variants. The BERT re-ranking component, fine-tuned on 100,000 MS MARCO query-abstract pairs, achieved accuracy, precision, and recall above 0.986 on semantic similarity assessment.
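NDCG@10, the metric reported above, divides the discounted cumulative gain of the returned ranking by that of the ideal (best-possible) ranking; a minimal implementation with illustrative relevance labels:

```python
# NDCG@k: DCG of the ranking, normalized by the DCG of the ideal ordering.
import math

def dcg(rels):
    # Gain of each result discounted by log2 of its (1-based) rank + 1.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom else 0.0

# Relevance labels of the top results, in retrieved order (illustrative values).
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2]), 4))  # 0.9608
```

A score of 1.0 means the system returned the documents in the ideal relevance order; the mean over all 30 queries gives the figures reported above.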