C.A. Micchelli, W.L. Miranker
Journal of the ACM
Widespread adoption of scRNA-seq technologies has produced large volumes of published literature and datasets in which important gene- and cell-specific information may be hidden in both unstructured and structured formats. We introduce a multi-agent framework that extends traditional search and retrieval of relevant information beyond direct relations between entities. Our modular LangGraph-based framework coordinates multiple agents and tools to answer queries about cells and genes by drawing on multiple sources, even when no single source contains a direct answer.
The agentic framework includes: (1) a question parser agent, which handles user queries with a Large Language Model (LLM); (2) a planner agent, which selects the agents needed to satisfy the user query; (3) a literature search agent, a broad-scope entity search strategy coupling the open-source IBM Deep Search toolkit for document search, LLMs with Retrieval-Augmented Generation (RAG), and biomedical text annotators to combinatorially extract entities and relations from text; and (4) an experiment search tool, which retrieves gene information from experimental datasets (e.g., CellXGene). Finally, charts and reports are generated by combining the results.
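The control flow described above (parse, plan, dispatch to the selected agents, then combine) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the framework itself: it does not use LangGraph, the LLM and search calls are replaced by placeholder functions, and every name (`QueryState`, `parse_question`, `plan_agents`, `AGENTS`, `run_pipeline`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QueryState:
    """Shared state passed from one agent to the next (hypothetical schema)."""
    question: str
    entities: Dict[str, str] = field(default_factory=dict)
    plan: List[str] = field(default_factory=list)
    results: Dict[str, List[str]] = field(default_factory=dict)
    report: str = ""


def parse_question(state: QueryState) -> QueryState:
    # Placeholder for the LLM-based question parser: a naive keyword match.
    text = state.question.lower()
    if "marker" in text:
        state.entities["intent"] = "gene_markers"
    for cell_type in ("enteroendocrine", "tuft"):
        if cell_type in text:
            state.entities["cell_type"] = cell_type
    return state


def plan_agents(state: QueryState) -> QueryState:
    # Placeholder planner: run both searches only when a cell type was found.
    if "cell_type" in state.entities:
        state.plan = ["literature_search", "experiment_search"]
    return state


def literature_search(state: QueryState) -> QueryState:
    # Stand-in for document search + RAG + annotators; returns no real data.
    cell_type = state.entities["cell_type"]
    state.results["literature_search"] = [f"<literature hit for {cell_type}>"]
    return state


def experiment_search(state: QueryState) -> QueryState:
    # Stand-in for a query against an experiment dataset; returns no real data.
    cell_type = state.entities["cell_type"]
    state.results["experiment_search"] = [f"<dataset hit for {cell_type}>"]
    return state


# Registry the planner's output is resolved against.
AGENTS: Dict[str, Callable[[QueryState], QueryState]] = {
    "literature_search": literature_search,
    "experiment_search": experiment_search,
}


def build_report(state: QueryState) -> QueryState:
    # Combine the per-agent results into one plain-text report.
    lines = [f"Question: {state.question}"]
    for name in state.plan:
        for item in state.results.get(name, []):
            lines.append(f"[{name}] {item}")
    state.report = "\n".join(lines)
    return state


def run_pipeline(question: str) -> QueryState:
    state = plan_agents(parse_question(QueryState(question)))
    for name in state.plan:
        state = AGENTS[name](state)
    return build_report(state)
```

Calling `run_pipeline("What are known gene markers of Tuft cells?")` returns a state whose `plan` lists both search steps and whose `report` concatenates their placeholder results. Keeping the agents in a registry keyed by name is one way to let a planner select a subset without changing the driver loop.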
Experiments consisted of querying known gene markers of human intestinal cells (e.g., Enteroendocrine and Tuft cells) with our agents, mining PubMed-Central and CellXGene, and comparing the output against results from our own independent lab experiments. All agents were evaluated independently for consistency and accuracy while testing multiple prompting strategies with open-source LLMs (Llama3, DeepSeek). Initial results indicate that LLM performance is task-dependent and highlight strengths (e.g., entity-relation extraction) and limitations (e.g., gene alias expansion) where additional LLM domain adaptation is needed. Future work will introduce agents that leverage biomedical models, knowledge graphs, and other bioinformatics tools.
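One way such an evaluation could be scored is sketched below. The abstract does not specify the metrics used, so the choices here are assumptions: accuracy is measured as precision and recall of extracted gene symbols against a reference marker set, and consistency as the mean pairwise Jaccard similarity across repeated runs of the same prompt. The function names and the gene symbols in the usage note are placeholders, not results.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def normalize(genes: Iterable[str]) -> Set[str]:
    # Case-insensitive comparison of symbols; no alias expansion is attempted.
    return {g.strip().upper() for g in genes if g.strip()}


def precision_recall(predicted: Iterable[str],
                     reference: Iterable[str]) -> Tuple[float, float]:
    """Accuracy of one agent answer against a reference marker set."""
    pred, ref = normalize(predicted), normalize(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


def mean_pairwise_jaccard(runs: List[Iterable[str]]) -> float:
    """Consistency across repeated runs: 1.0 means identical answers."""
    sets = [normalize(run) for run in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

For example, with placeholder symbols, `precision_recall(["gene_a", "GENE_B", "GENE_X"], ["GENE_A", "GENE_B", "GENE_C", "GENE_D"])` gives a precision of 2/3 and a recall of 1/2. Because `normalize` only uppercases symbols, two aliases of the same gene would be counted as different entries, which is the kind of gap that alias expansion would need to close.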