RAG Citation: Enhancing RAG with Automatic Citations

Rahul Anand
3 min read · Oct 1, 2024


Introduction

In the age of large language models (LLMs) and AI-generated content, ensuring the credibility of generated information is critical. While Retrieval-Augmented Generation (RAG) helps boost the quality of AI outputs by retrieving relevant documents, it often lacks a mechanism to provide sources or citations for the generated text.

Enter RAG Citation — a tool that augments RAG pipelines by introducing automatic citation generation, without relying on large language models. This innovative, non-LLM approach ensures that users receive accurate and credible answers, with relevant sources cited for reference.

🔗 Check it out on:
PyPI: https://pypi.org/project/rag-citation/
GitHub: https://github.com/rahulanand1103/rag-citation

Key Features of RAG Citation

  1. Non-LLM Approach: Unlike most citation tools, RAG Citation does not depend on LLMs. It relies on efficient algorithms and natural language processing (NLP) techniques, making it fast and lightweight, which is ideal for projects that prioritize speed and efficiency.
  2. Semantic Search: Instead of merely matching keywords, RAG Citation performs semantic search, finding documents based on the meaning and context of the query. This makes citations more relevant and meaningful to the answers provided.
  3. Named Entity Recognition (NER): The tool extracts key entities such as people, dates, organizations, and numbers (e.g., MONEY, DATE, ORDINAL), improving the accuracy and clarity of citations by directly linking relevant entities to the source documents.
  4. Flexible Integration: RAG Citation can seamlessly integrate into any RAG pipeline, making it highly adaptable to different use cases, whether you're building a search engine, chatbot, or research tool.
  5. Hallucination Detection (Beta): A beta feature flags potential hallucinations: it checks whether entities such as DATE, MONEY, or QUANTITY in the generated text also appear in the retrieved documents, surfacing mismatches and improving trust in the output (see the sketch after this list).
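
Conceptually, a minimal sketch of that entity-overlap check (an illustration of the idea, not the library's internal code) might look like this:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is already installed

def find_unsupported_entities(answer, documents):
    """Return answer entities (DATE, MONEY, QUANTITY, ...) that appear in no document."""
    joined_docs = " ".join(documents).lower()
    missing = []
    for ent in nlp(answer).ents:
        if ent.label_ in {"DATE", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"}:
            if ent.text.lower() not in joined_docs:
                missing.append(ent.text)
    return missing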

Quickstart Guide

To get started with RAG Citation, follow the steps below:

Installation

You can install RAG Citation directly from PyPI using pip:

pip install rag-citation
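
Depending on your environment, you may also need to download the spaCy model used for NER yourself (this is an assumption; the library may handle it for you):

python -m spacy download en_core_web_sm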

Example Usage

Here’s a basic example of how to use the RAG Citation library:

from rag_citation import CiteItem, Inference
import uuid

# Sample documents retrieved from a vector database or semantic search
documents = [
    "Elon Musk CEO, Tesla $221.6B Real Time Net Worth as of 8/6/24...",
    "As of August 2024, Forbes estimates Musk's net worth to be US$241 billion..."
]

# Sample answer generated by an LLM
answer = "Elon Musk's net worth is estimated to be US$241 billion as of August 2024."

# Helper function to generate a unique identifier for each source document
def generate_uuid():
    return str(uuid.uuid4())

# Format documents into the context structure expected by CiteItem
def format_document(documents):
    context = []
    for document in documents:
        context.append({
            "source_id": generate_uuid(),
            "document": document,
            "meta": [{"meta-data": "some-info"}],
        })
    return context

context = format_document(documents)
cite_item = CiteItem(answer=answer, context=context)

# Initialize the Inference model with spaCy and embedding model sizes
inference = Inference(spacy_model="sm", embedding_model="md")

# Generate citations, hallucination flags, and missing entities
output = inference(cite_item)

print("------ Citation ------")
print(output.citation)
print("------ Hallucination ------")
print(output.hallucination)
print("------ Missing Entities ------")
print(output.missing)

Understanding the Output

The output from RAG Citation includes three key components:

  • Citation: Contains the actual citations, linking the generated answer with the source document, recognized entities like MONEY or DATE, and metadata like URLs.
    Example:
{
  "answer_sentences": "Elon Musk's net worth is estimated to be US$241 billion as of August 2024.",
  "cite_document": [
    {
      "document": "As of August 2024, Forbes estimates Musk's net worth to be US$241 billion...",
      "source_id": "23d1f1f0-2afa-4749-8639-78ec685fd837",
      "entity": [
        {"word": "US$241 billion", "entity_name": "MONEY"},
        {"word": "August 2024", "entity_name": "DATE"}
      ],
      "meta": [
        {"url": "https://www.forbes.com/profile/elon-musk/"}
      ]
    }
  ]
}
  • Hallucination: A boolean indicating whether the generated answer contains hallucinated entities (i.e., entities that cannot be found in the source document).
    Example: False

  • Missing Entities: Lists any entities (like MONEY or DATE) that were mentioned in the answer but couldn't be found in the source documents.
    Example: ["US$241 billion"]
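
As an illustrative sketch (assuming output.citation is a list of entries shaped like the example above), you could collect the cited source URLs and append them to the answer:

# Illustrative only: walk the citation entries and collect source URLs.
cited_urls = []
for entry in output.citation:
    for doc in entry.get("cite_document", []):
        for meta in doc.get("meta", []):
            if "url" in meta:
                cited_urls.append(meta["url"])

print(f"{answer}\nSources: {', '.join(cited_urls)}")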

Configuring the Inference Model

RAG Citation is flexible when it comes to configuration. The Inference class supports different models for both named entity recognition (NER) and sentence embeddings.

  • spaCy Models: Used for NER. Pass "sm" for en_core_web_sm, "md" for en_core_web_md, or "lg" for en_core_web_lg.
  • Sentence Embedding Models: SentenceTransformers models used for semantic similarity matching. Pass "sm" for GIST-small, "md" for GIST-medium, or "lg" for GIST-large.
  • Similarity Threshold: You can adjust the similarity threshold (default: 0.88) to control how strict or lenient the matching is (see the example below).
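
A configuration sketch, based on the constructor arguments shown in the quickstart. The threshold keyword in the commented line is an assumption, not a confirmed parameter name, so check the project README before relying on it:

from rag_citation import Inference

# Larger spaCy and embedding models trade speed for accuracy.
inference = Inference(spacy_model="lg", embedding_model="lg")

# Hypothetical: if the constructor exposes the similarity threshold, a value
# closer to 1.0 would keep only near-exact matches. Parameter name assumed.
# inference = Inference(spacy_model="lg", embedding_model="lg", threshold=0.92)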
