RAG Citation: Enhancing RAG with Automatic Citations
Introduction
In the age of large language models (LLMs) and AI-generated content, ensuring the credibility of generated information is critical. While Retrieval-Augmented Generation (RAG) helps boost the quality of AI outputs by retrieving relevant documents, it often lacks a mechanism to provide sources or citations for the generated text.
Enter RAG Citation, a tool that adds automatic citation generation to RAG pipelines without relying on large language models. This non-LLM approach attaches the relevant retrieved sources to each generated answer, so users can check the answer's claims against the documents they came from.
🔗 Check it out on:
PyPI: https://pypi.org/project/rag-citation/
GitHub: https://github.com/rahulanand1103/rag-citation
Key Features of RAG Citation
- Non-LLM Approach: Unlike most citation tools, RAG Citation does not depend on LLMs. It leverages efficient algorithms and natural language processing (NLP) techniques, making it faster and lightweight, ideal for projects prioritising speed and efficiency.
- Semantic Search: Instead of merely matching keywords, RAG Citation performs semantic search, finding documents based on the meaning and context of the query. This makes citations more relevant to the answers provided.
- Named Entity Recognition (NER): The tool extracts key entities such as people, dates, organizations, and numbers (e.g., MONEY, DATE, ORDINAL), improving the accuracy and clarity of citations by directly linking relevant entities to the source documents.
- Flexible Integration: RAG Citation can integrate into any RAG pipeline, making it adaptable to different use cases, whether you're building a search engine, chatbot, or research tool.
- Hallucination Detection (Beta): As a beta feature, RAG Citation can flag potential hallucinations. It checks whether entities such as DATE, MONEY, or QUANTITY in the generated text are missing from the retrieved documents, and reports the mismatches so you can judge how far to trust the output.
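The entity check behind hallucination detection can be illustrated with a minimal sketch. This is not RAG Citation's actual implementation: the function name find_unsupported_entities is hypothetical, the entities are assumed to have been extracted already by an NER step, and matching is a simple case-insensitive substring test. The "US$300 billion" value is made up purely to show a mismatch.

```python
from typing import Dict, List

# Entity labels the article names for the beta hallucination check.
CHECKED_LABELS = {"DATE", "MONEY", "QUANTITY"}


def find_unsupported_entities(
    answer_entities: List[Dict[str, str]],
    documents: List[str],
) -> List[str]:
    """Return answer entities of a checked type that appear in no document.

    Hypothetical helper for illustration; the library may match differently.
    """
    lowered_docs = [doc.lower() for doc in documents]
    missing = []
    for entity in answer_entities:
        if entity["entity_name"] not in CHECKED_LABELS:
            continue  # only the checked entity types are compared
        word = entity["word"].lower()
        if not any(word in doc for doc in lowered_docs):
            missing.append(entity["word"])
    return missing


documents = [
    "Elon Musk CEO, Tesla $221.6B Real Time Net Worth as of 8/6/24...",
    "As of August 2024, Forbes estimates Musk's net worth to be US$241 billion...",
]

# Entities assumed to come from an NER pass over the generated answer.
supported = [
    {"word": "US$241 billion", "entity_name": "MONEY"},
    {"word": "August 2024", "entity_name": "DATE"},
]
# A made-up figure that no retrieved document contains.
unsupported = [{"word": "US$300 billion", "entity_name": "MONEY"}]

print(find_unsupported_entities(supported, documents))
print(find_unsupported_entities(unsupported, documents))
```

An empty result means every checked entity in the answer was found in at least one retrieved document; a non-empty result is a candidate hallucination worth reviewing.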
Quickstart Guide
To get started with RAG Citation, follow the steps below:
Installation
You can install RAG Citation directly from PyPI using pip:
pip install rag-citation
Example Usage
Here’s a basic example of how to use the RAG Citation library:
from rag_citation import CiteItem, Inference
import uuid
# Sample documents from a vector database or semantic search
documents = [
"Elon Musk CEO, Tesla $221.6B Real Time Net Worth as of 8/6/24...",
"As of August 2024, Forbes estimates Musk's net worth to be US$241 billion..."
]
# Sample answer generated by an LLM
answer = "Elon Musk's net worth is estimated to be US$241 billion as of August 2024."
# Helper function to generate a unique identifier
def generate_uuid():
    return str(uuid.uuid4())

# Format documents for citation input
def format_document(documents):
    context = []
    for document in documents:
        context.append({
            "source_id": generate_uuid(),
            "document": document,
            "meta": [{"meta-data": "some-info"}],
        })
    return context

context = format_document(documents)
cite_item = CiteItem(answer=answer, context=context)
# Initialize the Inference model with spaCy and embedding models
inference = Inference(spacy_model="sm", embedding_model="md")
# Generate citations, hallucination flags, and missing entities
output = inference(cite_item)
print("------ Citation ------")
print(output.citation)
print("------ Hallucination ------")
print(output.hallucination)
print("------ Missing Entities ------")
print(output.missing)
Understanding the Output
The output from RAG Citation includes three key components:
- Citation: Contains the citations themselves. Each entry links a sentence of the generated answer to its source document, the recognized entities (such as MONEY or DATE), and metadata such as URLs.
Example:
{
    "answer_sentences": "Elon Musk's net worth is estimated to be US$241 billion as of August 2024.",
    "cite_document": [
        {
            "document": "As of August 2024, Forbes estimates Musk's net worth to be US$241 billion...",
            "source_id": "23d1f1f0-2afa-4749-8639-78ec685fd837",
            "entity": [
                {"word": "US$241 billion", "entity_name": "MONEY"},
                {"word": "August 2024", "entity_name": "DATE"}
            ],
            "meta": [
                {"url": "https://www.forbes.com/profile/elon-musk/"}
            ]
        }
    ]
}
- Hallucination: A boolean indicating whether the generated answer contains hallucinated entities (i.e., entities that cannot be found in the source document).
Example: False
- Missing Entities: Lists any expected entities (like MONEY or DATE) that were mentioned in the answer but couldn't be found in the source.
Example: ["US$241 billion"]
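To show how the citation component might be consumed downstream, here is a small sketch that turns citation entries into a readable footnote-style string. It assumes the citation output is a list of dictionaries shaped like the example entry above; the helper name render_citations is hypothetical and not part of the library.

```python
from typing import Any, Dict, List


def render_citations(citations: List[Dict[str, Any]]) -> str:
    """Render each answer sentence followed by its numbered sources.

    Assumes entries use the keys from the article's example:
    answer_sentences, cite_document, document, source_id, entity, meta.
    """
    lines = []
    for entry in citations:
        lines.append(entry["answer_sentences"])
        for number, doc in enumerate(entry.get("cite_document", []), start=1):
            # Prefer a URL from the metadata; fall back to the source ID.
            urls = [m["url"] for m in doc.get("meta", []) if "url" in m]
            source = urls[0] if urls else doc["source_id"]
            entities = ", ".join(e["word"] for e in doc.get("entity", []))
            lines.append(f"  [{number}] {source} (matched: {entities})")
    return "\n".join(lines)


# The example entry from the article, wrapped in a list.
sample_citations = [
    {
        "answer_sentences": "Elon Musk's net worth is estimated to be US$241 billion as of August 2024.",
        "cite_document": [
            {
                "document": "As of August 2024, Forbes estimates Musk's net worth to be US$241 billion...",
                "source_id": "23d1f1f0-2afa-4749-8639-78ec685fd837",
                "entity": [
                    {"word": "US$241 billion", "entity_name": "MONEY"},
                    {"word": "August 2024", "entity_name": "DATE"},
                ],
                "meta": [{"url": "https://www.forbes.com/profile/elon-musk/"}],
            }
        ],
    }
]

rendered = render_citations(sample_citations)
print(rendered)
```

Falling back to source_id when no URL is present keeps every citation traceable, even for documents whose metadata carries no link.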
Configuring the Inference Model
RAG Citation is flexible when it comes to configuration. The Inference class supports different models for both named entity recognition (NER) and sentence embeddings.
- spaCy Models: These are used for NER and can be configured based on your needs: "sm" for en_core_web_sm, "md" for en_core_web_md, or "lg" for en_core_web_lg.
- Sentence Embedding Models: These models from SentenceTransformers are used for semantic similarity matching: "sm" for GIST-small, "md" for GIST-medium, or "lg" for GIST-large.
- You can adjust the similarity threshold (default: 0.88) to control how strict or lenient the matching process is.
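The effect of the similarity threshold can be sketched with plain Python. This assumes cosine similarity over embedding vectors and a "greater than or equal" comparison, which is a common choice but not confirmed by the library's documentation; the function names and the toy 3-dimensional vectors are illustrative stand-ins for real sentence embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_above_threshold(
    answer_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    threshold: float = 0.88,  # default value quoted in the article
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs whose score meets the threshold."""
    scored = [(i, cosine_similarity(answer_vec, v)) for i, v in enumerate(doc_vecs)]
    return [(i, score) for i, score in scored if score >= threshold]


# Toy vectors standing in for real sentence embeddings.
answer_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.9, 0.1, 0.0],  # nearly parallel to the answer vector
    [0.5, 0.5, 0.5],  # loosely related
    [0.0, 1.0, 0.0],  # orthogonal, i.e. unrelated
]

strict = matches_above_threshold(answer_vec, doc_vecs)        # threshold 0.88
lenient = matches_above_threshold(answer_vec, doc_vecs, 0.5)  # looser match

print([i for i, _ in strict])
print([i for i, _ in lenient])
```

A higher threshold yields fewer but more closely matched citations; lowering it admits looser matches at the risk of citing documents that only partly support the answer.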