Building RAG Pipelines for Threat Intelligence
Threat intelligence analysis requires correlating disparate data sources: CVE databases, dark web forum discussions, internal telemetry logs, and open-source indicators of compromise. Traditional keyword-based search fails to capture semantic relationships — a query for "Log4j exploitation" should surface discussions about JNDI injection, RCE payloads, and Minecraft server attacks even when the specific term is absent.
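As a toy illustration of why embedding retrieval helps here, cosine similarity over dense vectors can rank a JNDI-injection discussion above an unrelated document even when neither contains the literal query term. The three-dimensional vectors below are invented for illustration; real embeddings from a model like all-MiniLM-L6-v2 have 384 dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-d "embeddings" standing in for real model output
query_vec = [0.9, 0.1, 0.2]   # "Log4j exploitation"
jndi_vec  = [0.8, 0.2, 0.3]   # forum post on JNDI injection / RCE payloads
unrelated = [0.1, 0.9, 0.1]   # post about phishing kits

# The semantically related post scores higher despite zero keyword overlap
assert cosine(query_vec, jndi_vec) > cosine(query_vec, unrelated)
```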
The RAG Architecture
Retrieval-Augmented Generation (RAG) enhances LLM outputs by injecting relevant context from external knowledge bases. Our pipeline uses sentence-transformers/all-MiniLM-L6-v2 for embedding generation, ChromaDB for vector storage with metadata filtering, and Llama 3.1 70B for inference. The critical optimization is chunking strategy: CVE descriptions are embedded whole, forum posts are split by thread boundaries, and log entries are grouped by session identifiers.
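The source-aware chunking described above can be sketched as a small dispatcher. The record fields (`source`, `id`, `thread_id`, `session_id`, `text`) are illustrative assumptions, not the pipeline's actual schema:

```python
from collections import defaultdict

def chunk_documents(records: list[dict]) -> list[dict]:
    """Group raw records into retrieval chunks by source type.

    CVE descriptions stay whole; forum posts are grouped per thread;
    log entries are grouped per session identifier.
    """
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for rec in records:
        if rec["source"] == "cve":
            key = ("cve", rec["id"])             # one chunk per CVE
        elif rec["source"] == "forum":
            key = ("forum", rec["thread_id"])    # one chunk per thread
        else:
            key = ("log", rec["session_id"])     # one chunk per session
        groups[key].append(rec)
    return [
        {"source": key[0], "key": key[1],
         "text": "\n".join(r["text"] for r in recs)}
        for key, recs in groups.items()
    ]
```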
```python
# Retrieval pipeline architecture
import chromadb
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer


class ThreatIntelRAG:
    def __init__(self):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        # Documents in the collection carry source/date/confidence metadata
        client = chromadb.PersistentClient(path="./threat_intel_db")
        self.vector_store = client.get_or_create_collection(name="threat_intel")
        self.llm = Llama(model_path="llama-3.1-70b.gguf")

    def query(self, question: str, k: int = 5) -> str:
        # Embed the analyst's question
        query_embedding = self.embedder.encode(question).tolist()
        # Retrieve top-k chunks, restricted to recent documents
        # (date stored as a YYYYMMDD integer so range filters apply)
        results = self.vector_store.query(
            query_embeddings=[query_embedding],
            n_results=k,
            where={"date": {"$gte": 20240101}},
        )
        contexts = results["documents"][0]
        # Augment the prompt with retrieved context
        prompt = self.build_rag_prompt(question, contexts)
        return self.llm(prompt, max_tokens=1024)["choices"][0]["text"]

    def build_rag_prompt(self, question: str, contexts: list[str]) -> str:
        context_block = "\n---\n".join(contexts)
        return (
            "Use the threat intelligence below to answer.\n\n"
            f"{context_block}\n\nQuestion: {question}\nAnswer:"
        )
```
Data Ingestion Sources
Our ingestion pipeline processes four primary feeds: (1) NVD CVE JSON feeds updated hourly with CVSS scoring and CPE configurations, (2) Tor hidden services scraped via Selenium with JavaScript execution for forum content, (3) GreyNoise API for IP reputation and scan observations, and (4) internal SIEM alerts normalized to OCSF schema. Each source is tagged with provenance metadata for retrieval filtering.
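Provenance tagging can be sketched as normalization into a single schema whose metadata fields back the retrieval filters. The schema and the NVD field paths below are simplified assumptions, not the feeds' actual layouts:

```python
from dataclasses import dataclass

@dataclass
class IntelDocument:
    """Normalized document with provenance metadata for retrieval filtering."""
    text: str
    source: str        # e.g. "nvd", "tor_forum", "greynoise", "siem"
    date: int          # YYYYMMDD integer so range filters work in the vector store
    confidence: float  # feed-derived or analyst-assigned, 0..1

def normalize_nvd(cve_record: dict) -> IntelDocument:
    """Map one (simplified) NVD CVE record into the common schema."""
    return IntelDocument(
        text=cve_record["description"],
        source="nvd",
        date=int(cve_record["published"].replace("-", "")),
        confidence=1.0,  # authoritative feed
    )
```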
"The correlation engine identified 340% more related IOCs compared to keyword search alone. Semantic retrieval surfaced attack patterns from forum discussions that never explicitly mentioned the CVE identifier but described identical TTPs."
Evaluation and Metrics
Measuring RAG quality in threat intelligence requires domain-specific metrics. We evaluate on three axes: (1) attack technique coverage — does the response cite the relevant MITRE ATT&CK techniques? (2) IOC precision — are the extracted indicators actually related to the threat? (3) temporal accuracy — does the analysis respect disclosure timelines and patch availability dates? Human analysts score responses on a 5-point Likert scale; the current pipeline averages 4.2 for relevance.
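The first two metrics reduce to set arithmetic over extracted versus analyst-confirmed items. The functions below are a minimal sketch of that scoring, not the evaluation harness itself:

```python
def ioc_precision(extracted: set[str], confirmed: set[str]) -> float:
    """Fraction of extracted indicators confirmed related to the threat."""
    if not extracted:
        return 0.0
    return len(extracted & confirmed) / len(extracted)

def technique_coverage(cited: set[str], relevant: set[str]) -> float:
    """Fraction of relevant MITRE ATT&CK technique IDs the response cites."""
    if not relevant:
        return 1.0
    return len(cited & relevant) / len(relevant)

# Example: one of two extracted IOCs confirmed -> precision 0.5
assert ioc_precision({"1.2.3.4", "evil.example"}, {"1.2.3.4"}) == 0.5
```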
The system runs inference on local GPU infrastructure to prevent exfiltration of sensitive IOCs. For classified environments, we deploy air-gapped instances with embedding models fine-tuned on internal threat reports. The complete pipeline, from ingestion to query response, processes 10,000 documents per hour with p99 query latency under 2 seconds.
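Tracking a p99 latency target like this comes down to taking a high quantile over per-query timings. A minimal sketch using the standard library (the sample values are invented):

```python
import statistics

def p99_latency(latencies_s: list[float]) -> float:
    """99th-percentile latency from per-query timings, in seconds."""
    # quantiles(n=100) yields 99 cut points; the last is the 99th percentile
    return statistics.quantiles(latencies_s, n=100)[-1]

# Invented sample: 1,000 sub-second queries plus a handful of slow outliers
samples = [0.4] * 1000 + [5.0] * 5
assert p99_latency(samples) < 2.0  # SLO check: p99 under 2 seconds
```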