Version 0.2.0

Quira Documentation

The high-performance Retrieval-Augmented Generation (RAG) framework built from the ground up for token efficiency and zero perceived latency.

Installation

Quira is distributed via PyPI. We highly recommend installing the [all] variant, which automatically pulls in the official client libraries for our supported vector databases and LLM providers.

Terminal

$ pip install "quira[all]"

If you prefer a lightweight installation and want to manage dependencies yourself, use pip install quira.

Speculative Retrieval

Standard RAG pipelines suffer from high latency because retrieval happens sequentially after the user submits their query. Network calls to vector databases (like Pinecone or Qdrant) can take anywhere from 200ms to over 500ms.

How it works

Quira monitors typing activity in your UI. It debounces keystrokes and speculatively searches the database while the user is still typing. If a speculative search has finished by the time the user presses "Enter", the relevant chunks are already loaded in local memory, so the retrieval round trip adds no perceived latency.
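
The debounce-then-prefetch idea can be sketched with asyncio. This is a minimal illustration under stated assumptions, not Quira's actual implementation: SpeculativePrefetcher, on_keystroke, on_submit, and the 150 ms debounce window are hypothetical names and values, and search stands in for any async vector-store query.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

SearchFn = Callable[[str], Awaitable[List[str]]]


class SpeculativePrefetcher:
    """Hypothetical helper: debounce keystrokes, then prefetch for the partial query."""

    def __init__(self, search: SearchFn, debounce_s: float = 0.15) -> None:
        self._search = search
        self._debounce_s = debounce_s
        self._pending: Optional[asyncio.Task] = None
        self._pending_text: Optional[str] = None
        self.cache: Dict[str, List[str]] = {}

    def on_keystroke(self, text: str) -> None:
        # A new keystroke cancels the previous timer, so only a pause
        # in typing lets a speculative search reach the database.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending_text = text
        self._pending = asyncio.get_running_loop().create_task(self._debounced(text))

    async def _debounced(self, text: str) -> None:
        await asyncio.sleep(self._debounce_s)
        self.cache[text] = await self._search(text)

    async def on_submit(self, text: str) -> List[str]:
        # If the speculative search for this exact text is in flight, wait for it.
        if self._pending is not None and self._pending_text == text:
            await self._pending
        if text in self.cache:
            return self.cache[text]  # prefetched while the user was typing
        return await self._search(text)  # fall back to a normal search
```

In this sketch, a submit that matches a completed speculative search is served from the in-memory cache, and any other submit falls back to an ordinary search, so correctness never depends on the guess being right.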

Context Tetris

Language models have strict context window limits. Instead of blindly passing the top-K retrieved chunks (which often leads to repetitive or irrelevant context), Quira employs a dynamic scoring algorithm. It intelligently packs the most valuable chunks into your remaining token budget based on four strict dimensions.

1. Relevance

Standard cosine similarity between the embedded query and the document chunks.

2. Recency

A temporal decay function. Newer chunks in the conversation history are penalized less than older chunks.

3. Density

Chunks with a high concentration of specific keywords related to the query receive a significant point boost.

4. Uniqueness

Semantically identical chunks are heavily penalized, preventing the LLM from reading the same information twice.
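
The four dimensions above can be combined into a simple greedy packer. The sketch below is illustrative only: the Chunk fields, the multiplicative score, the half-life decay, and the 0.95 duplicate threshold are assumptions made for this example, and duplicates are skipped outright as the strongest form of penalty. Quira's real scoring formula is not specified here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Chunk:
    text: str
    embedding: Sequence[float]
    age_turns: int  # how many turns ago the chunk entered the conversation
    tokens: int     # token cost of including the chunk


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def density(text: str, keywords: Set[str]) -> float:
    # Fraction of words in the chunk that are query keywords.
    words = text.lower().split()
    return sum(w in keywords for w in words) / len(words) if words else 0.0


def pack(chunks: List[Chunk], query_emb: Sequence[float], keywords: Set[str],
         budget: int, half_life: float = 4.0, dup_threshold: float = 0.95) -> List[Chunk]:
    def base_score(c: Chunk) -> float:
        relevance = cosine(query_emb, c.embedding)
        recency = 0.5 ** (c.age_turns / half_life)  # exponential temporal decay
        return relevance * recency * (1.0 + density(c.text, keywords))

    selected: List[Chunk] = []
    remaining = budget
    for c in sorted(chunks, key=base_score, reverse=True):
        # Uniqueness: drop chunks that nearly duplicate something already packed.
        if any(cosine(c.embedding, s.embedding) >= dup_threshold for s in selected):
            continue
        if c.tokens <= remaining:
            selected.append(c)
            remaining -= c.tokens
    return selected
```

Greedy packing by score is not guaranteed to be optimal for the token budget (the underlying problem is a knapsack variant), but it is cheap and keeps the highest-scoring chunks first.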

Differential Context

Quira maintains an internal conversational state. During a continuous session, the model already "knows" what it read in previous turns. Instead of fetching the entire knowledge base again, Quira computes a semantic delta and only fetches new information that hasn't been discussed yet. This drastically cuts down on redundant database hits and LLM token usage.
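
One way to picture a semantic delta is a per-session filter that remembers the embeddings of chunks already shown to the model and passes through only candidates that are not close to any of them. The sketch below assumes exactly that: SessionDelta, filter_new, and the 0.9 similarity threshold are hypothetical, and the filter runs after retrieval, whereas the description above suggests Quira avoids the fetch itself.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SessionDelta:
    """Hypothetical per-session state tracking what the model has already read."""

    def __init__(self, threshold: float = 0.9) -> None:
        self._seen: List[Sequence[float]] = []
        self._threshold = threshold

    def filter_new(self, candidates: List[Tuple[str, Sequence[float]]]) -> List[str]:
        fresh: List[str] = []
        for text, emb in candidates:
            # Keep a chunk only if it is not semantically close to anything seen so far.
            if all(_cosine(emb, s) < self._threshold for s in self._seen):
                fresh.append(text)
                self._seen.append(emb)
        return fresh
```

Calling filter_new once per turn means a chunk that was passed to the model in turn one is dropped if a near-identical chunk is retrieved again in turn two.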

Provider Abstraction

Write your pipeline once, deploy it anywhere. Quira features a robust abstraction layer that allows you to swap your entire infrastructure (Vector Stores, Caches, and LLMs) simply by changing a string. No refactoring required.

pipeline.py

from quira import quiraPipeline

# Instantly swap providers without changing your business logic
pipeline = quiraPipeline(
    vector_store="pinecone",        # Or: qdrant, chroma, weaviate
    cache="redis",                  # Or: memory, disk
    llm="anthropic/claude-3-opus",  # Or: openai, groq, ollama
)
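
String-based provider selection of this kind is commonly built on a registry that maps names to factories. The sketch below shows that general pattern and is not Quira's internals: MemoryCache, CACHE_BACKENDS, make_cache, and parse_llm_spec are hypothetical names, and the "provider/model" split is an assumption based on the "anthropic/claude-3-opus" string above.

```python
from typing import Callable, Dict, Optional, Tuple


class MemoryCache:
    """Hypothetical in-process cache backend."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


# Registry mapping provider strings to factories; new backends register here.
CACHE_BACKENDS: Dict[str, Callable[[], MemoryCache]] = {"memory": MemoryCache}


def make_cache(name: str) -> MemoryCache:
    try:
        factory = CACHE_BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(CACHE_BACKENDS))
        raise ValueError(f"unknown cache backend {name!r}; known: {known}") from None
    return factory()


def parse_llm_spec(spec: str) -> Tuple[str, Optional[str]]:
    # "anthropic/claude-3-opus" -> ("anthropic", "claude-3-opus"); "ollama" -> ("ollama", None)
    provider, _, model = spec.partition("/")
    return provider, (model or None)
```

Because business logic only ever sees the object returned by the factory, swapping a backend is a one-string change as long as every backend exposes the same methods.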

Integrations

Quira is designed to be a drop-in upgrade for the frameworks you already use. We provide native wrappers for both LangChain and LlamaIndex.

LangChain Retriever

from quira.integrations import QuiraRetriever

retriever = QuiraRetriever(pipeline=pipeline)
docs = retriever.invoke("What is speculative retrieval?")

LlamaIndex Query Engine

from quira.integrations import QuiraQueryEngine

engine = QuiraQueryEngine(pipeline=pipeline)
response = engine.query("What is context tetris?")