🗺️
How to use this
Six sections in the sidebar, each with 3–4 short pages (~3 min read). Pick a learning path below, jump to a section card, or search topics from the sidebar.

Learning paths

For beginners
Build your first agent in a day
RAG-first
Ground agents in your own data
Self-hosted
Air-gap, zero per-token cost

Reference map

Foundations

The mental model behind every agent — what they are, how they loop, and where memory lives.

Running Locally

Self-hosted inference with Ollama — privacy, zero per-token cost, fully offline-capable.

Cloud APIs

The three frontier providers — current pricing, tradeoffs, and minimal quickstart code.

Agentic Patterns

Reusable shapes for organising an agent's reasoning — pick the one that matches the task.

RAG & Embeddings

Retrieve from your own corpus and feed as context — the antidote to hallucination on private data.

Ecosystem

The major frameworks — when to reach for each, and where they overlap.

When should I use what?

If you need… | Reach for | Why
Cheapest at scale | Gemini 2.5 Flash · GPT-4.1 nano | ~$0.10–0.30 per 1M input. Solid for classification, extraction, simple chat.
Best reasoning | o3 · Claude Opus 4.7 | Multi-step logic, math, code generation that needs to actually run.
Longest context | Gemini 2.5 Pro (1M) · GPT-5.5 (1M) | Whole books, long PDFs, video transcripts.
Privacy / offline | LLaMA 3.1 via Ollama | No data leaves the box. Zero per-token cost. Needs 8GB+ RAM (8B) or 48GB+ VRAM (70B).
Coding agents | Claude Sonnet 4.6 · GPT-5.4 | Sonnet 4.6 edges out on long codebases; GPT-5.4 has tighter tool-use ergonomics.
Structured JSON output | OpenAI · Gemini | Native schema-mode is the most mature. Anthropic is closing the gap.
High-volume tool use | GPT-4.1 · Claude Haiku 4.5 | Best tool-use cost/quality ratio for production agents at scale.

This reference reflects what I'm using in production agents today — not every framework, just the ones worth knowing. Each page below is short on theory and heavy on what actually matters when you ship.

The core idea

A traditional program follows a fixed sequence of steps you define. An AI agent, by contrast, decides at runtime which steps to take. It uses a language model as its "brain" — the LLM reasons over the current context, decides whether to call a tool, and reacts to the tool's output before deciding what to do next.

Think of an agent as: an LLM + a loop + access to tools. The loop runs until the agent believes the task is done (or a stop condition is hit).

💡
Key insight
The fundamental shift: you define the goal, not the procedure. The agent figures out the steps itself by reasoning at each iteration.

Agent vs. LLM call

🗣️ Single LLM Call
  • One prompt → one response
  • No memory across calls
  • No tool access
  • Fixed, linear logic
  • Use for: classification, generation, summarisation
🤖 AI Agent
  • Goal → many LLM calls in a loop
  • Maintains context window across steps
  • Calls tools, APIs, databases
  • Adaptive, branching logic
  • Use for: research, coding, task automation

Four components of every agent

🧠
LLM Brain
The reasoning engine. Reads context, decides next action, interprets tool outputs. Usually GPT-4o, Claude 3.5, Gemini 2.0 or a local LLaMA model.
🔧
Tools
Functions the LLM can call — web search, code execution, database queries, email sending, API calls. Each has a JSON schema the LLM uses to call it.
🧩
Memory
Short-term (context window), long-term (vector DB), episodic (past run summaries). Determines what the agent "knows" about the current task.
🔁
Loop / Orchestration
The controller that runs: observe → think → act → observe again. Can be a simple while-loop or a complex graph (LangGraph, Prefect).

Like a surgeon who decides which instrument to pick up next based on what they see — agents make moment-to-moment decisions grounded in the latest context, not a pre-written script.

Minimal agent in Python

Python
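# Minimal agent loop (concept, not framework-specific)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Map tool names to your own functions (search_tool etc. are defined elsewhere;
# the dictionary keys are placeholder names and must match your schemas)
tools = {"search": search_tool, "calculator": calculator_tool, "email": email_tool}
# tool_schemas: list of JSON schemas, one per tool (also defined elsewhere)

MAX_ITERATIONS = 20  # hard cap so a confused agent cannot loop forever

def call_tool(name: str, arguments: str) -> str:
    # The model sends arguments as a JSON string; decode before calling
    return str(tools[name](**json.loads(arguments)))

def run_agent(goal: str) -> str:
    messages = [{"role": "user", "content": goal}]

    for _ in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tool_schemas,        # JSON schemas for each tool
            tool_choice="auto"
        )
        msg = response.choices[0].message

        if not msg.tool_calls:
            return msg.content   # agent decided it's done

        messages.append(msg)     # keep the assistant's tool-call turn in context
        for tc in msg.tool_calls:
            result = call_tool(tc.function.name, tc.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,   # links the result to the call it answers
                "content": result,
            })

    return "Stopped: max iterations reached"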

Observe → Think → Act

Core Agent Loop
👁️
Observe
Gather context
🧠
Think
LLM reasons
Act
Call tool or respond
📥
Update
Add result to context
↩ Loop back to Observe until goal achieved or stop condition met

Stop conditions

Agents need clear termination logic. Common patterns:

1
LLM decides it's done
The model returns a final answer (no tool call). This is the most common pattern in ReAct-style agents.
2
Max iterations reached
Hard cap on the number of tool calls. Essential safety net — set via max_iterations=20 in LangChain agents.
3
Structured output produced
Agent is instructed to call a special finish() tool with its final structured output, ensuring clean termination and parseable results.
4
External signal / HITL approval
Agent pauses and waits for human confirmation before proceeding. Used in high-stakes pipelines (finance, emails to clients).
⚠️
Infinite loop risk
Always set a max_iterations limit. Without it, a confused agent can spin forever, burning API credits. LangChain's default is 15 iterations.
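
To make the first three stop conditions concrete, here is a minimal sketch with no framework and no real LLM. The `model_step` callable, the two scripted stand-in models, and the shape of the step dictionaries are all assumptions made for illustration; a real agent would get these decisions from an API response.

```python
import json
from typing import Callable

MAX_ITERATIONS = 20  # hard cap, the same idea as max_iterations in a framework


def run_with_stop_conditions(model_step: Callable[[list], dict], goal: str) -> dict:
    """Loop until the model answers, calls finish(), or the cap is hit."""
    history = [{"role": "user", "content": goal}]

    for iteration in range(1, MAX_ITERATIONS + 1):
        step = model_step(history)  # hypothetical: one model decision per call

        # Stop condition 1: plain final answer, no tool call
        if step["type"] == "answer":
            return {"status": "answered", "output": step["content"], "iterations": iteration}

        # Stop condition 3: the special finish() tool carries structured output
        if step["tool"] == "finish":
            return {"status": "finished", "output": json.loads(step["arguments"]),
                    "iterations": iteration}

        # Any other tool call: record it and keep looping
        history.append({"role": "assistant", "content": f"called {step['tool']}"})

    # Stop condition 2: the hard cap fired
    return {"status": "max_iterations", "output": None, "iterations": MAX_ITERATIONS}


def scripted_model(history: list) -> dict:
    """Stand-in for an LLM: searches twice, then calls finish()."""
    if len(history) < 3:
        return {"type": "tool", "tool": "search", "arguments": "{}"}
    return {"type": "tool", "tool": "finish", "arguments": '{"conflict": true}'}


def confused_model(history: list) -> dict:
    """Stand-in for a model that never stops calling tools."""
    return {"type": "tool", "tool": "search", "arguments": "{}"}


result = run_with_stop_conditions(scripted_model, "Any conflicts on March 15?")
stuck = run_with_stop_conditions(confused_model, "Any conflicts on March 15?")
print(result["status"], stuck["status"])
```

The confused model is the case the warning is about: without the cap it would never return.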

How tool calling works

You provide the LLM with a list of tool schemas (JSON). When the model wants to call a tool, it outputs a structured JSON object with the tool name and arguments — you intercept this, run the real function, and feed the result back.

JSON — Tool Schema
{
  "type": "function",
  "function": {
    "name": "search_schedules",
    "description": "Search logistics schedules by date range and carrier",
    "parameters": {
      "type": "object",
      "properties": {
        "start_date": { "type": "string", "description": "ISO 8601 date" },
        "end_date":   { "type": "string" },
        "carrier":    { "type": "string", "enum": ["FedEx","UPS","DHL"] }
      },
      "required": ["start_date"]
    }
  }
}
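
The "intercept, run, feed back" step for the schema above could look roughly like this. It is a sketch under assumptions: the body of `search_schedules` is a placeholder, and `TOOL_REGISTRY` and `dispatch_tool_call` are names invented here, not part of any SDK.

```python
import json
from typing import Optional


def search_schedules(start_date: str, end_date: Optional[str] = None,
                     carrier: Optional[str] = None) -> list:
    """Placeholder implementation; a real one would query your schedule store."""
    return [{"date": start_date, "carrier": carrier or "any"}]


# Registry: tool name -> (function, required argument names from the schema)
TOOL_REGISTRY = {
    "search_schedules": (search_schedules, ["start_date"]),
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Decode the model's arguments, check required fields, run the function."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    func, required = TOOL_REGISTRY[name]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    missing = [key for key in required if key not in args]
    if missing:
        return json.dumps({"error": f"missing required arguments: {missing}"})
    return json.dumps(func(**args))


ok = dispatch_tool_call("search_schedules", '{"start_date": "2025-03-15", "carrier": "DHL"}')
bad = dispatch_tool_call("search_schedules", '{"carrier": "DHL"}')
print(ok)
print(bad)
```

Returning errors as JSON strings instead of raising is a deliberate choice in this sketch: the error text goes back into the context, so the model has a chance to correct its own call on the next iteration.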

Common tool categories

🌐
Web Search
Tavily, SerpAPI, or Bing Search. Gives the agent access to up-to-date information beyond its training cut-off.
💻
Code Execution
Run Python in a sandbox (E2B, Docker). Lets agents do maths, data analysis, chart generation programmatically.
🗃️
Database
Read/write SQLite, Postgres, ChromaDB. Essential for agents that need structured data or persistent memory.
📧
Email / Comms
SMTP, Gmail API, Slack SDK. Used in autonomous pipelines where the agent must respond to or send communications.
📁
File System
Read/write local files. Useful for agents that process documents, generate reports, or work with codebases.
🔗
External APIs
Any REST/GraphQL API the agent needs. Weather, stocks, maps, CRMs, ERPs — just wrap the HTTP call.

Four memory tiers

💬
In-Context (Working)
The active conversation window. Fast, with no extra storage to manage, but bounded (and billed per token on hosted APIs). 128k tokens ≈ 100k words of English. Lost when the session ends.
🗃️
Semantic (Vector)
Past knowledge stored as embeddings in ChromaDB, Pinecone, Weaviate. Retrieved by cosine similarity. Survives across runs.
📋
Episodic (Summaries)
Compressed summaries of past sessions injected into new context windows. Lets the agent "remember" without storing raw transcripts.
🔢
Structured (DB)
Hard facts in SQL / key-value stores. Use for: user preferences, completed tasks, schedules, confirmed bookings.

The best agents combine all four — structured DB for facts, vector store for fuzzy recall, episodic summaries for continuity, and the context window for active reasoning.
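
One way to picture the four tiers working together is a function that assembles a single prompt from them. This is a sketch only: `build_context` and its section labels are invented for illustration, and it budgets in characters as a crude stand-in for tokens.

```python
def build_context(facts: dict, retrieved_chunks: list, episode_summary: str,
                  recent_messages: list, max_chars: int = 2000) -> str:
    """Assemble one prompt from the four tiers; characters stand in for tokens."""
    header = "\n\n".join([
        "Known facts:\n" + "\n".join(f"- {k}: {v}" for k, v in facts.items()),    # structured DB
        "Relevant documents:\n" + "\n".join(f"- {c}" for c in retrieved_chunks),  # vector store
        "Previous session summary:\n" + episode_summary,                          # episodic
        "Conversation:\n",                                                        # working memory
    ])
    budget = max_chars - len(header)
    kept = []
    for message in reversed(recent_messages):   # newest first
        cost = len(message) + 1                 # +1 for the joining newline
        if cost > budget:
            break                               # older messages no longer fit
        kept.append(message)
        budget -= cost
    return header + "\n".join(reversed(kept))


context = build_context(
    facts={"preferred_carrier": "DHL"},
    retrieved_chunks=["DHL pickup at 14:00 on March 15"],
    episode_summary="User asked about March schedules.",
    recent_messages=["x" * 500, "Any conflicts on March 15?"],
    max_chars=300,
)
print(context)
```

The durable tiers go in first and the conversation fills whatever room is left, so the oldest messages are the first to be dropped when the window is tight.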

Installation

Shell
# macOS (homebrew)
brew install ollama

# Or download the app from ollama.ai
# Then start the server:
ollama serve

# Pull a model (downloads ~4-8 GB)
ollama pull llama3.1          # Meta's LLaMA 3.1 8B
ollama pull mistral           # Mistral 7B — fast & capable
ollama pull nomic-embed-text  # For embeddings (768-dim)
ollama pull qwen2.5-coder     # Qwen for code tasks

# Test in terminal
ollama run llama3.1 "Explain RAG in one paragraph"

Call from Python

Python
import requests

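def ollama_chat(prompt: str, model: str = "llama3.1") -> str:
    # Non-streaming call to the local Ollama server's generate endpoint
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local models can be slow on first load
    )
    r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return r.json()["response"]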

# Or via LangChain (much easier for agents)
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)
response = llm.invoke("Summarise this email: ...")
Hardware tip
8B models run comfortably on a Mac with 16 GB RAM (Apple Silicon). For 70B models you'll need 64 GB+ RAM or a GPU with VRAM. Check ollama list to see downloaded models.

Stack overview

🦙
Ollama
LLM inference server. Serves llama3.1 + nomic-embed-text via REST.
⛓️
LangChain
Agent orchestration — handles the loop, tool routing, memory injection.
🎨
ChromaDB
Local vector database. Stores and retrieves embeddings by cosine similarity.
🗄️
SQLite
Structured data — schedules, task history, confirmed decisions.
🚀
FastAPI
Optional: HTTP API layer for the agent so it can be triggered by webhooks.

Full setup

Shell
# 1. Install dependencies
pip install langchain langchain-ollama langchain-community
pip install chromadb fastapi uvicorn python-dotenv

# 2. Start Ollama
ollama serve &
ollama pull llama3.1
ollama pull nomic-embed-text
Python — Local RAG Agent
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools.retriever import create_retriever_tool

# LLM + embeddings — both local via Ollama
llm = ChatOllama(model="llama3.1", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Vector store — persisted to disk
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Wrap retriever as a tool
rag_tool = create_retriever_tool(
    retriever,
    name="search_schedules",
    description="Searches logistics schedules and past emails"
)

# Build agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a logistics scheduling assistant."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, [rag_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[rag_tool], verbose=True)

result = executor.invoke({"input": "Any conflicts on March 15?"})
ℹ️
Pricing note
API pricing changes frequently. Always verify at the provider's official pricing page before budgeting a production system. Batch API tiers typically cut costs by ~50%; prompt caching can cut input costs by up to 90% on repeated context.
Model | Provider | Input (per 1M) | Output (per 1M) | Context | Best for
Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 200k | Flagship reasoning, hardest tasks
Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200k | Coding, long-context analysis
Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200k | Fast classification, triage
GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M | Frontier general-purpose, agents
GPT-5.4 | OpenAI | $2.50 | $15.00 | 1M | Production workhorse, balanced
GPT-4.1 | OpenAI | $2.00 | $8.00 | 128k | Tool use, structured output
GPT-4.1 nano | OpenAI | $0.10 | $0.40 | 128k | Cheapest, simple tasks at scale
o3 | OpenAI | $2.00 | $8.00 | 200k | Reasoning: math, code, logic
o4-mini | OpenAI | $0.55 | $2.20 | 200k | Cheap reasoning, high volume
Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M | Massive docs, multimodal
Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M | High-volume multimodal, cheap
LLaMA 3.1 8B | Local (Ollama) | Free | Free | 128k | Privacy, edge, zero per-token cost
LLaMA 3.1 70B | Local (Ollama) | Free | Free | 128k | Local frontier, needs ≥48GB VRAM
Mistral 7B | Local (Ollama) | Free | Free | 32k | Lightweight instruction following


For a production agent handling 10,000 emails/month at ~1,000 input and ~1,000 output tokens each: Claude Sonnet 4.6 costs ~$180/mo vs. GPT-4.1 nano at ~$5/mo vs. local Ollama at $0 (hardware aside). With prompt caching on a fixed system prompt + RAG context, input costs typically drop 70–90%, which makes frontier models more competitive with smaller ones at scale, mainly on input-heavy workloads (output tokens are still billed at the full rate).
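
The monthly figures can be reproduced with a few lines of arithmetic. The sketch below assumes each email uses about 1,000 input tokens and 1,000 output tokens, which is the split that yields the quoted numbers; the prices are copied from the table above, and `monthly_cost` and `input_discount` are illustrative names.

```python
# Prices in USD per 1M tokens, copied from the pricing table
PRICES = {
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "GPT-4.1 nano": {"input": 0.10, "output": 0.40},
}


def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int,
                 input_discount: float = 0.0) -> float:
    """Cost per month in USD; input_discount models prompt caching (0.9 = 90% off input)."""
    price = PRICES[model]
    input_cost = requests * input_tokens / 1_000_000 * price["input"] * (1 - input_discount)
    output_cost = requests * output_tokens / 1_000_000 * price["output"]
    return round(input_cost + output_cost, 2)


sonnet = monthly_cost("Claude Sonnet 4.6", 10_000, 1_000, 1_000)
nano = monthly_cost("GPT-4.1 nano", 10_000, 1_000, 1_000)
sonnet_cached = monthly_cost("Claude Sonnet 4.6", 10_000, 1_000, 1_000, input_discount=0.9)
print(sonnet, nano, sonnet_cached)  # 180.0 5.0 153.0
```

Note that the discount applies to input tokens only, so how much caching helps depends on how input-heavy your workload is. Plug in your own token counts before budgeting.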

Quick start

Python
# Install first (shell): pip install openai

from openai import OpenAI

client = OpenAI(api_key="sk-...")   # or use OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this email..."}]
)
print(response.choices[0].message.content)

Quick start

Python
# Install first (shell): pip install anthropic

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse this contract..."}]
)
print(message.content[0].text)

Quick start

Python
# Install first (shell): pip install google-generativeai

import google.generativeai as genai

genai.configure(api_key="AIza...")

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Explain vector embeddings")
print(response.text)

Thought → Action → Observation cycle

Agent Trace — ReAct
Goal: "Is there a scheduling conflict on March 15th for DHL?"

Thought: I need to search the schedule database for March 15 DHL entries.
Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})
Observation: Found 3 entries. Two overlap: 14:00 pickup & 14:30 delivery.

Thought: There's a 30-minute conflict. I should flag this for review.
Action: create_conflict_alert({"time": "14:00", "type": "overlap"})
Observation: Alert #4821 created, assigned to scheduler queue.

Thought: Task complete. I have the answer and have taken action.
Final Answer: Yes, conflict found at 14:00-14:30 on March 15 (DHL).
             Alert #4821 created for review.
🔬
Why ReAct works
By externalising reasoning as text, the LLM "thinks before it acts." This tends to reduce hallucinated steps and makes debugging much easier, because you can read the full thought trace.
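
If you run ReAct with plain-text output rather than native tool calling, your loop has to parse each model turn into either an action or a final answer. A minimal sketch, assuming the exact `Action: name({...})` and `Final Answer:` line formats used in the trace above (`parse_action` and `next_step` are illustrative names):

```python
import json
import re

# Matches lines like: Action: search_schedules({"start_date": "2025-03-15"})
ACTION_PATTERN = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$")


def parse_action(line: str):
    """Return (tool_name, arguments) for an 'Action:' line, else None."""
    match = ACTION_PATTERN.match(line.strip())
    if match is None:
        return None
    name, raw_args = match.groups()
    arguments = json.loads(raw_args) if raw_args else {}
    return name, arguments


def next_step(model_output: str) -> dict:
    """Find the first actionable line in one model turn."""
    for line in model_output.splitlines():
        line = line.strip()
        if line.startswith("Final Answer:"):
            return {"type": "final", "answer": line[len("Final Answer:"):].strip()}
        action = parse_action(line)
        if action:
            return {"type": "action", "tool": action[0], "arguments": action[1]}
    return {"type": "unparseable"}  # caller should re-prompt or stop


turn = (
    "Thought: I need to search the schedule database for March 15 DHL entries.\n"
    'Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})'
)
step = next_step(turn)
done = next_step("Thought: Task complete.\nFinal Answer: Yes, conflict found.")
print(step)
print(done)
```

This sketch only captures the first line of a multi-line final answer and will raise on malformed JSON arguments; native tool calling avoids most of this parsing.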

Orchestrator + Worker pattern

Multi-Agent Topology
👑
Orchestrator Agent
Routes tasks, aggregates results
🔍
Research Agent
Web search + docs
💻
Coder Agent
Generate + run code
📝
Writer Agent
Compose output

When to use multi-agent

A single agent handles most tasks up to moderate complexity. Add multiple agents when: tasks are parallelisable (research while coding), specialisation improves quality (a dedicated critic agent reviewing a writer agent's output), or context limits are hit (each agent handles a slice of a large document).
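
The parallelisable case can be sketched with the standard library alone. The three worker functions below are stubs standing in for real agents (each would normally wrap its own LLM loop), and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def research_agent(task: str) -> str:
    return f"[research] notes on {task}"      # stub for web search + docs


def coder_agent(task: str) -> str:
    return f"[code] script for {task}"        # stub for generate + run code


def writer_agent(findings: list) -> str:
    return "Report:\n" + "\n".join(findings)  # stub for composing output


def orchestrate(task: str) -> str:
    # Research and coding do not depend on each other, so run them in parallel
    with ThreadPoolExecutor(max_workers=2) as pool:
        research = pool.submit(research_agent, task)
        code = pool.submit(coder_agent, task)
        findings = [research.result(), code.result()]
    # The writer needs both results, so it runs afterwards
    return writer_agent(findings)


report = orchestrate("Q1 delivery delays")
print(report)
```

Threads fit here because real workers spend most of their time waiting on network calls to an LLM or a tool.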

Confidence threshold routing

Python — HITL Router
CONFIDENCE_THRESHOLD = 0.88

async def route_decision(decision: Decision) -> ActionResult:
    if decision.confidence >= CONFIDENCE_THRESHOLD:
        # High confidence — auto-execute
        return await execute_action(decision)
    else:
        # Low confidence — queue for human review
        review_id = await queue_for_review({
            "decision": decision,
            "confidence": decision.confidence,
            "reasoning": decision.reasoning,
            "suggested_action": decision.action
        })
        return ActionResult(status="pending_review", review_id=review_id)
🔑
HITL is a feature, not a limitation
The best production agents aren't fully autonomous — they're autonomy-calibrated. Low confidence + high stakes = always route to human. High confidence + low stakes = auto-execute. Tune thresholds empirically against your domain.
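
The router above decides on confidence alone. One way to add the stakes dimension is sketched below. The `high_stakes` flag and the handling of the two mixed cases (sent to review here) are assumptions of this sketch, not a rule from the text; only the two corner cases are given.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.88  # same value as the router above


@dataclass
class Decision:
    action: str
    confidence: float   # 0.0 to 1.0, however your agent estimates it
    high_stakes: bool   # e.g. moves money or emails a client


def route(decision: Decision) -> str:
    confident = decision.confidence >= CONFIDENCE_THRESHOLD
    if confident and not decision.high_stakes:
        return "auto_execute"   # high confidence + low stakes
    # Low confidence + high stakes always goes to a human. The two mixed
    # cases are a policy choice; this sketch plays it safe and reviews them too.
    return "human_review"


print(route(Decision("archive newsletter", 0.95, high_stakes=False)))
print(route(Decision("refund customer", 0.60, high_stakes=True)))
```

When you tune this against real traffic, the mixed cases are where most of the judgement lies, so log them separately.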

Why separate planning from execution

ReAct agents interleave planning and execution — they can go down wrong paths and recover, but may be inefficient. Plan & Execute forces the LLM to think through the full approach before acting. This is better for tasks where the steps are known upfront and backtracking is expensive (e.g. writing a long document, running a multi-step analysis).

1
Planner LLM call
Given the goal, generate a JSON array of steps: [{"step": "Search for Q1 data"}, {"step": "Calculate averages"}, …]
2
Executor loop
For each step, call a ReAct sub-agent that executes just that step and returns a result.
3
Re-plan if needed
After each step, optionally let the planner revise remaining steps based on what was discovered.
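
The first two steps above can be sketched as follows. The planner and executor are stubs standing in for LLM calls, their signatures are assumptions, and the optional re-planning step is left out to keep the example short.

```python
import json
from typing import Callable


def plan_and_execute(goal: str,
                     planner: Callable[[str], str],
                     executor: Callable[[str, list], str]) -> list:
    """Plan once up front, then run each step in order."""
    steps = json.loads(planner(goal))   # planner returns a JSON array of steps
    results = []
    for item in steps:
        # Each step sees the results so far, so later steps can build on earlier ones
        results.append(executor(item["step"], results))
    return results


def stub_planner(goal: str) -> str:
    """Stand-in for the planner LLM call."""
    return json.dumps([{"step": "Search for Q1 data"}, {"step": "Calculate averages"}])


def stub_executor(step: str, previous: list) -> str:
    """Stand-in for a ReAct sub-agent that executes one step."""
    return f"done: {step} (after {len(previous)} earlier steps)"


results = plan_and_execute("Summarise Q1", stub_planner, stub_executor)
print(results)
```

To add re-planning, call the planner again inside the loop with the results so far and replace the remaining steps.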

The two pipelines

RAG Architecture — Ingestion + Retrieval
INGESTION (offline)
📄 Raw Documents
PDFs, emails, markdown, web pages, SQL tables
✂️ Chunker
Split into ~500 token overlapping chunks
🔢 Embed
nomic-embed-text → 768-dim float vector per chunk
🗃️ ChromaDB
Store (vector, metadata, raw text)
─────────────────────────────────
RETRIEVAL (at query time)
❓ User Query
"Any DHL conflicts on March 15?"
🔢 Embed Query
Same model → 768-dim query vector
🔍 Similarity Search
cosine(q, docs) — top-k above threshold 0.72
🧠 LLM + Context
Retrieved chunks injected into prompt → grounded answer
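
Both pipelines fit in a few lines if you swap the real pieces for toys. In this sketch `embed` is a bag-of-words counter standing in for nomic-embed-text, a Python list stands in for ChromaDB, and the `min_score` of 0.2 is chosen for the toy vectors only; it is not comparable to the 0.72 threshold used with real embeddings.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Ingestion (offline): store (vector, raw text) for every chunk
chunks = [
    "DHL pickup scheduled March 15 at 14:00",
    "UPS delivery confirmed for April 2",
    "Invoice 1042 was paid in full",
]
index = [(embed(chunk), chunk) for chunk in chunks]


# Retrieval (query time): embed the query, rank, keep top-k above the threshold
def retrieve(query: str, k: int = 2, min_score: float = 0.2) -> list:
    query_vec = embed(query)
    scored = sorted(((cosine(query_vec, vec), text) for vec, text in index), reverse=True)
    return [text for score, text in scored[:k] if score >= min_score]


question = "Any DHL conflicts on March 15?"
hits = retrieve(question)
prompt = "Answer using only this context:\n" + "\n".join(hits) + "\n\nQuestion: " + question
print(prompt)
```

Word counts only match on shared words, whereas a real embedding model also matches on meaning. The shape of the pipeline is the same either way.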

RAG vs. fine-tuning

📚 RAG
  • Knowledge updated instantly (re-ingest)
  • No GPU or training cost
  • Cites sources (transparent)
  • Works with any LLM
  • Limited by context window at retrieval
🎯 Fine-tuning
  • Knowledge baked into weights
  • Expensive GPU training required
  • Better for style/format changes
  • Faster at inference (no retrieval step)
  • Stale: requires re-training to update
Rule of thumb
Use RAG when you need the LLM to reason over your data. Use fine-tuning when you need the LLM to behave differently (different tone, format, domain-specific reasoning patterns).

What is an embedding?

An embedding model converts a string of text into a list of floating-point numbers — a vector in high-dimensional space. nomic-embed-text produces 768 numbers. OpenAI's text-embedding-3-large produces 3,072.

The magic: semantically similar texts produce vectors that are geometrically close. "DHL shipment delayed" and "FedEx delivery postponed" have high cosine similarity even though they share no words.

Cosine similarity

Python
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite
# Typical RAG threshold: 0.70–0.80 (tune for your domain)

# With OllamaEmbeddings:
from langchain_ollama import OllamaEmbeddings
emb = OllamaEmbeddings(model="nomic-embed-text")

v1 = emb.embed_query("DHL shipment delayed")       # 768 floats
v2 = emb.embed_query("FedEx delivery postponed")    # 768 floats
print(cosine_similarity(v1, v2))  # → ~0.87

Embedding model comparison

Model | Dims | Cost | Best for
nomic-embed-text | 768 | Free (local) | General purpose, good quality/speed
mxbai-embed-large | 1024 | Free (local) | High quality, slower
text-embedding-3-small | 1536 | $0.02/1M tokens | Best OpenAI value
text-embedding-3-large | 3072 | $0.13/1M tokens | Highest accuracy
text-embedding-004 (Google) | 768 | Free tier | Gemini ecosystem

Four main strategies

📏
Fixed Size
Split every N tokens regardless of content. Simple, fast. Add 10–20% overlap to avoid cutting mid-sentence. Default: 512 tokens, 50 overlap.
🔤
Sentence / Paragraph
Split on natural boundaries (sentences, paragraphs). Better semantic coherence, variable chunk size. Best for prose documents.
📑
Semantic
Embed sentences, then group consecutive sentences with similar embeddings. More expensive but best quality. Use SemanticChunker in LangChain.
🌲
Hierarchical (RAPTOR)
Build a tree: embed chunks → cluster → summarise clusters → embed summaries. Enables both detailed and high-level retrieval.
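
Fixed-size chunking is simple enough to write by hand, which makes the overlap mechanics easy to see. A minimal sketch using the defaults quoted above (512 tokens, 50 overlap); it splits on whitespace as a stand-in for a real tokenizer, and `chunk_fixed` is an illustrative name.

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Split a token list into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


words = [f"w{i}" for i in range(1000)]  # pretend document of 1,000 tokens
chunks = chunk_fixed(words)
print([len(chunk) for chunk in chunks])  # [512, 512, 76]
```

The last 50 tokens of each chunk reappear at the start of the next, so a sentence cut at a boundary is still whole in one of the two.
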
Python — LangChain Chunkers
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,  # best default choice
    SentenceTransformersTokenTextSplitter,  # token-based alternative
)
from langchain_experimental.text_splitter import SemanticChunker

# Recursive: tries paragraphs first, then lines, sentences, words
# Note: chunk_size and chunk_overlap count characters here, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)  # document_text: your raw string

# Semantic — groups similar sentences (best quality)
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"  # or "standard_deviation"
)

Comparison

Database | Type | Self-hosted | Best for
ChromaDB | Local / embedded | Yes | Local dev, small datasets, privacy
Pinecone | Managed cloud | No | Production at scale, no ops overhead
Weaviate | Open-source / cloud | Yes | Hybrid search (BM25 + semantic)
Qdrant | Open-source / cloud | Yes | Rust-based, high performance
pgvector | Postgres extension | Yes | Already using Postgres, small-medium scale
FAISS | In-memory library | Yes | Research, large batches, no persistence needed

For getting started: ChromaDB requires zero infrastructure and zero config. When you're ready for production scale, migrate to Pinecone or Qdrant with the same LangChain abstraction layer.

Core abstractions

🔗
Chain (LCEL)
Compose LLM calls, prompts, parsers, and tools with pipe syntax: prompt | llm | parser. LangChain Expression Language.
🤖
AgentExecutor
Wraps an agent + tool list into a runnable loop. Handles tool dispatch, error recovery, and max iteration limits.
📚
Retriever
Standard interface for vector stores, BM25, web search. Swap ChromaDB for Pinecone by changing only the store setup; code that calls the retriever stays the same.
🧠
Memory
Conversation buffer, summary memory, vector memory. Manages what gets injected into the context window each turn.

Simple RAG chain (LCEL)

Python — LangChain LCEL
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template("""
Answer based on this context:
{context}

Question: {question}
""")

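def format_docs(docs):
    # Join retrieved Document objects into plain text for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

# Pipe syntax: each step feeds the next
# (retriever and llm are the objects built in the local RAG agent example)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)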

answer = rag_chain.invoke("What deliveries are due on Friday?")
🌐
LangGraph vs. LangChain AgentExecutor
LangChain's AgentExecutor is great for simple ReAct loops. LangGraph is for complex workflows: conditional branching, parallel nodes, human-in-the-loop checkpoints, and persistent state that survives across API calls.

State graph pattern

Python — LangGraph
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    confidence: float
    needs_review: bool

def analyse_email(state: AgentState) -> AgentState:
    # Call LLM, update state
    ...

def check_confidence(state: AgentState) -> str:
    return "human_review" if state["confidence"] < 0.88 else "send_reply"

builder = StateGraph(AgentState)
builder.add_node("analyse", analyse_email)
builder.add_node("send_reply", send_email)
builder.add_node("human_review", queue_for_review)

builder.set_entry_point("analyse")
builder.add_conditional_edges("analyse", check_confidence)
builder.add_edge("send_reply", END)
builder.add_edge("human_review", END)

graph = builder.compile()

Key differences from LangChain

⛓️ LangChain
  • General-purpose orchestration
  • Agents, chains, tools, memory
  • Larger ecosystem
  • More boilerplate for RAG
🦙 LlamaIndex
  • Document-first RAG specialist
  • Automatic ingestion pipelines
  • Advanced indexing (RAPTOR, etc.)
  • Less code for pure RAG use cases
Python — LlamaIndex Quick Start
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load all docs from a folder
documents = SimpleDirectoryReader("./data").load_data()

# Index (embed + store) automatically
index = VectorStoreIndex.from_documents(documents)

# Query with natural language
query_engine = index.as_query_engine()
response = query_engine.query("Summarise all DHL-related emails")

Agents, Tasks, Crews

Python — CrewAI
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Logistics Analyst",
    goal="Find schedule conflicts in incoming emails",
    backstory="Expert in supply chain scheduling with 10 years experience",
    tools=[search_schedules, check_calendar],
    llm=llm
)

writer = Agent(
    role="Email Composer",
    goal="Draft professional responses to logistics partners",
    backstory="Specialist in B2B communications",
    llm=llm
)

analyse_task = Task(
    description="Review the email and identify any conflicts",
    agent=researcher, expected_output="Conflict report"
)

reply_task = Task(
    description="Draft a reply addressing identified conflicts",
    agent=writer, expected_output="Email draft"
)

crew = Crew(agents=[researcher, writer], tasks=[analyse_task, reply_task])
result = crew.kickoff()
🚢
When to use CrewAI
CrewAI shines when your problem maps naturally to roles: researcher, writer, critic, coder. The role-based framing makes it easy to onboard non-engineers who can reason about "who does what" without understanding agent internals.

Model recommendations by task

Task | Recommended | Why
General agent reasoning | llama3.1:8b | Best small model for tool use and instruction following
Coding | qwen2.5-coder:7b | Fine-tuned specifically for code generation
Fast classification | mistral:7b | Very fast, good enough for binary/classification tasks
Embeddings | nomic-embed-text | High quality 768-dim, fast inference
Long context (local) | llama3.1:70b | Needs 64GB+ RAM; best local quality at any context
Vision + text | llava:13b | Multimodal, process images + text locally
Vision + textllava:13bMultimodal, process images + text locally