AI Tracing Platforms: Langfuse, LangSmith, Helicone, Phoenix
Compare tracing and observability platforms specifically for LLM and agent applications.
AI Evaluation Platforms: When to Buy vs Build
Eval platforms (Braintrust, LangSmith, Weights & Biases) accelerate teams. The buy-vs-build call depends on team size, use cases, and customization needs.
Comparing AI Evaluation Platforms
Eval platforms (Braintrust, LangSmith, Weights & Biases) each approach evaluation differently, and the choice matters.
LLM Observability Tools: What to Trace, What to Sample, What to Alert
LLM observability tools (LangSmith, Langfuse, Helicone, Datadog LLM Observability, custom builds) all trace conversations. What differentiates them is evaluation, dashboards, and alerting; choosing the wrong tool wastes months.
AI Observability Stack 2026: Traces, Metrics, and Cost in One Pane
Building a unified view across LangSmith, Datadog LLM Observability, OpenTelemetry, and custom dashboards.
AI Agent Evaluation Platforms in 2026
Compare LangSmith, Braintrust, Humanloop, and friends for evaluating multi-step agent traces.
Building with LangGraph
LangGraph became the production favorite in 2026 for good reasons — explicit state, checkpointing, first-class MCP. Build a real agent end-to-end and learn why.
Evaluating Agent Performance: SWE-bench, WebArena, GAIA
Numbers on leaderboards are seductive and often wrong. Learn the big benchmarks, how their leaderboards rank models, their recently exposed cheats, and how to run your own evals.
Production Agent Patterns: Queues, Retries, Idempotency
A prototype agent and a production agent have the same LLM. What's different is everything around it — durable state, retries, idempotency, observability. The real engineering.
Capstone: Build and Ship a Real Agent
Everything comes together. Design, code, test, secure, and ship a production-quality agent with open-source code you can fork today.
Reading an Agent Trace
A trace is the full record of what an agent did and why.
ML Engineer in 2026: You Build the Tools Everyone Else Uses
Fine-tune, evaluate, serve, monitor. The ML engineer is the person who ships the models that now power medicine, law, and design. It is the highest-leverage engineering role.
Evaluating Prompt Performance: From Vibes to Metrics
You can't improve what you don't measure. Build an eval set, pick metrics, and turn prompt engineering from gut-feel into a rigorous discipline.
AI Monitoring Stack: From Metrics to Quality
AI monitoring requires more than uptime metrics. Quality monitoring, drift detection, and outcome tracking are the real differentiators.
Evaluation
Testing a model's quality — beyond benchmarks, using real tasks and user feedback.
LLMOps
MLOps specifically for LLM-based applications — prompt versioning, eval, and inference ops.
LangChain
Open-source framework for chaining LLM calls, tools, and memory into apps.