accelerators, AI Accelerators-Power consumption
active injection, Indirect prompt injection
adapter-based methods, PEFT techniques
adapters
agents, Agents-Efficiency
agent failure modes and evaluation, Agent Failure Modes and Evaluation-Efficiency
overview, Agent Overview-Agent Overview
planning agents, Planning-Tool selection
tools, Tools-Write actions
AI accelerators (see accelerators)
AI application building (see application building)
AI application planning (see application planning)
AI engineering (AIE)
AI engineering architecture (see engineering architecture)
AI engineering stack (see engineering stack)
AI judge, AI as a Judge
AI pipeline orchestration (see pipeline orchestration)
AI systems evaluation (see systems evaluation)
AI-as-a-judge, AI as a Judge-What Models Can Act as Judges?
limitations, Limitations of AI as a Judge-Biases of AI as a judge
models, What Models Can Act as Judges?-What Models Can Act as Judges?
reasons, Why AI as a Judge?
reference-based, What Models Can Act as Judges?
AI-powered data synthesis (see data synthesis, AI-powered)
AMP (automatic mixed precision), Training quantization
ANN (approximate nearest neighbor), Embedding-based retrieval
Annoy (approximate nearest neighbors oh yeah), Embedding-based retrieval
anomaly detection, Similarity Measurements Against Reference Data
Anthropic
APIs (see open source models, model APIs versus)
application building, Introduction to Building AI Applications with Foundation Models-Summary
application planning, Planning AI Applications-Maintenance
engineering stack, The AI Engineering Stack-AI Engineering Versus Full-Stack Engineering
foundation model use cases, Foundation Model Use Cases-Workflow Automation
rise of AI engineering, The Rise of AI Engineering-From Foundation Models to AI Engineering
application development, Three Layers of the AI Stack, Application development-AI interface
application planning, Planning AI Applications-Maintenance
approximate nearest neighbor (ANN), Embedding-based retrieval
approximate string matching, Lexical similarity
ARC-C, Public leaderboards
attention mechanisms, Attention mechanism-Attention mechanism
attention modules, Transformer block
MLP modules, Transformer block
optimization, Attention mechanism optimization-Writing kernels for attention computation
redesign, Redesigning the attention mechanism
attention modules, Transformer block
augmentation of data (see data augmentation)
automated attacks, Automated attacks
automatic mixed precision (AMP), Training quantization
autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
autoregressive language model, Language models
backpropagation, Backpropagation and Trainable Parameters-Backpropagation and Trainable Parameters
batch inference APIs, Online and batch inference APIs-Online and batch inference APIs
batch size, Batch size
batching
benchmarks
biases, Biases of AI as a judge, Biases
bits-per-byte (BPB), Bits-per-Character and Bits-per-Byte
bits-per-character (BPC), Bits-per-Character and Bits-per-Byte
bottlenecks
BPB (bits-per-byte), Bits-per-Character and Bits-per-Byte
BPC (bits-per-character), Bits-per-Character and Bits-per-Byte
build time, Comparing retrieval algorithms
canonical responses, Similarity Measurements Against Reference Data
capability extension, Capability extension
chain-of-thought (CoT), Give the Model Time to Think-Give the Model Time to Think, Data Curation
chaining, AI Pipeline Orchestration
change failure rate (CFR), Monitoring and Observability
CharacterEval, Roleplaying
ChatGPT
Chinchilla scaling law, Scaling law: Building compute-optimal models
chunking, RAG Architecture, Chunking strategy-Chunking strategy
Claude, RAG and, RAG
CLIP, From Large Language Models to Foundation Models, Domain-Specific Models, Introduction to Embedding
clustering, Similarity Measurements Against Reference Data
Common Crawl dataset, Training Data-Multilingual Models
comparative evaluation, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
comparison data, Reward model
compilers, Kernels and compilers
components definition, AI Pipeline Orchestration
computational bottlenecks, Computational bottlenecks-Computational bottlenecks
computational capabilities, of AI accelerators, Computational capabilities
compute-bound bottlenecks, Computational bottlenecks
compute-optimal models, Scaling law: Building compute-optimal models-Scaling law: Building compute-optimal models
compute-optimal training, Scaling law: Building compute-optimal models
concatenation, Concatenation
constrained sampling, Constrained sampling
context construction, Prompt engineering and context construction, Provide Sufficient Context, Step 1. Enhance Context
context efficiency, Context Length and Context Efficiency-Context Length and Context Efficiency
context length, Context Length and Context Efficiency-Context Length and Context Efficiency
context parallelism, Parallelism
context precision, Comparing retrieval algorithms
context recall, Comparing retrieval algorithms
contextual retrieval, Contextual retrieval-Contextual retrieval
continuous batching, Batching
control flow, Complex plans
conversational bots, Conversational Bots
conversational feedback
conversation length, Conversation length
conversation organization, Conversation organization
extracting, Extracting Conversational Feedback-Dialogue diversity
language diversity, Dialogue diversity
natural language feedback, Natural language feedback-Sentiment
regeneration, Regeneration
copyright regurgitation, Information Extraction
copyright, model training and, Data lineage and copyright
CoT (chain-of-thought), Give the Model Time to Think-Give the Model Time to Think
CPU memory (DRAM), Memory size and bandwidth
criteria ambiguity, Criteria ambiguity-Criteria ambiguity
cross entropy, Cross Entropy
cross-layer attention, Redesigning the attention mechanism
data annotation, Data Acquisition and Annotation-Data Acquisition and Annotation
data augmentation, Data Augmentation and Synthesis-Model Distillation
data cleaning/filtering, Clean and Filter Data
data contamination, Data contamination with public benchmarks-Handling data contamination
data coverage, Data Coverage-Data Coverage
data curation, Data Curation-Data Acquisition and Annotation
data deduplication, Similarity Measurements Against Reference Data, Deduplicate Data-Deduplicate Data
data flywheels, Data Acquisition and Annotation
data formatting, Format Data-Format Data
data inspection, Inspect Data-Inspect Data
data lineage, Data lineage and copyright
data organization, Data Organization
data privacy, Data privacy
data processing, Data Processing-Format Data
data synthesis, Data Augmentation and Synthesis-Model Distillation
AI-powered, AI-Powered Data Synthesis-Obscure data lineage
model distillation, Model Distillation
traditional techniques, Traditional Data Synthesis Techniques-Simulation
data verification, Data verification-Data verification
dataset engineering, Dataset engineering, Dataset Engineering-Summary
data augmentation/synthesis, Data Augmentation and Synthesis-Model Distillation
data curation, Data Curation-Data Acquisition and Annotation
data processing, Data Processing-Format Data
data-centric view of AI, Dataset Engineering
DDR SDRAM (double data rate synchronous dynamic random-access memory), Memory size and bandwidth
debugging, Break Complex Tasks into Simpler Subtasks
decoding
defensive prompt engineering
jailbreaking and prompt injection, Jailbreaking and Prompt Injection-Indirect prompt injection
prompt attack defense, Defenses Against Prompt Attacks-System-level defense
degenerate feedback loops, Degenerate feedback loop
demonstration data, Supervised Finetuning
dense retrievers, Retrieval Algorithms
dimensionality reduction, Deduplicate Data
direct manual prompt hacking, Direct manual prompt hacking-Direct manual prompt hacking
Direct Preference Optimization (DPO), Preference Finetuning
distillation, Reasons to Finetune
domain-specific capability, Domain-Specific Capability-Domain-Specific Capability
domain-specific task finetuning, Reasons Not to Finetune
domain-specific training data models, Domain-Specific Models-Domain-Specific Models
dot products, Attention mechanism
double data rate synchronous dynamic random-access memory (DDR SDRAM), Memory size and bandwidth
DPO (Direct Preference Optimization), Preference Finetuning
DRAM (CPU memory), Memory size and bandwidth
drift detection, Drift detection
dynamic batching, Batching
dynamic features, The role of AI and humans in the application
edit distance, Lexical similarity
Elo, Ranking Models with Comparative Evaluation, Scalability bottlenecks, Quantized LoRA
embedding, Introduction to Embedding-Introduction to Embedding
embedding algorithm, Semantic similarity, Introduction to Embedding
embedding models, From Large Language Models to Foundation Models, Introduction to Embedding
engineering architecture, AI Engineering Architecture-AI Pipeline Orchestration
AI pipeline orchestration, AI Pipeline Orchestration-AI Pipeline Orchestration
monitoring and observability, Monitoring and Observability-Drift detection
monitoring versus observability, Monitoring and Observability
step 1: enhancing context, Step 1. Enhance Context
step 2: putting in guardrails, Step 2. Put in Guardrails-Guardrail implementation
step 3: adding model router and gateway, Step 3. Add Model Router and Gateway-Gateway
step 4: reducing latency with caches, Step 4. Reduce Latency with Caches-Semantic caching
step 5: adding agent patterns, Step 5. Add Agent Patterns
engineering stack, Three Layers of the AI Stack-Three Layers of the AI Stack
application development, Three Layers of the AI Stack
infrastructure, Three Layers of the AI Stack
ML engineering versus, Model development-Inference optimization
model development, Three Layers of the AI Stack
entropy, Entropy
epochs, Number of epochs
error correction, Reflection and error correction-Reflection and error correction
evaluation, Evaluation
evaluation harnesses, Navigate Public Benchmarks
evaluation methodology, Evaluation Methodology-Summary
AI as a judge, AI as a Judge-What Models Can Act as Judges?
AI systems evaluation (see systems evaluation)
challenges, Challenges of Comparative Evaluation-From comparative performance to absolute performance
challenges of foundation model evaluation, Challenges of Evaluating Foundation Models-Challenges of Evaluating Foundation Models
exact evaluation, Exact Evaluation-Introduction to Embedding
language model for computing text perplexity, Perplexity Interpretation and Use Cases
language modeling metrics, Understanding Language Modeling Metrics-Perplexity Interpretation and Use Cases
rank models with comparative evaluation, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
evaluation pipeline design, Design Your Evaluation Pipeline-Iterate
step 1: evaluating all components in a system, Step 1. Evaluate All Components in a System-Step 1. Evaluate All Components in a System
step 2: creating an evaluation guideline, Step 2. Create an Evaluation Guideline-Tie evaluation metrics to business metrics
step 3: defining evaluation methods and data, Step 3. Define Evaluation Methods and Data-Iterate
evaluation-driven development, Evaluation Criteria-Evaluation Criteria
eviction policies, Exact caching
exact caching, Exact caching
exact evaluation, Exact Evaluation-Introduction to Embedding
exact matches, Exact match
expectation setting, Setting Expectations
explicit feedback, Extracting Conversational Feedback-Dialogue diversity
factual consistency, Factual consistency-Factual consistency, Create scoring rubrics with examples
faithfulness, Generation Capability
feature-based transfers, Finetuning, Finetuning Overview
feature-free transfers, Finetuning
federated learning, Model Merging and Multi-Task Finetuning
feedback design
how to collect feedback, How to collect feedback-How to collect feedback
when to collect feedback, When to collect feedback
feedforward computation, Parallelism
feedforward layer, Transformer block, LoRA configurations
few-shot learning, In-Context Learning: Zero-Shot and Few-Shot-In-Context Learning: Zero-Shot and Few-Shot
finetuning, Finetuning-Summary
defined, Modeling and training
domain-specific tasks, Reasons Not to Finetune
finetuning and RAG, Finetuning and RAG-Finetuning and RAG
hyperparameters, Finetuning hyperparameters-Prompt loss weight
memory bottlenecks, Memory Bottlenecks-Training quantization
overview, Finetuning Overview-Finetuning Overview
structured outputs, Finetuning
tactics, Finetuning Tactics-Prompt loss weight
techniques, Finetuning Techniques-Prompt loss weight
when to finetune, When to Finetune-Finetuning and RAG
FLOP (floating point operation), Model Size
foundation models, From Foundation Models to AI Engineering, Understanding Foundation Models-Summary
evaluation challenges, Challenges of Evaluating Foundation Models-Challenges of Evaluating Foundation Models
inverse scaling, Model Size
modeling, Modeling-Scaling bottlenecks
parameter versus hyperparameter, Scaling extrapolation
post-training, Post-Training-Finetuning using the reward model
sampling, Sampling-Hallucination
training data, Training Data-Domain-Specific Models
use cases, Foundation Model Use Cases-Workflow Automation
full finetuning, Parameter-Efficient Finetuning-Quantized LoRA
function calling, Function calling-Function calling
fuzzy matching, Lexical similarity
H3 architecture, Other model architectures
hallucinations
hard attributes, Model Selection Workflow
hashing, Deduplicate Data
HellaSwag, Public leaderboards
hierarchical navigable small world (HNSW), Embedding-based retrieval
high-bandwidth memory (HBM), Memory size and bandwidth
hyperparameters, Scaling extrapolation, Finetuning hyperparameters-Prompt loss weight
IDF (inverse document frequency), Term-based retrieval
IFEval, Instruction-following criteria
implicit feedback, Extracting Conversational Feedback
in-context learning, In-Context Learning: Zero-Shot and Few-Shot-In-Context Learning: Zero-Shot and Few-Shot
inconsistency, Inconsistency-Inconsistency, Inconsistency
indexing
indirect prompt injection, Indirect prompt injection-Indirect prompt injection
inference APIs, Online and batch inference APIs-Online and batch inference APIs
inference optimization, Inference optimization, Inference Optimization-Summary
AI accelerators
case study from PyTorch, Kernels and compilers
inference overview
inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
inference service optimization, Inference Service Optimization-Parallelism
KV cache size calculation, Attention mechanism optimization
memory-bound versus bandwidth-bound inference, Computational bottlenecks
at model/hardware/service levels, Inference Optimization
model optimization, Model Optimization-Kernels and compilers
understanding, Understanding Inference Optimization-Power consumption
inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
inference quantization, Inference quantization-Inference quantization
inference service
inference service optimization, Inference Service Optimization-Parallelism
inference with reference, Inference with reference
INFOBench, Instruction-following criteria
information aggregation, Information Aggregation
information extraction, Information Extraction-Information Extraction
information retrieval optimization, Retrieval Optimization-Contextual retrieval
instruction data synthesis, Instruction data synthesis-Instruction data synthesis
instruction-following capability, Instruction-Following Capability-Roleplaying
instruction-following criteria, Instruction-following criteria-Instruction-following criteria
intent classifiers, Router
inter-token latency (ITL), Latency, TTFT, and TPOT
interface, AI, AI interface
internal knowledge, Memory
inverse document frequency (IDF), Term-based retrieval
inverted file index (IVF), Embedding-based retrieval
iteration, Iterate
jailbreaking, Jailbreaking and Prompt Injection-Indirect prompt injection
Jamba architecture, Other model architectures
judges (see AI-as-a-judge)
LangChain, Evaluate Prompt Engineering Tools, Prompt-level defense, Memory
language modeling metrics, Understanding Language Modeling Metrics-Perplexity Interpretation and Use Cases
language models, Language models-Language models, Perplexity Interpretation and Use Cases
large language models, From Large Language Models to Foundation Models-From Large Language Models to Foundation Models
large multimodal model (LMM), From Large Language Models to Foundation Models
latency
layer stacking, Layer stacking-Layer stacking
leaderboards, Scalability bottlenecks-Lack of standardization and quality control, Benchmark selection and aggregation-Custom leaderboards with public benchmarks
learning rate, Learning rate
leniency bias, Biases
lexical similarity, Lexical similarity-Lexical similarity
linear combination summing, Linear combination-Linear combination
Llama
LLM-as-a-judge, AI as a Judge
LMM (large multimodal model), From Large Language Models to Foundation Models
local factual consistency, Factual consistency
locality-sensitive hashing (LSH), Embedding-based retrieval
logit vectors, Sampling Fundamentals
logprobs, Temperature, Select evaluation methods
long-term memory, Memory
loop tiling, Kernels and compilers
LoRA (low-rank adaptation), LoRA-Quantized LoRA
low-rank factorization, LoRA
LSH (locality-sensitive hashing), Embedding-based retrieval
Mamba architecture, Other model architectures
manual generation, Traditional Data Synthesis Techniques-Simulation
masked language models, Language models
Massive Multitask Language Understanding (MMLU), Maintenance, Public leaderboards
MBU (model bandwidth utilization), Utilization, MFU, and MBU-Utilization, MFU, and MBU
MCQs (multiple-choice questions), Domain-Specific Capability
mean time to detection (MTTD), Monitoring and Observability
mean time to response (MTTR), Monitoring and Observability
memory bottlenecks, Memory Bottlenecks-Training quantization
bandwidth-bound, Computational bottlenecks
memory math, Memory Math-Memory needed for training
quantization, Quantization-Training quantization
size and bandwidth, Memory size and bandwidth-Memory size and bandwidth
memory math, Memory Math-Memory needed for training
MFU (model FLOPs utilization), Utilization, MFU, and MBU-Utilization, MFU, and MBU
milestone planning, Milestone Planning
mixture-of-experts (MoE) models, Model Size, Layer stacking
ML engineering, AI engineering versus, AI Engineering Versus ML Engineering-AI interface
MLP modules, Transformer block
MMLU (Massive Multitask Language Understanding), Maintenance, Public leaderboards
model APIs, open source models versus (see open source models, model APIs versus)
model architecture, Model Architecture-Other model architectures
model bandwidth utilization (MBU), Utilization, MFU, and MBU-Utilization, MFU, and MBU
model compression, Model compression
model development, Three Layers of the AI Stack, Model development-Inference optimization
model distillation, Model Distillation
model FLOPs utilization (MFU), Utilization, MFU, and MBU-Utilization, MFU, and MBU
model inference, Maintenance
model merging, Model Merging and Multi-Task Finetuning-Concatenation
model optimization, Model Optimization-Kernels and compilers
attention mechanism optimization, Attention mechanism optimization-Writing kernels for attention computation
autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
kernels and compilers, Kernels and compilers-Kernels and compilers
model compression, Model compression
model ranking, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
model router, Step 3. Add Model Router and Gateway-Gateway
model selection, Model Selection-Handling data contamination
model build versus buy, Model Build Versus Buy-On-device deployment
model selection workflow, Model Selection Workflow-Model Selection Workflow
navigating public benchmarks, Navigate Public Benchmarks-Custom leaderboards with public benchmarks
model size, Model Size-Scaling bottlenecks
model-centric AI, Dataset Engineering
model-level defense, Model-level defense
modeling, Modeling-Scaling bottlenecks
MoE (mixture-of-experts) models, Layer stacking
monitoring, Break Complex Tasks into Simpler Subtasks, Monitoring and Observability-Drift detection
MTTD (mean time to detection), Monitoring and Observability
MTTR (mean time to response), Monitoring and Observability
multi-query attention, Redesigning the attention mechanism
multi-task finetuning, Model Merging and Multi-Task Finetuning
multilingual training data models, Multilingual Models-Multilingual Models
multimodal models, From Large Language Models to Foundation Models
multiple-choice questions (MCQs), Domain-Specific Capability
n-gram similarity, Lexical similarity
natural language feedback, Natural language feedback-Sentiment
natural language generation (NLG), Generation Capability-Safety
natural language processing (NLP), Generation Capability-Safety
needle in a haystack (NIAH) test, Context Length and Context Efficiency
obscure data lineage, Obscure data lineage
observability, Monitoring and Observability-Drift detection
on-device deployment, On-device deployment
online inference APIs, Online and batch inference APIs-Online and batch inference APIs
Open CLIP, Domain-Specific Models
open source licenses, Open source, open weight, and model licenses-Open source, open weight, and model licenses
open source models, model APIs versus, Open source models versus model APIs-On-device deployment
open weight models, Open source, open weight, and model licenses
OpenAI
operator fusion, Kernels and compilers
optimization
pairwise comparison, Deduplicate Data
parallel decoding, Parallel decoding
parallelism, Parallelism-Parallelism
parallelization, Break Complex Tasks into Simpler Subtasks, Kernels and compilers
parameter-efficient finetuning, Parameter-Efficient Finetuning-Quantized LoRA
adapter-based/soft-prompt techniques, PEFT techniques-PEFT techniques
LoRA, LoRA-Quantized LoRA
Pareto optimization, Cost and Latency
partial finetuning, Parameter-Efficient Finetuning
passive phishing, Indirect prompt injection
PEFT (see parameter-efficient finetuning)
perplexity, Perplexity-Perplexity Interpretation and Use Cases
perturbation, Rule-based data synthesis
pipeline orchestration, AI Pipeline Orchestration-AI Pipeline Orchestration
monitoring and observability, Monitoring and Observability-Drift detection
planning
plan generation, Plan generation-Complex plans
reflection and error correction, Reflection and error correction-Reflection and error correction
pointwise evaluation, Reward model, Ranking Models with Comparative Evaluation
position bias, Biases
post-processing, Prompting
post-training, Modeling and training, Post-Training-Finetuning using the reward model
potential model collapse, Potential model collapse
power consumption, Power consumption-Power consumption
PPO (proximal policy optimization), Finetuning using the reward model
pre-training, Modeling and training
precision bits, Numerical Representations
preference bias, Biases
preference finetuning, Preference Finetuning-Finetuning using the reward model, Finetuning Overview
preference models, What Models Can Act as Judges?
prefilling, Transformer architecture
prefilling, decoupling from decoding, Decoupling prefill and decode
proactive features, The role of AI and humans in the application
probabilistic nature of AI, The Probabilistic Nature of AI-Hallucination
procedural generation, Traditional Data Synthesis Techniques-Simulation
product quantization, Embedding-based retrieval
prompt attacks, Defensive Prompt Engineering, Jailbreaking and Prompt Injection-Indirect prompt injection
prompt caching, Prompt caching-Prompt caching
prompt catalogs, Organize and Version Prompts
prompt engineering, Prompt Engineering-Summary
basics, Introduction to Prompting-Context Length and Context Efficiency
best practices, Prompt Engineering Best Practices-Organize and Version Prompts
defensive engineering, Defensive Prompt Engineering-System-level defense
restricting model knowledge to its context, Provide Sufficient Context
terminology ambiguity: prompt versus context, In-Context Learning: Zero-Shot and Few-Shot
prompt loss weight, Prompt loss weight
prompt optimization, Evaluate Prompt Engineering Tools
prompt versioning, Organize and Version Prompts-Organize and Version Prompts
prompt-level defense, Prompt-level defense
proprietary prompts, Proprietary Prompts and Reverse Prompt Engineering-Proprietary Prompts and Reverse Prompt Engineering
proximal policy optimization (PPO), Finetuning using the reward model
public leaderboards, Public leaderboards
QAT (quantization-aware training), Training quantization
QLoRA (quantized LoRA), Quantized LoRA-Quantized LoRA
QPS (queries per second), Comparing retrieval algorithms
quality control, Quality control
quantization, Quantization-Training quantization
quantization-aware training (QAT), Training quantization
quantized LoRA (QLoRA), Quantized LoRA-Quantized LoRA
queries per second (QPS), Comparing retrieval algorithms
query rewriting, Query rewriting
query vector (Q), Attention mechanism
RAG (retrieval-augmented generation), RAG-RAG with tabular data
finetuning and, Finetuning and RAG-Finetuning and RAG
RAG architecture, RAG Architecture
RAG beyond texts, RAG Beyond Texts-RAG with tabular data
retrieval algorithms, Retrieval Algorithms-Combining retrieval algorithms
retrieval optimization, Retrieval Optimization-Contextual retrieval
random feedback, Biases
range bits, Numerical Representations
rating algorithms, Ranking Models with Comparative Evaluation
reactive features, The role of AI and humans in the application
recall, Comparing retrieval algorithms
recurrent neural networks (RNNs), Transformer architecture
reference-based judges, What Models Can Act as Judges?
reference-based metrics, Similarity Measurements Against Reference Data
reference-free metrics, Similarity Measurements Against Reference Data
reflection, Reflection and error correction-Reflection and error correction
regeneration, Regeneration
reinforcement learning from human feedback (RLHF), Preference Finetuning-Finetuning using the reward model
relevance, Generation Capability
reliability, latency versus, Guardrail implementation
replica parallelism, Parallelism
reranking, Reranking
restricted weight, Open source, open weight, and model licenses
retrieval algorithms, Retrieval Algorithms-Combining retrieval algorithms
retrieval optimization, Retrieval Optimization-Contextual retrieval
retrieval-augmented generation (see RAG)
retrievers
reverse prompt engineering, Proprietary Prompts and Reverse Prompt Engineering-Proprietary Prompts and Reverse Prompt Engineering
reward models, Reward model-Reward model, What Models Can Act as Judges?
RLHF (reinforcement learning from human feedback), Preference Finetuning-Finetuning using the reward model
RNNs (recurrent neural networks), Transformer architecture
RoleLLM, Roleplaying
roleplaying, Roleplaying-Roleplaying
rule-based data synthesis, Rule-based data synthesis-Rule-based data synthesis
S4 architecture, Other model architectures
sampling, Sampling-Hallucination
probabilistic nature of AI, The Probabilistic Nature of AI-Hallucination
sampling fundamentals, Sampling Fundamentals-Sampling Fundamentals
sampling strategies, Sampling Strategies-Stopping condition
strategies, Sampling Strategies-Stopping condition
structured outputs, Structured Outputs-Finetuning
test time compute, Test Time Compute-Test Time Compute
scaling bottlenecks, Scaling bottlenecks-Scaling bottlenecks, Scalability bottlenecks
scaling extrapolation, Scaling extrapolation
scaling law, Scaling law: Building compute-optimal models-Scaling law: Building compute-optimal models
scoring rubrics, Create scoring rubrics with examples
self-evaluation, What Models Can Act as Judges?
self-supervision language models, Self-supervision-Self-supervision
self-verification, Factual consistency
semantic caching, Semantic caching
semantic similarity, Semantic similarity-Semantic similarity
sequence parallelism, Parallelism
sequential finetuning, Model Merging and Multi-Task Finetuning
SFT (supervised finetuning), Post-Training, Supervised Finetuning-Supervised Finetuning, Finetuning Overview
short-term memory, Memory
simulation, Simulation
simultaneous finetuning, Model Merging and Multi-Task Finetuning
SLERP (spherical linear interpolation), Spherical linear interpolation (SLERP)
slicing, Annotate evaluation data
soft attributes, Model Selection Workflow
soft prompt-based PEFT methods, PEFT techniques-PEFT techniques
sparse models, Model Size, Model compression
sparse retrievers, Retrieval Algorithms
speculative decoding, Speculative decoding-Speculative decoding
spherical linear interpolation (SLERP), Spherical linear interpolation (SLERP)
SQL queries, Agent Overview
static batching, Batching
static features, The role of AI and humans in the application
stopping condition, Stopping condition
structured data, Perplexity Interpretation and Use Cases, Memory
structured outputs, Structured Outputs-Finetuning
summing, Summing-Pruning redundant task-specific parameters
superficial imitation, Superficial imitation
supervised finetuning (SFT), Post-Training, Supervised Finetuning-Supervised Finetuning, Finetuning Overview
supervision, Self-supervision
synthesis of data (see data synthesis)
system components evaluation, Step 1. Evaluate All Components in a System-Step 1. Evaluate All Components in a System
system prompts, System Prompt and User Prompt-System Prompt and User Prompt
system-level defense, System-level defense
systems evaluation, Evaluate AI Systems-Summary
evaluation criteria, Evaluation Criteria-Cost and Latency
evaluation pipeline design, Design Your Evaluation Pipeline-Iterate
evaluation-driven development, Evaluation Criteria-Evaluation Criteria
model selection, Model Selection-Handling data contamination
OpenAI model quality, Custom leaderboards with public benchmarks
task-based evaluation, Step 1. Evaluate All Components in a System
temperature, Temperature-Temperature
term frequency (TF), Term-based retrieval
text-to-SQL, Structured Outputs, Functional Correctness, RAG with tabular data
throughput, Throughput and goodput-Throughput and goodput
time between tokens (TBT), Latency, TTFT, and TPOT
time per output token (TPOT), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
time to first token (TTFT), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
tokenization, Multilingual Models, Model Size, Bits-per-Character and Bits-per-Byte, Term-based retrieval, Chunking strategy
tokenizer, Chunking strategy
tokens, Language models, Model Size
tool use, Tool selection
top-k, Top-k
top-p, Top-p
TPOT (time per output token), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
traces, Logs and traces
trainable parameters, Backpropagation and Trainable Parameters-Backpropagation and Trainable Parameters
training, Modeling and training-Modeling and training
training data, Training Data-Domain-Specific Models
training quantization, Training quantization-Training quantization
transfer learning, Finetuning Overview
transformer architecture, Transformer architecture-Transformer block
attention mechanism, Attention mechanism-Attention mechanism
transformer blocks, Transformer block-Transformer block
TruthfulQA, Public leaderboards
TTFT (time to first token), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
turn-based evaluation, Step 1. Evaluate All Components in a System
unstructured data, Data Organization, Memory
use case evaluation, Use Case Evaluation-AI product defensibility
usefulness threshold, Setting Expectations
user feedback, User Feedback-Degenerate feedback loop
extracting conversational feedback, Extracting Conversational Feedback-Dialogue diversity
feedback design, Feedback Design-How to collect feedback
feedback limitations, Feedback Limitations-Degenerate feedback loop
value vector (V), Attention mechanism
vector database, Embedding-based retrieval-Embedding-based retrieval
vectorization, Kernels and compilers
vocabulary, Perplexity Interpretation and Use Cases