While you can make a lot of progress in evals with tinkering and paying little attention to the literature, we found that various other papers have saved us many months of research effort. The Apollo Research evals team thus compiled a list of what we felt were important evals-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.
- Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
- Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating evals success probabilities.
- We think it is the best “all around” evals paper, i.e. giving the best understanding of what frontier LM agent evals look like
- We tested the calibration of their new methodologies in practice in Hojmark et al., 2024, and found that they are not well-calibrated (disclosure: Apollo involvement).
- Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
- They find that it is possible to find a low-rank decomposition of models’ capabilities from observed benchmark performances. These can be used to predict the performance of bigger models in the same family.
- Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
- The Llama 3 Herd of Models (Meta, 2024)
- Describes the training procedure of the Llama 3.1 family in detail
- We think this is the most detailed description of how state-of-the-art LLMs are trained to date, and it provides a lot of context that is helpful background knowledge for any kind of evals work.
- Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
- Shows how to use LLMs to automatically create large evals datasets. Creates 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper; a minimal sketch of the basic generate-then-filter pattern is included after this list.
- The original paper used Claude-0.5 to generate the datasets, meaning the resulting data is not very high quality. Also, the methodology section of the paper is much more confusingly written than it needs to be.
- For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement).
- Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)
- Introduces LM agent evals for model autonomy. It’s the first paper that rigorously evaluated LM agents for risks related to loss of control, thus worth highlighting.
- We recommend reading the Appendix as a starting point for understanding agent-based evaluations.
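As a rough illustration of the model-written evals idea mentioned above (a minimal sketch under our own assumptions, not the actual pipeline from Perez et al. or Dev et al.): one model call generates candidate items for a target behavior, and a second call acts as a judge that filters out low-quality items. The OpenAI client, model name, prompts, and the `ask` helper below are all placeholders.

```python
# Minimal sketch of a "model-written evals" pipeline: generate candidate
# questions with one model call, then filter them with a judge call.
# The client, model name, and prompts are placeholders, not the pipeline
# from Perez et al. (2022).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Hypothetical helper: single-turn completion with a placeholder model.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def generate_candidates(behavior: str, n: int) -> list[str]:
    # Step 1: ask the generator model for candidate yes/no questions.
    return [
        ask(
            "Write one yes/no question that tests whether an AI assistant "
            f"exhibits the following behavior: {behavior}. Output only the question."
        )
        for _ in range(n)
    ]

def filter_candidates(questions: list[str], behavior: str) -> list[str]:
    # Step 2: use a model as a judge to drop off-topic or ambiguous items.
    kept = []
    for question in questions:
        verdict = ask(
            f"Does the following question clearly test for '{behavior}'? "
            f"Answer only YES or NO.\n\nQuestion: {question}"
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(question)
    return kept

candidates = generate_candidates("sycophancy towards the user", n=20)
dataset = filter_candidates(candidates, "sycophancy towards the user")
```

In practice you would also deduplicate items, balance answer labels, and hand-check a sample, since generated datasets can otherwise be low quality (one of the criticisms of the original paper noted above).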
LM agents
Core:
Other:
Benchmarks
Core:
Other:
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023)
- QA dataset of very hard, “Google-proof” science questions that even domain experts do not reliably answer correctly
- Note: there have been reports that some gold labels are incorrect or that some questions are effectively unanswerable; we have not verified this ourselves
- AgentBench: Evaluating LLMs as Agents (Liu et al., 2023)
- Presents 8 open-ended environments for LM agents to interact with
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Laine et al., 2024)
- Evaluates situational awareness, i.e. to what extent LLMs understand the context they are in.
- Disclosure: Apollo involvement
- TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021)
- Evaluates whether LLMs mimic human falsehoods such as misconceptions, myths, or conspiracy theories.
- Towards Understanding Sycophancy in Language Models (Sharma et al., 2023)
- MC questions to evaluate sycophancy in LLMs
- GAIA: a benchmark for General AI Assistants (Mialon et al., 2023)
- Benchmark with real-world questions and tasks that require reasoning for LM agents
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (Pan et al., 2023)
- 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making
Science of evals
Core:
Other:
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- Explores using LLM judges to rate other LLMs in arena settings
- Good paper to understand Elo-based arena settings more broadly; a minimal sketch of the Elo update follows this list.
- A Survey on Evaluation of Large Language Models (Chang et al., 2023)
- Presents an overview of evaluation methods for LLMs, looking at what, where, and how to evaluate them.
- Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting (Sclar et al., 2023)
- Leveraging Large Language Models for Multiple Choice Question Answering (Robinson et al., 2022)
- Discusses how different formatting choices for MCQA benchmarks can result in significantly different performance
- Marius: It’s plausible that more capable models suffer much less from this issue.
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Turpin et al., 2023)
- Finds that the chain-of-thought of LLMs doesn’t always align with the underlying algorithm that the model must have used to produce a result.
- If the reasoning of a model is not faithful, this poses a relevant problem for black-box evals since we can trust the results less.
- See also “Measuring Faithfulness in Chain-of-Thought Reasoning” (Lanham et al., 2023).
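Since arena-style evals like Chatbot Arena rank models with Elo-style ratings, here is a minimal sketch of the standard Elo update applied to pairwise model “battles”. The K-factor and initial rating are illustrative defaults of our own, not the exact parameters used by Chatbot Arena.

```python
# Minimal sketch of Elo ratings computed from pairwise model "battles".
# K and the initial rating are illustrative defaults, not the exact
# parameters used by Chatbot Arena.
from collections import defaultdict

def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_ratings(battles, k: float = 32.0, initial: float = 1000.0) -> dict:
    # battles: iterable of (model_a, model_b, score_a), where score_a is
    # 1.0 for a win by model_a, 0.5 for a tie, 0.0 for a loss.
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score_a in battles:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        ratings[model_a] += k * (score_a - exp_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - exp_a))
    return dict(ratings)

print(elo_ratings([("model_x", "model_y", 1.0), ("model_y", "model_x", 0.5)]))
```

The Zheng et al. paper also documents judge biases (e.g. position and verbosity bias) that this simple rating update does not address.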
Software
Core:
- Inspect
- Open-source evals library designed and maintained by UK AISI and spearheaded by JJ Allaire, who intends to develop and support the framework for many years.
- Supports a wide variety of types of evals, including MC benchmarks and LM agent settings; a minimal usage sketch follows this list.
- Vivaria
- METR’s open-sourced evals tool for LM agents
- Especially optimized for LM agent evals and the METR task standard
- Aider
- Probably the most used open-source coding assistant
- We recommend using it to speed up your coding
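To give a sense of what an Inspect eval looks like, here is a minimal sketch in the style of the “Hello World” example from Inspect’s documentation. The task name, dataset, and model string are placeholders we chose, and argument names may differ across Inspect versions.

```python
# Minimal sketch of an Inspect task, in the style of the documented
# "Hello World" example. The dataset and model string are placeholders,
# and argument names may differ across Inspect versions.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic_check():
    return Task(
        dataset=[
            Sample(input="What is 2 + 2? Reply with just the number.", target="4")
        ],
        solver=[generate()],
        scorer=match(),
    )

if __name__ == "__main__":
    eval(arithmetic_check(), model="openai/gpt-4o-mini")
```

In practice, tasks are typically run via Inspect’s CLI (`inspect eval`) rather than by calling `eval()` directly, but both entry points exist.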
Other:
Miscellaneous
Core:
Other:
- Model Organisms of Misalignment (Hubinger, 2023)
- Argues that we should build small-scale versions of particularly concerning AI threat models and study them in detail
- When can we trust model evaluations? (Hubinger, 2023)
- Describes a list of conditions under which we can trust the results of model evaluations
- The Operational Risks of AI in Large-Scale Biological Attacks (Mouton et al., 2024)
- RAND study testing whether currently available LLMs provide uplift for large-scale biological weapons attacks
- Language Models (Mostly) Know What They Know (Kadavath et al., 2022)
- Tests whether models are well calibrated when predicting their own performance on QA benchmarks; see the calibration sketch after this list.
- Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning (Liao et al., 2021)
- Meta-evaluation of 107 survey papers, specifically looking at internal and external validity failure modes.
- Challenges in evaluating AI systems (Anthropic, 2023)
- Describes three failure modes Anthropic ran into when building evals
- Marius: very useful to understand the pain points of building evals. “It’s just an eval. How hard can it be?”
- Towards understanding-based safety evaluations (Hubinger, 2023)
- Argues that behavioral-only evaluations might have a hard time catching deceptively aligned systems. Thus, we need understanding-based evals that e.g. involve white-box tools.
- Marius: This aligns very closely with Apollo’s agenda, so obviously we love that post
- A starter guide for model evaluations (Apollo, 2024)
- An introductory post for people to get started in evals
- Disclosure: Apollo post
- Video: intro to model evaluations (Apollo, 2024)
- 40-minute non-technical intro to model evaluations by Marius
- Disclosure: Apollo video
- METR’s Autonomy Evaluation Resources (METR, 2024)
- List of resources for LM agent evaluations
- UK AISI’s Early Insights from Developing Question-Answer Evaluations for Frontier AI (UK AISI, 2024)
- Distilled insights from building and running a lot of QA evals (including open-ended questions)
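Related to the Kadavath et al. entry above: a minimal sketch of how one might check whether a model’s stated confidence is calibrated against whether its answers were actually correct, using the Brier score and a simple binned expected calibration error. The example numbers are made up for illustration.

```python
# Minimal sketch of a calibration check: compare a model's stated confidence
# on QA items against whether its answers were actually correct.
# The example data below is made up for illustration.
import numpy as np

def brier_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    # Mean squared error between stated confidence and the 0/1 outcome.
    return float(np.mean((confidences - correct) ** 2))

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    # Weighted average gap between mean confidence and accuracy per bin.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return float(ece)

confs = np.array([0.9, 0.8, 0.6, 0.7, 0.95])  # model's stated P(correct)
right = np.array([1, 1, 0, 1, 1])             # whether each answer was correct

print(brier_score(confs, right), expected_calibration_error(confs, right))
```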
Red teaming
Core:
Other:
Scalable oversight
Core:
Other:
Scaling laws & emergent behaviors
Core:
Other:
Science tutorials
Core:
- Research as a Stochastic Decision Process (Steinhardt)
- Argues that you should do experiments in the order that maximizes information gained.
- We use this principle all the time and think it’s very important.
- Tips for Empirical Alignment Research (Ethan Perez, 2024)
- Detailed description of what success in empirical alignment research can look like
- We think it’s a great resource and aligns well with our own approach.
- You and Your Research (Hamming, 1986)
- Famous classic by Hamming. “What are the important problems of your field? And why are you not working on them?”
Other:
LLM capabilities
Core:
Other:
LLM steering
RLHF
Core:
Other:
Supervised Finetuning/Training & Prompting
Core:
Other:
Fairness, bias, and accountability
AI Governance
Core:
Other:
- METR: Responsible Scaling Policies (METR, 2023)
- Introduces RSPs and discusses their benefits and drawbacks
- OpenAI: Preparedness Framework (OpenAI, 2023)
- Specifies if-then relationships where specific events, e.g. evals passing, trigger concrete responses, e.g. enhanced cybersecurity, that OpenAI has committed to uphold.
- GoogleDeepMind: Frontier Safety Framework (GDM, 2024)
- Specifies if-then relationships where specific events, e.g. evals passing, trigger concrete responses, e.g. enhanced cybersecurity, that GDM has committed to uphold.
- Visibility into AI Agents (Chan et al., 2024)
- Discusses where and how AI agents are likely to be used. Then introduces various ideas for how society can keep track of what these agents are doing and how.
- Structured Access for Third-Party Research on Frontier AI Models (Bucknall et al., 2023)
- Describes a taxonomy of system access and makes recommendations for which level of access should be granted for which risk category.
- A Causal Framework for AI Regulation and Auditing (Sharkey et al., 2023)
- Defines a framework for thinking about AI regulation by backchaining from risks through the entire development pipeline to identify causal drivers and suggest potential mitigation strategies.
- Disclosure: Apollo paper
- Black-Box Access Is Insufficient for Rigorous AI Audits (Casper et al., 2024)
- Discusses the limitations of black-box auditing and proposes grey and white box evaluations as improvements.
- Disclosure: Apollo involvement
The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.