Experts in LLMs
When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
https://arxiv.org/pdf/2311.10054 Code Repo
(1) Does adding personas to system prompts help improve model performance on objective tasks? (2) Does the social construct of the persona affect model performance? (3) What factors could potentially explain the effect of personas on model performance? (4) Can we automatically identify the best roles for prompting? Through our analysis, we find that, in general, prompting with personas has no or small negative effects on model performance compared with the control setting where no persona is added.
Our study makes the following three contributions. First, we introduce a new pipeline to systematically evaluate LLMs’ performance when prompted with a wide range of personas. Second, our large-scale experiments reveal an important finding that prompting LLMs with personas might actually hurt their performance on objective tasks. Third, through analyzing a wide range of persona attributes and automatic role-searching strategies, we find that the effect of personas on model performance is not consistent across questions: while there may exist a persona that leads to the correct answer for each question, identifying that persona in advance is largely unpredictable.
Relevant Code (why?)
Ngram Website (interesting)
Review: really simple paper - it does little more than answer questions and report a bit of statistics. Nice base setup for an LLM project where we just want to see the final result of how accurate an LLM is.
Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks
https://arxiv.org/pdf/2408.08631
Look at that confusion matrix comparing a neutral reply to a role-based reply.
Basic idea: we run two solvers, one with the persona and one without. If the responses we get from them are not the same, we run an evaluator that judges which of the two responses is better.
I would dig deeper if they had some code.
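Since there is no code, here is a minimal sketch of that pipeline as I understand it; `llm(system, user)` is a hypothetical stand-in for a chat API, and the judge prompt is my own wording, not the paper’s.

```python
# Minimal sketch of the two-solver-plus-evaluator pipeline described above.
# `llm(system, user)` is a hypothetical stand-in for an actual chat model call.

def llm(system: str, user: str) -> str:
    raise NotImplementedError("call your chat model of choice here")

def solve_with_persona_check(question: str, persona: str) -> str:
    neutral = llm("You are a helpful assistant.", question)
    role = llm(persona, question)
    if neutral.strip() == role.strip():
        return neutral  # both solvers agree, no arbitration needed
    # The two solvers disagree: ask an evaluator which response is better.
    verdict = llm(
        "You are an impartial judge. Reply with 'A' or 'B' only.",
        f"Question: {question}\n\nAnswer A: {neutral}\n\nAnswer B: {role}\n\n"
        "Which answer is better?",
    )
    return role if verdict.strip().upper().startswith("B") else neutral
```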
ExpertPrompting: Instructing Large Language Models to be Distinguished Experts
https://arxiv.org/pdf/2305.14688
- Given an input instruction $q$, an aligned LLM (such as ChatGPT or Claude) produces an output $a$, which is the model’s direct response to the instruction.
  \[a = LLM(q)\]
- The expert identity $e_q$ is created by conditioning the LLM on multiple relevant instruction-identity pairs. The operator $\oplus$ represents concatenation or combination of these pairs, allowing the model to generate an identity description that encapsulates expertise.
  \[e_q = LLM(\{q_1, e_{q_1}\} \oplus \dots \oplus \{q_k, e_{q_k}\} \oplus q)\]
- By providing both the expert identity $e_q$ and the original instruction $q$, the LLM is expected to generate an improved response $\hat{a}$, which should be more authoritative and accurate compared to $a$.
  \[\hat{a} = LLM(\{e_q, q\})\]
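A minimal sketch of the three steps above, assuming a hypothetical `llm(prompt)` call and made-up exemplar pairs; the paper’s actual exemplars and prompt wording may differ.

```python
# Sketch of the ExpertPrompting recipe: first generate an expert identity e_q
# for the instruction q via in-context exemplars, then answer q conditioned on e_q.

def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM API of choice here")

# Hypothetical (q_i, e_{q_i}) exemplars pairing an instruction with an expert identity.
EXEMPLARS = [
    ("Explain how vaccines work.",
     "You are an immunologist with decades of clinical research experience..."),
    ("Review this Python function for bugs.",
     "You are a senior software engineer who specializes in code review..."),
]

def expert_identity(q: str) -> str:
    """e_q = LLM({q_1, e_{q_1}} (+) ... (+) {q_k, e_{q_k}} (+) q)"""
    shots = "\n\n".join(f"Instruction: {qi}\nExpert identity: {ei}" for qi, ei in EXEMPLARS)
    return llm(f"{shots}\n\nInstruction: {q}\nExpert identity:")

def expert_answer(q: str) -> str:
    """a_hat = LLM({e_q, q}): answer the instruction while conditioned on the identity."""
    e_q = expert_identity(q)
    return llm(f"{e_q}\n\nNow, as this expert, respond to the instruction:\n{q}")
```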
The most tilt-inducing paper was this one, but I guess the way they got their roles was actually better - it produces very descriptive answers (and this paper is basically just comparing the length of the responses - sigh).
These papers lay out a baseline idea of what role experts play in prompting LLMs.
Knowledge Localization: Mission Not Accomplished? Enter Query Localization!
https://openreview.net/pdf?id=tfyHbvFZ0K
(Need to come back to this)
Thought Vectors
Vectors do have meaning - this is especially visible in anything involving encoders. The question then becomes: if vectors have some underlying meaning, can we do some math with them and see the results ourselves?
For example, when you create a photo from the prompt “man with short hair wearing sunglasses”, a decomposition might look like
\[\text{Encoder}(x) \approx (2 \cdot d_{\text{smile}}) - (1.5 \cdot d_{\text{long-hair}}) + (4 \cdot d_{\text{sunglass}}) + (1 \cdot d_{\text{masculinity}})\]
(If we were to look at the same thing in LLMs, what kind of added functionality are we looking at?)
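A toy numpy sketch of that kind of vector arithmetic; the attribute directions here are random stand-ins rather than learned ones, so only the arithmetic itself is meaningful.

```python
# Toy illustration of the decomposition above: an encoding written as a weighted
# sum of attribute directions, which we can then edit and project back onto.
import numpy as np

rng = np.random.default_rng(0)
dim = 512
d_smile, d_long_hair, d_sunglass, d_masculinity = rng.normal(size=(4, dim))

# Encoder(x) ~= 2*d_smile - 1.5*d_long_hair + 4*d_sunglass + 1*d_masculinity
z = 2 * d_smile - 1.5 * d_long_hair + 4 * d_sunglass + 1 * d_masculinity

# "Edit" the image in latent space by pushing along the long-hair direction.
z_edited = z + 3 * d_long_hair

# Recover an attribute's coefficient by projecting onto its direction.
def coef(v, d):
    return v @ d / (d @ d)

print(f"sunglasses coefficient: {coef(z, d_sunglass):.2f}")          # ~= 4
print(f"long-hair coefficient: {coef(z, d_long_hair):.2f} -> "
      f"{coef(z_edited, d_long_hair):.2f}")                          # ~= -1.5 -> ~= 1.5
```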
Ablation Study Structure
Many papers have studied whether including an expert persona in the prompt improves the accuracy of an LM when answering questions, and in our previous paper we also observed an improvement [?]. In this paper we hope to understand what causes this improvement by identifying which components of the model architecture are responsible for it, using transcoders. While SAE features are often interpretable, they are typically dense combinations of many neurons, making it difficult to mechanistically trace how one feature influences another across layers. Previous work has used causal interventions and gradient-based techniques to study these interactions. Building on these ideas, we investigate how the presence of an expert in the prompt affects specific model features, aiming to identify which parts of the architecture drive the improvement in performance.
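As a rough starting point, here is a hedged sketch of comparing internal activations with and without the expert cue; it hooks raw MLP outputs of `gpt2` as a stand-in, whereas the actual study would look at transcoder/SAE features.

```python
# Sketch: compare per-layer MLP activations for an expert-prefixed prompt vs. a
# plain prompt. gpt2 is only a stand-in model; raw MLP outputs are not the
# transcoder/SAE features the study ultimately targets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mlp_activations(prompt: str) -> list:
    """Run the model once and capture the output of every MLP block."""
    acts = []
    hooks = [
        block.mlp.register_forward_hook(lambda _m, _i, out: acts.append(out.detach()))
        for block in model.transformer.h
    ]
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return acts

question = "What is the first-line treatment for type 2 diabetes?"
base = mlp_activations(question)
expert = mlp_activations("You are a medical expert. " + question)

# Compare the final-token activation, layer by layer.
for layer, (b, e) in enumerate(zip(base, expert)):
    diff = (e[0, -1] - b[0, -1]).norm().item()
    print(f"layer {layer:2d}: ||expert - base|| = {diff:.3f}")
```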
Ablation study first (this first step is taken from the SAE paper): we want to understand the direct and indirect effects of the presence of an expert in the prompt.
It states:
A classical example of the ubiquity of direct effects (Hesslow 1976) tells the story of a birth-control pill that is suspect of producing thrombosis in women and, at the same time, has a negative indirect effect on thrombosis by reducing the rate of pregnancies (pregnancy is known to encourage thrombosis). In this example, interest is focused on the direct effect of the pill because it represents a stable biological relationship that, unlike the total effect, is invariant to marital status and other factors that may affect women’s chances of getting pregnant or of sustaining pregnancy. This invariance makes the direct effect transportable across cultural and sociological boundaries and, hence, a more useful quantity in scientific explanation and policy analysis.
Taking this criterion as a guideline, the direct effect of $X$ on $Y$ (in our case $X=$gender $Y=$hiring) can roughly be defined as the response of $Y$ to change in $X$ (say from $X = x^\ast$ to $X = x$) while keeping all other accessible variables at their initial value, namely, the value they would have attained under $X = x^\ast$. This doubly-hypothetical criterion will be given precise mathematical formulation in Section 3, using the language and semantics of structural counterfactuals (Pearl 2000; chapter 7).
As a third example, one that illustrates the policymaking ramifications of direct and total effects, consider a drug treatment that has a side effect - headache. Patients who suffer from headache tend to take aspirin which, in turn may have its own effect on the disease or, may strengthen (or weaken) the impact of the drug on the disease. To determine how beneficial the drug is to the population as a whole, under existing patterns of aspirin usage, the total effect of the drug is the target of analysis, and the difference $P(Y_x = y) - P(Y_x^\ast = y)$ may serve to assist the decision, with $x$ and $x^\ast$ being any two treatment levels. However, to decide whether aspirin should be encouraged or discouraged during the treatment, the direct effect of the drug on the disease, both with aspirin and without aspirin, should be the target of investigation. The appropriate expression for analysis would then be the difference $P(Y_{xz} = y) - P(Y_{x^\ast z} = y)$, where $z$ stands for any specified level of aspirin intake.
Example: Controlled Direct Effect (Quantitative)
Suppose we are studying the effect of a medicine ($X$) on a patient’s health score ($Y$), while controlling for the patient’s diet ($Z$).
- $X = 0$ means the patient does not take the medicine.
- $X = 1$ means the patient takes the medicine.
- $Z$ represents the patient’s diet, and we fix it at a healthy setting.
We observe the following:
- When the patient does not take the medicine ($X = 0$) and has a healthy diet ($Z = \text{healthy}$), the health score is $Y_{0, \text{healthy}}(u) = 50$.
- When the patient takes the medicine ($X = 1$) and still has a healthy diet, the health score is $Y_{1, \text{healthy}}(u) = 80$.
The controlled direct effect of taking the medicine, while fixing the diet, is then calculated as:
\[CDE_{\text{healthy}}(0, 1; Y, u) = Y_{1, \text{healthy}}(u) - Y_{0, \text{healthy}}(u) = 80 - 50 = 30\]
Thus, the controlled direct effect is 30 points. This means that, holding the diet constant, taking the medicine increases the patient’s health score by 30 points.
We can also compute other measures:
- Ratio (multiplicative effect):
- \[\frac{Y_{1, \text{healthy}}(u)}{Y_{0, \text{healthy}}(u)} = \frac{80}{50} = 1.6\]
- Thus, the health score under the medicine is 1.6 times the score without the medicine.
- Proportional Difference (relative increase):
- \[\frac{Y_{1, \text{healthy}}(u) - Y_{0, \text{healthy}}(u)}{Y_{0, \text{healthy}}(u)} = \frac{80 - 50}{50} = \frac{30}{50} = 0.6\]
- Thus, taking the medicine increases the patient’s health score by 60%.
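The same arithmetic as a few lines of Python, just to keep the three quantities side by side (values taken from the example above):

```python
# Controlled direct effect, ratio, and proportional difference for the example:
# Y_{0,healthy}(u) = 50 and Y_{1,healthy}(u) = 80.
y0, y1 = 50, 80

cde = y1 - y0                 # controlled direct effect: 30 points
ratio = y1 / y0               # multiplicative effect: 1.6
prop_diff = (y1 - y0) / y0    # relative increase: 0.6, i.e. 60%

print(cde, ratio, prop_diff)  # 30 1.6 0.6
```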
1. Possible Direct Effect
If the presence of the expert in the prompt directly enhances the model’s output quality (e.g., by triggering better-aligned weights or activating more relevant knowledge), this would be a direct effect.
Mechanism:
The explicit mention of an expert may steer the model toward higher-confidence reasoning paths or authoritative sources without intermediate steps.
Example:
Prompt:
“As a medical expert, explain X.”
Model behavior:
- Model taps into medically trained patterns more directly.
Causal Path:
\(\text{Expert cue } (X) \rightarrow \text{Better output } (Y)\)
2. Possible Indirect Effect
If the expert reference improves performance by implicitly changing how the model processes intermediate steps (e.g., encouraging chain-of-thought reasoning or stricter verification), this would be an indirect effect.
Mechanism:
The expert label alters the model’s internal process (e.g., “I should be more careful/accurate”) rather than directly retrieving better answers.
Example:
Prompt:
“As a scientist, solve this step-by-step.”
Model behavior:
- Model engages in more rigorous reasoning first, leading to better output.
Causal Path:
\(\text{Expert cue } (X) \rightarrow \text{Deliberate reasoning } (M) \rightarrow \text{Better output } (Y)\)
Key Consideration: Task Dependency
- Direct Effect Likely:
  If the task relies on retrieval (e.g., factual QA) and the expert cue activates domain-specific knowledge.
- Indirect Effect Likely:
  If the task requires reasoning (e.g., math, analysis) and the expert cue induces behavioral changes (e.g., showing work).
How to Test This?
- Ablation Study:
  Remove the expert reference but keep the rest of the prompt structure identical.
  If performance drops sharply, it suggests a direct role.
- Process Tracing:
  Analyze whether the expert cue changes the model’s intermediate steps (e.g., longer chains of thought).
  If so, it’s likely an indirect effect. (A rough sketch of both tests follows below.)
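A rough sketch of both tests, assuming hypothetical `llm(system, user)` and `is_correct(answer, gold)` helpers and a dataset of (question, gold answer) pairs; response length in whitespace tokens is used as a crude proxy for the amount of intermediate reasoning.

```python
# Ablation study: same questions with and without the expert cue.
# Process tracing (proxy): does the expert cue lengthen the model's reasoning?

def llm(system: str, user: str) -> str:
    raise NotImplementedError("call your chat model here")

def is_correct(answer: str, gold: str) -> bool:
    raise NotImplementedError("plug in your grader here")

def ablation_and_tracing(dataset, persona="You are a domain expert."):
    neutral_acc = expert_acc = 0
    neutral_len = expert_len = 0
    for question, gold in dataset:
        a_neutral = llm("You are a helpful assistant.", question)
        a_expert = llm(persona, question)
        neutral_acc += is_correct(a_neutral, gold)
        expert_acc += is_correct(a_expert, gold)
        neutral_len += len(a_neutral.split())
        expert_len += len(a_expert.split())
    n = len(dataset)
    print(f"accuracy: neutral {neutral_acc / n:.2%} vs expert {expert_acc / n:.2%}")
    print(f"avg length: neutral {neutral_len / n:.0f} vs expert {expert_len / n:.0f} tokens")
```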
Basically, we want to understand both the direct effect and the indirect effect - so the paper shall be divided into two parts.
We need to do a study in mechanistic interpretability (Anthropic blog on In-context Learning and Induction Heads). For CNNs we can sort of gauge what is happening by using the weights in each of the layers and trying to interpret the features they encode (OpenAI blog on CNN interpretability). I think the start of all this requires a breakdown of the transformer architecture (A Mathematical Framework for Transformer Circuits).
[2025/4/7] Okay, so I am basically going to do another blog before this one and understand transformers then come back here.
https://www.youtube.com/watch?v=XfpMkf4rD6E&t=199s&ab_channel=StanfordOnline
https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy