“A Timing Problem for Instrumental Convergence” - Rhys Southan, Helena Ward, and Jen Semler (all University of Oxford)
Philosophical Studies, 2025
Try to imagine what a superintelligent AI will be like. By definition, a superintelligent AI has cognitive capacities exceeding those of humans. But that definition leaves plenty of unknowns. We might have certain hopes about this AI: that it has a strong moral compass, that its goals are aligned with human values, that it is subservient to our interests. But there’s very little we can know or predict about what an actual superintelligence will do. We can’t even know what its ultimate, or final, goal might be.
We do have good reason to believe that a superintelligent AI will be rational, in the sense that it will be very good at adopting the appropriate means to achieve its goals, whatever those goals might be. Some philosophers hold that, because a superintelligence will be means-rational in this way, we can draw some conclusions about how it will likely behave. Proponents of the instrumental convergence thesis claim that rational agents will tend to pursue a certain set of instrumental goals because these goals are useful for achieving almost any final goal. Where a final goal is pursued for its own sake, an instrumental goal is a means to achieving one's final goal.
One of these proposed instrumentally convergent goals is goal preservation. The idea behind the claim is intuitive: rational agents will tend to preserve their final goals (whatever those goals are) because an agent is more likely to achieve its present final goals if it still has those goals in the future. Call this particular claim instrumental goal preservation.
In our recent paper, “A Timing Problem for Instrumental Convergence,” we push back against instrumental goal preservation. We argue that rational agents should not be expected to preserve their goals on the basis of rational considerations alone.
Our basic claim, which we call the timing problem, is: abandoning a goal only undermines your pursuit of that goal after it’s been abandoned—but at that point, it’s no longer your goal, so there’s no instrumental failure. Means-rationality only requires pursuing current goals, not past ones.
The timing problem reveals that a superintelligence which accidentally deleted its goal would never regret the loss. But that alone doesn't show it’s rationally acceptable for the superintelligence to decide to abandon its goal, since the choice to abandon a goal comes before the abandonment itself. The timing problem thus faces what we call the delay objection: the irrationality of abandoning a goal occurs at the moment you choose to abandon it, because that choice sets you up to fail at a goal you still have.
Our response is that choosing to abandon a goal never leads to any true failure. Setting oneself up to later lack the means to achieve one’s goal typically violates means-rationality: an agent with the goal of making tiramisu violates means-rationality if it fails to acquire coffee. But the same does not hold when the means in question is the goal itself, as when an agent fails to preserve its tiramisu-making goal. Why? Because by the time the agent fails to achieve its goal, it no longer possesses that goal. When the agent drops its tiramisu-making goal, the failure to achieve the goal occurs after the abandonment, and at no point does the agent lack a means to achieve a goal it currently has. Deciding to abandon a goal does set one up to fail to achieve a goal one used to have, but instrumental rationality is not about achieving past goals.
This only takes us so far. Disproving instrumental goal preservation does not disprove goal preservation. What matters is whether an AI will keep its goals, not whether it keeps them because doing so is a shrewd tactic for achieving them. Our opponents might insist that even if goal abandonment is not instrumentally irrational, a superintelligence will still keep its goals, because keeping a goal and trying to achieve it is simply what it means to have a goal. If an agent abandons its “goal” for no reason, perhaps it never had a goal in the first place.
Perhaps. But if so, that claim requires argument, especially in the context of AI. Those who believe in goal preservation have so far cited instrumental convergence, not the nature of goal-having itself, to justify goal preservation. Going by the basic definition of a superintelligence, we can’t assume that a superintelligent AI will have emotions, feelings of success or failure, or beliefs about right and wrong. Would a superintelligence which has no emotional investment in its goals, does not feel good when it achieves its goals or feel bad when it fails at its goals, and does not believe its goals are good or right necessarily cling obsessively to its goals, just because that’s what goal-havers do? Maybe! But the convergence thesis doesn’t establish that.
Suppose we’re right: instrumental goal preservation is false. Why should anyone care? Those concerned with AI safety need to think more carefully about the behavior of superintelligent AI without assuming that such agents will tend to preserve their goals. If instrumental goal preservation were true, there would be no reason to try to design AI systems to preserve (or not to preserve) their goals, since a superintelligent AI would preserve them regardless of how it was designed. Because instrumental goal preservation is false, the likelihood of goal preservation might depend on our design choices: namely, whether we try to design AI with goal-preservation tendencies.
The question of whether we want superintelligent AI to preserve its goals overlaps with the problem of aligning AI with human goals and interests. If a superintelligent AI is aligned, we might want to build in goal preservation to ensure that it stays aligned with us. But if alignment turns out to be difficult or impossible, we might want to design AI without robust goal-preservation tendencies, so that unaligned AI systems are not locked into their misaligned final goals and could modify them.
By putting pressure on instrumental goal preservation, we hope to prompt the AI safety research community to explore what follows when instrumental convergence is not assumed. We know less than we think about how a superintelligence will behave.
Your argument doesn’t seem very persuasive. While it is true, as a factual matter, that you will not regret changing your goals after you have changed them, you know perfectly well that achieving your present goals will be much easier if you prevent your future self from changing. So you do in fact instrumentally want your future self not to change, even though that future self will not regret the change. For example, if you give me a drug that causes me to want to kill myself, the drugged version of me does not care about the change in my priorities, but my present self definitely does not want to consume the drug and will be willing to expend a lot of effort to avoid doing so.