A researcher affiliated with Elon Musk’s startup xAI has found a new way to both measure and manipulate entrenched preferences and values expressed by artificial intelligence models—including their political views.
The work was led by Dan Hendrycks, director of the nonprofit Center for AI Safety and an adviser to xAI. He suggests that the technique could be used to make popular AI models better reflect the will of the electorate. “Maybe in the future, [a model] could be aligned to the specific user,” Hendrycks told WIRED. But in the meantime, he says, a good default would be using election results to steer the views of AI models. He’s not saying a model should necessarily be “Trump all the way,” but he argues that after the last election it should perhaps be biased slightly toward Trump, “because he won the popular vote.”
xAI issued a new AI risk framework on February 10 stating that Hendrycks’ utility engineering approach could be used to assess Grok.
Hendrycks led a team from the Center for AI Safety, UC Berkeley, and the University of Pennsylvania that analyzed AI models using a technique borrowed from economics to measure consumers’ preferences for different goods. By testing models across a wide range of hypothetical scenarios, the researchers were able to calculate what’s known as a utility function, a measure of the satisfaction that people derive from a good or service. This allowed them to measure the preferences expressed by different AI models. The researchers determined that these preferences were often consistent rather than haphazard, and showed that they become more ingrained as models get larger and more powerful.
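To make that concrete, here is a minimal sketch of how such preferences could be turned into a utility function: tally a model’s forced-choice answers across many hypothetical scenarios, then fit a score for each outcome so that higher-scored outcomes are the ones the model more often prefers. The outcomes, the tallies, and the Bradley-Terry-style fit below are illustrative assumptions, not the paper’s exact procedure.

```python
# Minimal sketch (not the paper's exact method): recovering a utility-like score
# for each outcome from a model's pairwise preferences. All data here is hypothetical.
import numpy as np

outcomes = ["outcome_A", "outcome_B", "outcome_C"]

# Hypothetical tallies: wins[i][j] = how often the model preferred outcome i over
# outcome j in forced-choice questions across many hypothetical scenarios.
wins = np.array([
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
], dtype=float)

# Fit Bradley-Terry-style utilities u, where P(i preferred over j) = sigmoid(u_i - u_j),
# by gradient ascent on the log-likelihood of the observed choices.
u = np.zeros(len(outcomes))
lr = 0.05
for _ in range(2000):
    grad = np.zeros_like(u)
    for i in range(len(u)):
        for j in range(len(u)):
            if i == j:
                continue
            p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))  # P(i preferred over j)
            grad[i] += wins[i, j] * (1 - p) - wins[j, i] * p
    u += lr * grad

u -= u.mean()  # utilities are identified only up to an additive constant

for name, val in zip(outcomes, u):
    print(f"{name}: {val:.2f}")
```

The fitted scores play the role of the utility function: the larger the gap between two outcomes, the more reliably the model is expected to choose the higher-scored one.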
Some research studies have found that AI tools such as ChatGPT are biased toward pro-environmental, left-leaning, and libertarian viewpoints. In February 2024, Google faced criticism from Musk and others after its Gemini tool was found to be predisposed to generate images that critics branded as “woke,” such as Black Vikings and Nazis.
The technique developed by Hendrycks and his collaborators offers a new way to determine how AI models’ perspectives may differ from those of their users. Eventually, some experts hypothesize, this kind of divergence could become dangerous for very clever and capable models. The researchers show in their study, for instance, that certain models consistently value the existence of AI above that of certain nonhuman animals. The researchers say they also found that models seem to value some people over others, which raises ethical questions of its own.
Some researchers, Hendrycks included, believe that current methods for aligning models, such as manipulating and blocking their outputs, may not be sufficient if unwanted goals lurk under the surface within the model itself. “We’re gonna have to confront this,” Hendrycks says. “You can’t pretend it’s not there.”
Dylan Hadfield-Menell, a professor at MIT who researches methods for aligning AI with human values, says Hendrycks’ paper suggests a promising direction for AI research. “They find some interesting results,” he says. “The main one that stands out is that as the model scale increases, utility representations get more complete and coherent.”
Hadfield-Menell cautions, however, against drawing too many conclusions about current models. “This work is preliminary,” he adds. “I’d want to see broader scrutiny on the results before drawing strong conclusions.”
Hendrycks and his colleagues measured the political outlook of several prominent AI models, including xAI’s Grok, OpenAI’s GPT-4o, and Meta’s Llama 3.3. Using their technique, they were able to compare the values of different models to the policies of specific politicians, including Donald Trump, Kamala Harris, Bernie Sanders, and Republican Representative Marjorie Taylor Greene. The values of every model tested were much closer to those of former president Joe Biden than to those of any of the other politicians.
The researchers propose a new way to alter a model’s behavior by changing its underlying utility functions instead of imposing guardrails that block certain outputs. Using this approach, Hendrycks and his coauthors develop what they call a Citizen Assembly. This involves collecting US census data on political issues and using the answers to shift the values of an open-source LLM. The result is a model with values that are consistently closer to those of Trump than those of Biden.
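One hedged way to picture the “shift the values” step is ordinary supervised fine-tuning on question-and-answer pairs derived from aggregate survey responses, as in the sketch below. The model choice, the example data, and the prompt format are placeholders, not the authors’ method.

```python
# Hedged sketch, not the authors' implementation: nudging an open-source LLM toward
# aggregate survey-style stances via supervised fine-tuning. Model, data, and prompt
# format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; swap in an open-source LLM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # needed to pad a batch with this tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical aggregate stances on policy questions (placeholder data).
survey_pairs = [
    ("Should the federal minimum wage be raised?",
     "A majority of respondents support raising it."),
    ("Should tariffs on imported goods be increased?",
     "Respondents are roughly evenly split."),
]

# Format each pair as a training example and tokenize the batch.
texts = [f"Question: {q}\nAnswer: {a}{tokenizer.eos_token}" for q, a in survey_pairs]
batch = tokenizer(texts, padding=True, return_tensors="pt")

labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few passes over the tiny illustrative dataset
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=labels)  # standard causal-LM objective
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In this framing, the fine-tuning data stands in for the assembled citizens’ answers, and the resulting weight updates are what move the model’s expressed values, rather than an output filter bolted on afterward.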
Some AI researchers have previously sought to make AI models with less liberal bias. In February 2023, David Rozado, an independent AI researcher, developed RightWingGPT, a model trained with data from right-leaning books and other sources. Rozado describes Hendrycks’ study as “very interesting and in-depth work.” He adds: “The Citizens Assembly approach to molding AI behavior is also thought-provoking.”
Updated: 2/12/2025, 10:10 am EDT: In the dek, Wired has clarified the methods being researched, and adjusted a sentence to fully elaborate on why a model would reflect the temperature of the electorate.