In order to make safer AI, we need to understand why it actually does unsafe things: why systems optimizing seemingly benign objectives can nevertheless pursue strategies misaligned with human values or intentions. Otherwise we run the risk of playing whack-a-mole, where patterns that violate our intended constraints on the AI's behavior keep re-emerging under the right conditions.


"we trained it on records of humans and now it responds like a human! how could this happen???”