Presumably these are being built right now. But which texts will they be trained upon? Let’s say you can keep out any talk of T. Square. What about broader Chinese history? Do you allow English-language sources? Japanese-language accounts of the war with Japan? Do you allow economics blogs in English? JSTOR? Discussions of John Stuart Mill on free speech?
Just how good is the Chinese-language, censorship-passed body of training data? Does China end up with a much worse set of LLMs? Or do they in essence anglicize most of what they learn and in time know?
Pre-LLM news censorship was an easier problem, because you could let the stock sit in a library somewhere, mostly neglected, while regulating the flow. But what happens when the new flow is so directly derived from the stock, statistically speaking that is? What then?