This is exceedingly simple. But if you don't find tricks to speed things up, even at 1,000 pairs examined per second it will take about 230 days to examine all 20 billion pairs 🥲.
Using OpenAlex data, I retrieved 20 million journal articles, resulting in a 65 GB JSON file. From each article I extracted the journal name and the author IDs, producing a list where each entry is a journal together with all the authors who published in it. There are 200,000 journals, and this list of journals with their authors fills a 0.9 GB file.
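The extraction step can be sketched as follows. This is a minimal illustration on made-up records, assuming newline-delimited JSON and the field names of the OpenAlex work schema (`primary_location.source.display_name`, `authorships[].author.id`); an actual snapshot may nest things differently.

```python
import json
from collections import defaultdict

# Hypothetical input: one OpenAlex work per line (newline-delimited JSON).
# Journal names and author IDs below are invented for illustration.
sample_lines = [
    '{"primary_location": {"source": {"display_name": "Nature"}}, '
    '"authorships": [{"author": {"id": "A1"}}, {"author": {"id": "A2"}}]}',
    '{"primary_location": {"source": {"display_name": "Science"}}, '
    '"authorships": [{"author": {"id": "A2"}}]}',
    '{"primary_location": {"source": {"display_name": "Nature"}}, '
    '"authorships": [{"author": {"id": "A3"}}]}',
]

# Map each journal to the set of author IDs who published in it.
journal_authors = defaultdict(set)
for line in sample_lines:
    work = json.loads(line)
    source = (work.get("primary_location") or {}).get("source") or {}
    journal = source.get("display_name")
    if not journal:
        continue  # skip works with no identifiable journal
    for authorship in work.get("authorships", []):
        author_id = (authorship.get("author") or {}).get("id")
        if author_id:
            journal_authors[journal].add(author_id)

print({j: sorted(a) for j, a in journal_authors.items()})
```

On the real 65 GB file the same loop would stream line by line instead of holding everything in memory.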
Now the goal is to compute the similarity between each pair of journals, "similarity" being measured as the number of authors who published in both.
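With each journal represented as a set of author IDs, this similarity is just the size of a set intersection. A minimal sketch, with made-up journal names and IDs:

```python
# Each journal maps to the set of author IDs who published in it
# (hypothetical data for illustration).
authors = {
    "Journal A": {"A1", "A2", "A3", "A4"},
    "Journal B": {"A3", "A4", "A5"},
}

def similarity(a: set, b: set) -> int:
    """Number of authors who published in both journals."""
    return len(a & b)

print(similarity(authors["Journal A"], authors["Journal B"]))  # 2
```

Python's set intersection already iterates over the smaller of the two sets, so each comparison is cheap; the problem is the sheer number of comparisons.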
This makes 20 billion pairs to evaluate: 200,000 × 200,000 divided by 2 (because once we have compared journal A to journal B, there is no need to compare B to A again).
It is a conscious decision to avoid clusters, GPUs, and cloud infrastructure where possible, in order to lower the barrier to entry for those who would like to contribute, or simply fork and run the project themselves.