Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun

The above is a map of all Hacker News posts since its founding, laid out semantically, i.e. positioned so that related posts sit near each other and distances carry meaning. I've been building it, along with some other interesting tools, over the past few weeks as a way to play around with text embeddings. Given that HN has a lot of interesting, curated content and exposes all of it programmatically, it seemed like a fun place to start.

A quick primer on embeddings: they are a powerful way to represent something (in this case, text) as a point in a high-dimensional space, which in practical terms just means an array of floats, one per dimension. The absolute position of a point doesn't mean much; the usefulness comes from positions relative to each other, because "similar" things should end up nearby while dissimilar things end up far apart. Text embeddings these days usually come from language models, given their state-of-the-art understanding of the meaning of text, and generating them is fairly trivial thanks to high-quality open source models and libraries that are freely accessible to anyone with a CPU or GPU.
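To make this concrete, here's a minimal sketch of generating embeddings with an open-source model. It uses the sentence-transformers library and a small illustrative model; the post doesn't specify which library or model this project actually used.

```python
# Minimal sketch: generating text embeddings with an open-source model.
# Library and model choice here are illustrative, not necessarily what
# this project used.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, runs fine on CPU

texts = [
    "Show HN: A tool for visualizing text embeddings",
    "Ask HN: What embedding model do you use?",
    "The history of the C programming language",
]
# encode() returns one vector per text; normalizing the vectors makes
# cosine similarity a plain dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this model

# Semantically related texts should score higher than unrelated ones.
similarity = embeddings @ embeddings.T
print(np.round(similarity, 3))
```

The first two texts should score noticeably more similar to each other than either does to the third, which is the property everything else in this post builds on.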

These sounded pretty interesting, so I decided to dive right in. In this blog post, I'll lay out my journey from no data and no code to interactive search, analysis, and spatial visualization tools built on millions of HN posts and comments, dive into the diverse problems and solutions that came up along the way, and hopefully give you some indication of (and motivation to explore) the power and applicability of embeddings across many areas.
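As a taste of the spatial visualization part: a semantic map like the one above can be produced by projecting high-dimensional embeddings down to 2D. The post doesn't name the projection technique it used; below is a sketch using UMAP, one common choice, on stand-in data.

```python
# Sketch: projecting embeddings to a 2D "semantic map" with UMAP
# (assumes the umap-learn and matplotlib packages; UMAP is an assumption
# here, not necessarily the technique used for the map above).
import numpy as np
import umap
import matplotlib.pyplot as plt

# Stand-in for real post embeddings; in practice these would come from
# a model as in the earlier sketch.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)

# Cosine distance matches how the embeddings are compared for similarity.
reducer = umap.UMAP(n_components=2, metric="cosine")
coords = reducer.fit_transform(embeddings)  # shape: (1000, 2)

# Nearby points in the plot should correspond to semantically similar posts.
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("2D semantic map of post embeddings")
plt.show()
```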
