I can't think of a single large software company that doesn't regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possible require hundreds, or even thousands of engineers?
A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google's $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn't a trivial problem. But in the comments on Alex's posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google's capabilities in the next few years. It's been long enough since then that we can look back and say that Lucene hasn't improved so much that Google is in danger from a startup that puts together a Lucene cluster. If anything, the cost of creating a viable competitor to Google search has gone up.
For making a viable Google competitor, I believe that ranking is a harder problem than indexing, but even if we just look at indexing, there are individual domains that contain on the order of one trillion pages we might want to index (like Twitter) and I'd guess that we can find on the order a trillion domains. If you try to configure any off-the-shelf search index to hold an index of some number of trillions of items to handle a load of, say, 1/100th Google's load, with a latency budget of, say, 100ms (most of the latency should be for ranking, not indexing), I think you'll find that this isn't trivial. And if you use Google to search Twitter, you can observe that, at least for select users or tweets, Google indexes Twitter quickly enough that it's basically real-time from the standpoint of users. Anyone who's tried to do real-time indexing with Lucene on a large corpus under high load will also find this to be non-trivial. You might say that this isn't totally fair since it's possible to find tweets that aren't indexed by major search engines, but if you want to make a call on what to index or not, well, that's also a problem that's non trivial in the general case. And we're only talking about indexing here, indexing is one of the easier parts of building a search engine.