Web Scraping with your Web Browser: Why Not?

submitted by
Style Pass
2024-10-01 18:00:02

You can find plenty of tutorials on the Internet about the art of web scraping (for example, here and here), and the first things you will learn about are Python and Beautiful Soup. There are no tutorials on web scraping with JavaScript in a web browser, though you will find browser extensions that claim to do it all without any coding (this only works for simple, unprotected websites). So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

Web scraping has a long history, even pre-dating the advent of JavaScript in its current incarnation. Though JavaScript was first introduced by Netscape in 1995, it would take another decade for it to mature into a language suitable for much more than managing the presentation of a web page. Python was introduced in 1991 and was a fairly mature language from the start. Due to simple inertia, it remains the predominant tool for web scraping. Over the past several years, Node.js has seen rising interest, and there are scraping tools written in JavaScript for Node.js, but still nothing for web browsers.

One of the issues is CORS (Cross-Origin Resource Sharing), a browser-enforced mechanism in which the server's response headers determine whether JavaScript running on a page from one origin may read a resource from another origin. There are two possible workarounds: a browser extension or a proxy server. The first is fairly limited since some security restrictions still apply. The second is more flexible but requires an external resource. Python gets around the CORS limitations because no web browser is involved. Unfortunately, this also means that additional support is required to parse HTML or JSON, and perhaps even to execute JavaScript in cases where a resource is protected by obfuscation.
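To make the CORS restriction and the proxy workaround more concrete, here is a minimal JavaScript sketch. It is a simplified model, not a complete description of what browsers do: `corsAllowsRead` only covers a simple GET request without credentials, and the proxy endpoint `https://proxy.example.com/fetch` with its `url` query parameter is a hypothetical service invented for illustration. The function names are assumptions as well.

```javascript
// Simplified model of the browser's CORS decision for a simple GET request
// made without credentials. Real browsers also handle preflight requests,
// credentials, and several other Access-Control-* headers.
function corsAllowsRead(pageOrigin, targetUrl, allowOriginHeader) {
  const targetOrigin = new URL(targetUrl).origin;
  // Same-origin requests are not subject to CORS.
  if (targetOrigin === pageOrigin) return true;
  // Cross-origin: the server must opt in via Access-Control-Allow-Origin.
  if (allowOriginHeader == null) return false;
  const value = allowOriginHeader.trim();
  return value === "*" || value === pageOrigin;
}

// Hypothetical proxy: it would fetch the target server-side and return the
// body with permissive CORS headers. The endpoint and the "url" parameter
// are assumptions, not a real service.
function viaProxy(targetUrl, proxyBase = "https://proxy.example.com/fetch") {
  const proxied = new URL(proxyBase);
  proxied.searchParams.set("url", targetUrl);
  return proxied.toString();
}

// Browser-only part: fetch through the proxy and parse the HTML with the
// built-in DOMParser (not available in plain Node.js).
async function scrapeLinks(targetUrl) {
  const response = await fetch(viaProxy(targetUrl));
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return Array.from(doc.querySelectorAll("a[href]"), (a) => a.getAttribute("href"));
}
```

Note that the parsing step relies only on what the browser already ships (`fetch`, `DOMParser`, `querySelectorAll`), which is the counterpart to the extra parsing support a Python scraper has to bring along.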
