Pandas API on Apache Spark - Part 1: Introduction

Submitted by Style Pass on 2021-07-21 09:30:03

Apache Spark has revolutionised the data science field with its support for big data. With its support for multiple languages such as Scala and Python, it has made big data analysis available to a wide variety of developers.

Python is the leading language preferred by the data science community. Even within the Spark community, the Python API has seen a tremendous upsurge in the last few years. According to Databricks, the company behind Apache Spark, 60% of the commands written in their notebooks are in Python, compared to 23% in Scala.

Spark has excellent support for Python through the PySpark project. PySpark allows developers to access all the different parts of Spark, such as SQL and ML, using the Python language. Still, it has not yet reached the wider Python community. The reason is that the majority of Python data developers prefer the Pandas API.

Pandas is the de facto library that the Python data science community uses to manipulate data. Pandas also integrates seamlessly with other Python libraries such as Plotly and scikit-learn. So most developers are very comfortable with its API.
