In this post, we will look at how to submit a Spark job on AWS EMR. For this, we have an application that converts given files to Avro or Parquet format using Apache Spark.
Based on the structure above, the project folder contains all the configuration, including the build properties and sbt setup. The target folder contains the build artifacts, including the assembly JAR. The src folder holds all the source code: the business logic that converts the file types, and the unit test suites written with the ScalaTest framework.
To run any Spark application, we first need to initialise a Spark session. This is done in the src/main/scala/au/fayaz/Utils.scala file:
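A minimal sketch of what that helper might look like; the object and method names follow the post, but the exact builder options in the real Utils.scala may differ:

```scala
import org.apache.spark.sql.SparkSession

object Utils {

  // Hypothetical sketch: builds (or reuses) a SparkSession.
  // On EMR, master and cluster settings come from spark-submit,
  // so we only set an application name here.
  def initiateSpark(appName: String = "FileConverter"): SparkSession =
    SparkSession.builder()
      .appName(appName)
      .getOrCreate()
}
```

Using `getOrCreate()` means the same session is reused if one already exists, which also makes the helper safe to call from unit tests.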
The initiateSpark function returns a session we can use for our DataFrame operations. In this example, we convert the input file to Parquet or Avro based on a parameter.
The writeOutput function takes a dataframe, an outputPath and a format. Based on the format, it performs a simple write operation that saves the file in the appropriate format with the right compression.
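A sketch of how writeOutput could be implemented, assuming Snappy compression for both formats and the spark-avro package on the classpath; the parameter names follow the post, the body is illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object Output {

  // Hypothetical sketch: writes the dataframe to outputPath in the
  // requested format, with Snappy compression in both cases.
  def writeOutput(dataframe: DataFrame, outputPath: String, format: String): Unit =
    format.toLowerCase match {
      case "parquet" =>
        dataframe.write
          .mode(SaveMode.Overwrite)
          .option("compression", "snappy")
          .parquet(outputPath)
      case "avro" =>
        // Requires the org.apache.spark:spark-avro dependency
        dataframe.write
          .mode(SaveMode.Overwrite)
          .format("avro")
          .option("compression", "snappy")
          .save(outputPath)
      case other =>
        throw new IllegalArgumentException(s"Unsupported format: $other")
    }
}
```

Matching on the lowercased format string keeps the call site forgiving ("Avro" and "avro" both work), and failing fast on an unknown format surfaces configuration mistakes before any data is written.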