Read and Transform a CSV into a DataFrame Using Apache Spark and Java

Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, and it makes reading large CSV files quick and efficient in just a few lines of code.

1 Create a CSV file

id,name,number,date
1,product 1,5,22/09/2019
2,product 2,10,20/10/2020
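
To follow along, the sample file can be written with plain Java (the name yamicode.csv matches the path loaded in the Spark code below):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CreateSampleCsv {
    public static void main(String[] args) throws IOException {
        // Write the two-row sample CSV, header line first.
        String csv = String.join(System.lineSeparator(),
                "id,name,number,date",
                "1,product 1,5,22/09/2019",
                "2,product 2,10,20/10/2020");
        Files.writeString(Path.of("yamicode.csv"), csv);
    }
}
```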

2 Create a Java function to read the CSV file using Spark

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkCSV {
	
	public static void csvToDataframe() {

		//Create a spark session
		SparkSession spark = SparkSession.builder()
		        .appName("CSV to Dataframe")
		        .master("local")
		        .getOrCreate();
		 
		// Reading CSV file and converting it to Dataframe
		Dataset<Row> df = spark.read().format("csv")
		    .option("header", true)             // first line is a header row
		    .option("multiLine", false)         // each record fits on one line
		    .option("sep", ",")                 // column separator
		    .option("dateFormat", "dd/MM/yyyy") // MM = month; mm would mean minutes
		    .option("inferSchema", true)        // let Spark infer column types
		    .load("yamicode.csv");
		 	
		// Show the first 10 rows
		df.show(10);
	}
	
}

  • header: Whether the file contains a header row

  • multiLine: Whether a single record can span multiple lines

  • sep: The separator between columns (such as `,`, `;`, or `|`)

  • dateFormat: The date pattern Spark uses when parsing values as a date type

  • inferSchema: Let Spark infer the schema of the columns from the data
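
One detail worth checking: Spark's dateFormat option uses Java date patterns, where uppercase MM means month and lowercase mm means minutes, so dd/MM/yyyy is the pattern that matches the sample dates. A quick standalone check with java.time, independent of Spark:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateFormatCheck {
    public static void main(String[] args) {
        // "dd/MM/yyyy" parses the sample value 22/09/2019 as a date;
        // "dd/mm/yyyy" would treat 09 as minutes and fail to give a month.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        LocalDate d = LocalDate.parse("22/09/2019", fmt);
        System.out.println(d); // prints 2019-09-22
    }
}
```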