Read and Transform CSV to DataFrame with a Defined Schema Using Apache Spark and Java

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark makes reading large CSV files against a defined schema quick and efficient, in just a few lines of code.

1 Create a CSV File

id,name,number,date
1,product 1,5,22/09/2019
2,product 2,10,20/10/2020

2 Create Java function to read CSV file using Spark

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkCSV {

    public static void csvToDataframe() {

        // Create a Spark session
        SparkSession spark = SparkSession.builder()
                .appName("CSV to Dataframe")
                .master("local")
                .getOrCreate();

        // Create our schema
        StructType schema = DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField(
                        "id",
                        DataTypes.IntegerType,
                        false),
                DataTypes.createStructField(
                        "name",
                        DataTypes.StringType,
                        true),
                DataTypes.createStructField(
                        "number",
                        DataTypes.StringType,
                        false),
                DataTypes.createStructField(
                        "date",
                        DataTypes.DateType,
                        false)
        });

        // Read the CSV file and convert it to a Dataframe
        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")
                .option("multiLine", false)
                .option("sep", ",")
                .option("dateFormat", "dd/MM/yyyy")
                .schema(schema)
                .load("yamicode.csv");

        // Show the first 10 rows   
        df.show(10);

        // Show the schema
        df.printSchema();
    }

}
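One detail worth calling out: the letters in the dateFormat pattern are case-sensitive. "MM" means month-of-year, while lowercase "mm" means minute-of-hour, so a pattern like dd/mm/yyyy will not parse these dates correctly. A quick check with the standard java.time API (whose pattern letters Spark 3.x follows for date parsing; the class name DateFormatCheck is illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateFormatCheck {
    public static void main(String[] args) {
        // "MM" is month-of-year; lowercase "mm" would be minute-of-hour
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        LocalDate date = LocalDate.parse("22/09/2019", fmt);
        System.out.println(date); // 2019-09-22
    }
}
```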

The createStructField function takes 3 parameters:

  • Name: the name of the column (it doesn't have to match the header in the CSV file; columns are matched by position)

  • Type: the data type of the column

  • Nullable: whether the column accepts null values

3 Result

+---+---------+------+----------+
| id|     name|number|      date|
+---+---------+------+----------+
|  1|product 1|     5|2019-09-22|
|  2|product 2|    10|2020-10-20|
+---+---------+------+----------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- number: string (nullable = true)
 |-- date: date (nullable = true)

Note that every column is reported as nullable, even though the schema declared some fields non-nullable: when reading from file-based sources such as CSV, Spark does not enforce the schema's nullability flags and marks all columns as nullable.