Read and transform JSON to a Dataframe using Apache Spark and Java

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark makes reading large JSON files quick and efficient, in just a few lines of code.

1 Create a JSON file named yamicode.json

[
  {
    "id": 1,
    "name": "product 1",
    "number": 50,
    "date": "22/09/2019",
    "subCategories": [
      { "id": 3, "name": "sub1" }
    ]
  },
  {
    "id": 2,
    "name": "product 2",
    "number": 25,
    "date": "21/08/2020",
    "subCategories": [
      { "id": 4, "name": "sub2" }
    ]
  }
]

2 Create a Java function that reads the JSON file with Spark

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJSON {

    public void jsonToDataframe() {

        // Create a Spark session
        SparkSession spark = SparkSession.builder()
                .appName("JSON to Dataframe")
                .master("local")
                .getOrCreate();

        // Read the JSON file into a Dataframe.
        // multiLine is needed because each record spans several lines.
        Dataset<Row> df = spark.read().format("json")
                .option("multiLine", true)
                .load("yamicode.json");

        // Show the first 10 rows
        df.show(10);

        // Print the inferred schema
        df.printSchema();

        // Release the session's resources
        spark.stop();
    }

}
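Note that the date values in the sample file use the dd/MM/yyyy pattern, so Spark infers that column as a string (see the schema below). If you want a real date column, that same pattern string can, as far as the Spark SQL API goes, be passed to functions.to_date. As a quick standalone sanity check of the pattern (plain java.time, no Spark needed):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateCheck {

    public static void main(String[] args) {
        // Same pattern as the "date" values in yamicode.json
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy");

        // Parse one sample value from the file
        LocalDate d = LocalDate.parse("22/09/2019", fmt);

        System.out.println(d); // prints 2019-09-22
    }
}
```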

3 Result

+----------+---+---------+------+-------------+
|      date| id|     name|number|subCategories|
+----------+---+---------+------+-------------+
|22/09/2019|  1|product 1|    50|  [[3, sub1]]|
|21/08/2020|  2|product 2|    25|  [[4, sub2]]|
+----------+---+---------+------+-------------+

root
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- number: long (nullable = true)
 |-- subCategories: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
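The subCategories column is an array of structs. In Spark you would typically flatten it with org.apache.spark.sql.functions.explode, which emits one output row per array element, repeating the parent columns alongside each element. As an illustration only (plain Java with hypothetical names, no Spark dependency), the row expansion explode performs looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class ExplodeSketch {

    // A parent row with a nested list, mirroring the subCategories column
    public static class Product {
        final long id;
        final String name;
        final List<String> subCategories;

        public Product(long id, String name, List<String> subCategories) {
            this.id = id;
            this.name = name;
            this.subCategories = subCategories;
        }
    }

    // Like Spark's explode: one output row per array element,
    // with the parent columns repeated next to each element
    public static List<String> explode(List<Product> rows) {
        List<String> out = new ArrayList<>();
        for (Product p : rows) {
            for (String sub : p.subCategories) {
                out.add(p.id + "," + p.name + "," + sub);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Product> rows = List.of(
                new Product(1, "product 1", List.of("sub1")),
                new Product(2, "product 2", List.of("sub2")));
        explode(rows).forEach(System.out::println);
        // prints:
        // 1,product 1,sub1
        // 2,product 2,sub2
    }
}
```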