I have been looking for a way to flatten nested JSON files for use with Power BI for some time. Thanks to Microsoft Fabric this is now within reach for the citizen developer.
Prior to Fabric I headed down the Azure Synapse path and stalled due to the complexity of the setup. Hope was renewed once I jumped into Fabric and started learning bits of PySpark used in the Microsoft Fabric Learn content.
I’m now progressively learning Spark by reading Spark – The Definitive Guide and leveraging ChatGPT as my study partner. I can’t stress enough how beneficial this combo has been. Between the help received from Guy in a Cube and my book/chat learning, I am comfortable enough with Spark to flatten a sample nested JSON using two methods: the first using Pandas, and the second using straight Spark code. Both have their own advantages and use cases, and both are shared in this post.
To be clear, this post is not about my own discovery, but rather a shout out for Guy in a Cube for helping me gain momentum and move in the right direction.
Watch the Guy in a Cube Live Stream segment below (at the 44-minute mark) that addresses this and other great insights into the automations now available for the citizen developer… including Data Wrangler!
Below is a basic step-by-step walkthrough of what I learned. If you would like to download the entire notebook containing the code: Click here.
Here we go!

Access the live stream and advance to 44:30 to watch the segment.
Sample JSON
1. Copy the orders text below. Paste the text into Notepad and save the file as orders.json.
[{
"order_id": "10001",
"customer_num": "101001",
"trans_date": "2021-10-27 05:51:01",
"order_type": "CASH SALE",
"total_price": "47.50",
"items": [{
"line_item_id": "1",
"sku": "101",
"title": "item_description1",
"quantity": "2",
"unit_price": "15.50"
},
{
"line_item_id": "2",
"sku": "102",
"title": "item_descriptoin2",
"quantity": "3",
"unit_price": "5.50"
}
]
}, {
"order_id": "10002",
"customer_num": "101005",
"trans_date": "2021-10-28 06:51:01",
"order_type": "AR SALE",
"total_price": "75.50",
"items": [{
"line_item_id": "1",
"sku": "101",
"title": "item_description1",
"quantity": "2",
"unit_price": "12.50"
},
{
"line_item_id": "2",
"sku": "106",
"title": "item_descriptoin2",
"quantity": "1",
"unit_price": "50.50"
}
]
}]
Fabric Setup
2. Enable the Fabric preview in your tenant settings.
3. Create a new workspace. I named my workspace ‘FabricJSON’.

4. Choose the Data Engineering experience (bottom left).

5. Create and name your new Lakehouse. I named my new Lakehouse ‘FABRIC_LH’.

6. Upload the orders.json file into the Files directory of your new Lakehouse.



7. Refresh your Files folder using the (…) menu. If you do not see the ‘orders.json’ file after clicking on the Files folder, try refreshing your browser.
Create a Notebook and Use Pandas
8. Within the Data Engineering experience, create a new Notebook.

9. When the Notebook opens, you’ll see a default code cell containing the following #Welcome text.

10. Replace the default code with Pandas code. Pandas is a Python library that provides high-performance data manipulation and analysis, with user-friendly syntax and a rich set of functions.
Replace the default #Welcome text in the code cell with the code below, then hit the play/run button to view the result within the notebook. From here on out, just run each new cell to see the results as we progress through creating tables.
This first block of code imports the Pandas library, opens our orders file, and applies the ‘json_normalize’ function over each detailed item sold. It is json_normalize that standardizes and transforms our nested JSON into a flattened result, which we then convert into a Spark DataFrame.
# The orders.json file was uploaded to the Files section of the Lakehouse in Fabric.
# Here the file is opened, parsed, flattened with pandas, and converted to a Spark DataFrame.
import pandas as pd
import json

# Open and parse the JSON file from the default Lakehouse
with open("/lakehouse/default/Files/orders.json") as json_data:
    MyData = json.load(json_data)

# Flatten each nested item, carrying the order header columns along,
# then convert the pandas result into a Spark DataFrame
parsed_json = spark.createDataFrame(
    pd.json_normalize(
        MyData, "items",
        ["order_id", "customer_num", "trans_date", "order_type", "total_price"]
    )
)

# Display results
display(parsed_json.limit(10))

11. Next, add a new cell below the last one by hovering just below the table results and clicking the plus (+) symbol.

12. Enter the block of code below to create a table built directly off of our flattened DataFrame from above (parsed_json). This cell produces an item details table that also includes the order_id column on the far right as a reference key. Once the code is entered in the cell, click the run (play) icon to process it.
# ITEMS FACT TABLE (where the order_id is brought in from the header)
# The DataFrame (parsed_json) from above is used to get a subset of item columns.
items = parsed_json[['line_item_id', 'sku', 'title', 'quantity', 'unit_price', 'order_id']]

# Display the item rows selected from parsed_json
display(items)

# Write the result to the Tables folder as a Delta table
items.write.mode("overwrite").format("delta").save("Tables/ItemsFact")
Notice the ‘format("delta")’ in the write statement? There is no need for a file extension on "Tables/ItemsFact" in that same line. It is worth noting that any format other than Delta will isolate the written file into a sub-folder named "Unidentified". This is a good reminder that OneLake only supports Delta (Parquet-based) tables in the Tables folder. The Tables folder is where your tables are made available for further modeling.
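As an alternative to writing to an explicit Tables path, Spark can also register the Delta table by name with saveAsTable. Here is a minimal sketch, assuming the default Lakehouse is attached to the notebook (the table name ItemsFact simply mirrors the path used above):
# A minimal sketch (assumes the default Lakehouse is attached to this notebook).
# saveAsTable registers the Delta table by name instead of writing to an explicit Tables/ path.
items.write.mode("overwrite").format("delta").saveAsTable("ItemsFact")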
13. Enter the following code, which creates the second of our two tables: the order header table. It also references our original flattened DataFrame (parsed_json) and includes an additional step that drops duplicate records, leaving us with a unique listing of order header information.
# ORDERS HEADER DIMENSION: the DataFrame (parsed_json) from above is used to get a subset of columns for orders.
orders_with_dups = parsed_json[['order_id', 'customer_num', 'trans_date', 'order_type', 'total_price']]

# Drop duplicate rows so each order header appears only once
orders = orders_with_dups.drop_duplicates()

# Display the de-duplicated orders
display(orders)

# Write the result to the Tables folder as a Delta table
# mergeSchema lets new columns in the incoming data be merged into the existing table schema
orders.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("Tables/OrdersDim")
Spark (Non-Pandas) Code
Pandas is flexible for manipulating data using friendly library functions. However, Pandas is limited to processing data in memory on a single machine or node. When working with larger datasets, it might be more appropriate to write straight Spark and leverage the power of distributed processing over a cluster of compute nodes all working in parallel.
As I continued learning, I was eventually able to write Spark (non-Pandas) code that accomplishes the same ‘flattened’ result originally achieved using the Pandas library.
14. Enter the Spark code below into a new code cell in your notebook. When you hit run, you should see the same flattened order table that was produced the first time using Pandas.
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Open the file and read its contents.
# 'r' opens the file in read mode.
# 'with ... as file' names the file object and ensures Python closes it automatically on completion.
with open("/lakehouse/default/Files/orders.json", 'r') as file:
    data = file.read()

# Parse the JSON text into a Python structure (a list of dictionaries)
data_list = json.loads(data)

# Create a Spark DataFrame from the parsed JSON data
df = spark.createDataFrame(data_list)

# Explode the items array into separate records.
# 'col' creates a column object.
# Dot syntax (item.sku, etc.) references the fields nested within each exploded item.
df_exploded = df.select(
    col("order_id"),
    col("customer_num"),
    col("trans_date"),
    col("order_type"),
    col("total_price"),
    explode(col("items")).alias("item")
).select(
    col("order_id"),
    col("customer_num"),
    col("item.line_item_id").alias("LineNum"),
    col("item.sku").alias("SKU"),
    col("item.title").alias("Title"),
    col("item.quantity").alias("QTY"),
    col("item.unit_price").alias("UNITPRC")
)

# Display the resulting DataFrame
display(df_exploded)

# Write the result to the Tables folder as a Delta table
df_exploded.write.mode("overwrite").format("delta").save("Tables/FlatSparkVersion")
15. Now that we have run our Pandas and Spark code, move up to the Tables folder. Using the (…) menu to the right of Tables, refresh the folder. When you expand it, you should see the three tables we created and wrote to the Tables folder. Our data is now transformed into flattened tables ready for further modeling!
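If you would like to double-check a table’s contents from the notebook itself, you can read it back in a new cell. Here is a minimal sketch that loads the OrdersDim table from the same relative path we wrote it to:
# A minimal sketch: read one of the Delta tables back from the Tables folder to verify it
orders_check = spark.read.format("delta").load("Tables/OrdersDim")
display(orders_check)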

16. As a final step, let’s rename our notebook to ‘JSON Notebook’.
From the workspace, use the (…) menu to choose Settings for the notebook we just created.

In Settings, replace the Name field with ‘JSON Notebook’. Your notebook is now saved with a descriptive name.

More on Pandas vs. Spark from ChatGPT
When deciding between using Pandas and writing Spark, ChatGPT provided me with these points to consider.
Spark code is more suitable for working with large datasets compared to using pandas for a few reasons:
- Distributed Computing: Spark is designed to handle large-scale data processing by distributing the computation across a cluster of machines. It can efficiently handle data that exceeds the memory capacity of a single machine, making it well-suited for big data scenarios.
- Scalability: Spark can scale horizontally by adding more machines to the cluster, enabling it to process larger datasets as needed. Pandas, on the other hand, is primarily designed for single-machine processing and may encounter memory limitations when dealing with large datasets.
- Performance Optimization: Spark’s underlying execution engine optimizes the execution plan based on the distributed nature of the data. It performs lazy evaluation and applies various optimization techniques, such as partitioning and data locality awareness, to minimize data shuffling and optimize processing speed. This can lead to faster processing times for large datasets compared to pandas (see the sketch after this list).
- Fault Tolerance: Spark provides built-in fault tolerance mechanisms, allowing it to recover from failures and continue processing without losing data. This is particularly crucial when working with large datasets, as failures are more likely to occur during long-running computations.
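To make the lazy evaluation point concrete, here is a minimal sketch that reuses the df_exploded DataFrame and the col function imported in the earlier Spark cell; the filter only builds a plan, and nothing runs until an action such as count() is called (the SKU value is just an example from our sample data):
# A minimal sketch of lazy evaluation, using df_exploded and col from the earlier Spark cell
filtered = df_exploded.filter(col("SKU") == "101")  # transformation: builds a plan, nothing runs yet
print(filtered.count())                             # action: triggers the actual distributed job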
However, it’s important to note that the choice between Spark and pandas depends on many factors, including the specific requirements of your use case and the available resources. If your dataset fits comfortably in memory and the processing requirements are not too demanding, pandas may provide a simpler and more convenient solution. Spark’s overhead and complexity may not be necessary for smaller datasets or quick exploratory analysis tasks.
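For example, when the data is small enough, it is easy to hop from Spark back into pandas for quick exploration. A minimal sketch, assuming the small df_exploded DataFrame from earlier in the notebook:
# A minimal sketch: collect a small Spark DataFrame to the driver as a pandas DataFrame
small_pdf = df_exploded.toPandas()  # fine for small data; everything must fit in driver memory
print(small_pdf.head())             # quick pandas-style exploration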
Thank You
Thank you for reading! If you are a Power BI citizen developer like me, Fabric is a great opportunity to up-skill in an environment you are familiar with. ChatGPT and the new automations in the public preview have made learning very efficient and meaningful for me. It was just what I needed to build momentum and commit to learning a new language. Last but not least, a big thank you to Patrick Leblanc, Adam Saxton, Alex Powers, and Austin Henley for providing community content for this amazing platform.