The Microsoft Fabric Learn content and other community resources have been meaningful in my journey from Power BI into the field of data engineering. This post is a collection of my go-to learning resources and notes from that journey so far.
I’m specifically intrigued by learning PySpark in Fabric Notebooks. PySpark is the open-source Python API for Apache Spark, allowing us to use Python for big data processing.
I’m reading PySpark books alongside ChatGPT to help speed up my learning of core concepts. I use AI not as a shortcut, but as a study partner to help reinforce new concepts against credible sources.
Why PySpark Notebooks?
Notebooks allow for a logical and structured approach to data cleansing and transformation. My attraction to PySpark comes from its user-friendly interface in Fabric. Rather than writing code line by line, I break code up logically between data loading, exploration, processing, and finally visualization, all within organized and well-documented blocks of code. More specifically, notebooks allow for:
- Experimenting: Breaking code into isolated steps called cells makes debugging clearer (similar to DAX variables) and gives me a path to refining performance. Within each code cell I can choose to read a file, then add, change, merge, or remove columns and rows as needed. As I progress cell by cell, I can check my result midway through the overall transformation. Other comforting and important steps include validating my table schema (column data types) and previewing a sample set of rows before moving on to the next cell in my transformation.
- Markdown Cells: Choosing this option over a code cell allows me to use cells to create title headers, notes, or as a safe spot to store other versions of code I would normally need to comment out. Markdown cells help in making workflow more descriptive and narrative-like.
- Visualizations: Notebooks support creating charts, plots, and graphs, making data analysis visual and intuitive. Complex data transformations or patterns can be more easily understood when visualized with Python libraries such as Matplotlib, Seaborn, or Plotly. This allows me to combine the power of PySpark data processing with the visualization capabilities of these libraries.
- Multilanguage Support: Notebooks support multiple languages including Python, Scala, R, and SQL which can be beneficial when working with PySpark. I am sticking with PySpark for now, but Fabric makes it easy to switch between default languages in the overall UI, or explicitly within a specific code cell.
- Collaboration: Notebooks in Fabric support collaborative editing allowing developers to share code, and work together in real-time. Peer learning is a powerful tool, and notebooks facilitate this.
- Reproducibility: Notebooks can be saved, shared, rerun, and exported ensuring reproducibility. This means I can revisit my work online or share it with others for feedback.
In essence, PySpark notebooks offer an environment that caters to different learning styles, be it hands-on experimentation, visual interpretation, collaborative learning, or documentation.
As for overall structure we’ve already got a strong head start as BI developers:
- Data Structures: PySpark operates over DataFrames. Think of a DataFrame as a structured table in Excel, with rows and columns. The efficiency of DataFrames comes from being able to filter and evaluate over whole columns of data (versus being limited to row-by-row evaluation), and as a structured table they should feel familiar to any BI developer. DataFrames also let us perform familiar operations like filters, joins, and aggregations. More on DataFrames and their row-processing alternative below.
- SQL Friendly: PySpark supports SQL! You can run SQL queries directly on DataFrames using the spark.sql() function. So, all that SQL knowledge is still very relevant.
One key advantage of using notebooks is the sheer computing power behind them:
- DataFrames: As previously mentioned, DataFrames are faster and more performant than the more granular structure known as Resilient Distributed Datasets (RDDs). DataFrames evaluate over structured rows and columns, much like the structured tables we know and love, while RDDs evaluate row by row. RDDs are great if you need that lower-level, row-by-row control, but this often comes with a performance cost.
- Distributed Computing: Unlike traditional databases, Spark operates on a distributed compute system. This is good news when processing large datasets, as the work can be split across multiple compute nodes running in the background. While this ensures speed and scalability, it also introduces nuances like data shuffling. Shuffling is the extra compute step required when data must be re-sorted or re-grouped across nodes to achieve a result; for example, joining two large DataFrames requires shuffling so that matching keys end up on the same node. When one of the tables is small enough to fit in memory, Spark can instead broadcast a copy of it to every node and skip the shuffle, so it's generally more efficient to join against small lookup tables where possible.
Other Core Concepts:
The way we interact with and manipulate our data structures is fundamentally categorized into Transformations and Actions. These are critical concepts when working with PySpark:
Transformations: Operations applied to an RDD or DataFrame that produce a new RDD or DataFrame. They are not executed immediately but are instead recorded (this is known as “lazy evaluation”), which allows Spark to optimize the full sequence of transformations before any action is executed.
In PySpark, you can perform various transformations on DataFrames to manipulate and process data. Here are some common examples of transformations:
- Select: You can select specific columns from a DataFrame.
- Filter: You can filter rows based on a condition.
df.filter(df["column_name"] > 10)
- GroupBy and Aggregation: You can group data by one or more columns and apply aggregate functions.
- Sort: You can sort the DataFrame based on one or more columns.
- Join: You can join two DataFrames based on a common column.
- Drop: You can drop columns from the DataFrame.
- WithColumn: You can add or replace columns with new values or transformations.
df.withColumn("new_column", df["column_name"] * 2)
- Alias: You can rename columns.
- Pivot: You can pivot data to reshape it.
- Window Functions: You can perform window functions like ranking and aggregation over specified windows of data.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("category").orderBy("value")
df.withColumn("rank", row_number().over(window_spec))
These are just a few examples of the transformations you can perform with PySpark.
Actions: Operations that produce a result. When an action is called, Spark executes the recorded transformations on the RDD or DataFrame; this is when the execution actually takes place and data is processed.
Here are some common actions performed on PySpark DataFrames:
- show(): Displays the top rows of the DataFrame in a tabular format.
- count(): Returns the number of rows in the DataFrame.
- collect(): Retrieves all the data from the DataFrame to the driver program. Use with caution for large datasets, as it can cause out-of-memory errors.
- first(): Returns the first row of the DataFrame as a Row object.
- take(n): Returns the first ‘n’ rows of the DataFrame as a list.
- sum(), mean(), min(), max(): Compute aggregate statistics on numerical columns, typically through agg() or after a groupBy().
- groupBy() and aggregation functions: Perform grouping and aggregation operations.
- describe(): Generates summary statistics of numerical columns.
- printSchema(): Displays the schema of the DataFrame.
- write: Actions for saving DataFrames to external storage, such as write.csv().
- foreach(): Allows you to apply a function to each row of the DataFrame. Useful for custom processing.
df.foreach(lambda row: custom_function(row))
Remember that actions trigger the execution of the entire PySpark job, so they should be used with care, especially on large datasets, as they can be resource-intensive.
In essence, transformations build up a logical execution plan, and actions kick off the actual computation or result. Understanding the distinction is fundamental to efficiently using Spark, as it allows for optimization opportunities and defers expensive operations until absolutely necessary.
Lazy Evaluation: PySpark benefits from lazy evaluation in that transformation steps are not computed until an action is called. I relate this to lazy evaluation in DAX: measure variables that don’t contribute to the result (the action) are never processed, which improves performance.
Schema Management: PySpark offers flexibility with dynamic schema inference (column types), similar to how Power Query inserts its own “helpful” Changed Type step. It is easy to depend on this automatic detection; however, as I progress in my PySpark learning, I’m building the habit of defining explicit table schemas upfront to help maintain data integrity and improve performance.
As I get comfortable with writing PySpark syntax, I’m narrowing in on frequently used key functions to transform my data. PySpark offers an arsenal of functions; some of the most common are listed below. Similar to the step approach in Power Query M code, each transformation step in PySpark assigns the new state of the DataFrame to a name, i.e. df = &lt;function&gt;, where df is the conventional shorthand for DataFrame.
- Filtering out unwanted rows:
df = df.filter(df["column_name"] > 100)
# or
df = df.where(df["column_name"] > 100)
- Removing unwanted columns:
df = df.drop("unwanted_column")
- Appending two tables together:
df_combined = df1.union(df2)
- Left joining a column from table 2 into table 1:
df_joined = df1.join(df2, on="common_column", how="left")
- Removing duplicate records:
df_no_duplicates = df.dropDuplicates()
- Adding a new conditional column:
Using withColumn() in combination with when():
from pyspark.sql.functions import when

df = df.withColumn("new_column", when(df["column_name"] > 100, "High").otherwise("Low"))
- Modifying an existing column:
df = df.withColumn("existing_column", df["existing_column"] * 2)
As I continue reading, I pick up new methods and functions allowing me to create powerful transformations and manipulations on DataFrames.
Stepping into the world of data engineering with Fabric notebooks and PySpark has felt like moving from the shallow end into the deep end. Thankfully the skills and concepts learned in Power BI along with community resources have helped me move there gracefully. Thank you for reading and being a part of this great community!