CHAPTER 1
Introduction:
In today’s data-driven world, a cloud-based and cloud-managed platform like Databricks gives data analysts, data engineers, and data science enthusiasts a powerful workspace to practice Apache Spark, run notebooks, and build ETL (Extract, Transform, Load) workflows.
What is Databricks then?
Databricks is a platform that enables enterprises to quickly build a data Lakehouse infrastructure and lets data enthusiasts, BI personnel, and data scientists run ETL workflows and deliver insights from their data. Unlike many other cloud platforms, Databricks supports the Data Lakehouse architecture, which combines features of data lakes and data warehouses, making it notably reliable and efficient. In this beginner-friendly blog, we’ll focus on the services provided by the Databricks Free Edition.
Databricks Free Edition:
Databricks Free Edition is a free version of Databricks that lets learners explore the Databricks UI and perform all the basic and semi-advanced operations.
We can explore the following things:
Apache Spark capabilities
DataFrames and transformations
Notebook-based development
Collaborative data workflows
SQL, Python, Scala, and R environments
LET’S NOW EXPLORE THE DATABRICKS FREE EDITION:
To get started, head over to: https://www.databricks.com/learn/free-edition

Fig (1)
This is how the signup page for the Free Edition looks. Click “Sign up for free” and sign up with your email ID.

Fig (2)
The above page is the landing page of Databricks.
Let’s discuss the Databricks sidebar (navigation panel) shown in the picture.
1. Workspace – This is where your personal folders, notebooks, and scripts are stored. It’s like your working area inside Databricks.
2. Catalog – Shows databases, tables, and views. Useful for browsing and managing structured data.
3. Compute – Lets you manage clusters (computing power) used to run notebooks and jobs.
4. SQL Editor – A place to write and run SQL queries directly on your data.
5. Dashboards – Helps you create and view visual dashboards based on your query results.
6. Data Ingestion – A tool to help you upload or import data into Databricks from various sources.
7. Workspace Dropdown – Lets you switch between different views like Workspace, Repos, or other folder locations.

Fig (3)
This is the area where you can see the Catalog. The Catalog is designed for tasks like loading data and working with tables. Here we can browse the tables available in the current workspace/database and use that data to perform various transformations.
Some of the components of the Catalog are:
Default – The default database, the standard location where basic tables are stored.
Samples – Preloaded datasets that you can explore to practice beginner transformations without uploading your own files.

Fig (4)
Compute – Compute is the engine that runs the code written in the Workspace; it is also referred to as a cluster. A cluster in Databricks is a group of virtual machines (cloud servers) that:
1. Executes PySpark, SQL, or Python code
2. Loads and transforms large datasets
3. Trains machine learning models
4. Runs notebooks and jobs
To put it simply, here’s an analogy: imagine Databricks is a car dashboard (the notebook) and you’re the driver (the user).
You can press the accelerator (run code), but nothing happens unless the engine (compute) is on.
Because this is the Free Edition, there is only one serverless compute (SQL warehouse), and the limit is one at a time; you can add a new one only after deleting the existing one. Databricks provides a ready-to-use SQL engine so that you can start querying data right away.
As an analogy: a student walks into a classroom without a notebook or laptop, and the teacher hands them a preconfigured laptop with everything already installed.

Fig (5)
Data ingestion simply means bringing data into Databricks so that you can analyze it. Some components of Data Ingestion are:
1. Add Data – The section where you load data into Databricks and start working on it.
2. Salesforce – Bring in customer or sales data directly from Salesforce.
3. Workday Reports – Import employee or HR data.
4. ServiceNow – Connect IT operations or helpdesk logs.
Then the middle Files section helps you:
1. Create or Modify Table – Upload Excel/CSV files to turn them into a table.
2. Upload Files to a Volume – Add any file format (text, JSON, etc.) for advanced use.
3. Create Table from Amazon S3 – Import files from Amazon’s cloud storage if your data lives there.
Then the bottom section helps you connect to cloud storage or third-party tools.

Fig (6)
The Workspace enables users to create multiple notebooks of their own, so as to write and run code. Here is a step-by-step guide for using a notebook.
On the left-hand sidebar, click on Workspace → then expand the workspace folder (you may see options like “Shared”, “Repos”, “Users”).
In the top-right corner of the page, click the blue Create button. You will see a dropdown list.
From the dropdown list, click Folder; a pop-up will ask you to name the folder. After giving it a name, click Create.
Then, inside the new folder, click Create again and choose Notebook from the dropdown list. You will be taken to the notebook view; rename the notebook, and then start loading and transforming your dataset.

Fig (7)
The above picture gives an idea of the notebook view.
Within the cell you can start writing your code. By default the notebook language is Python, but you can change it to the language of your choice, i.e., Python, Scala, SQL, or R.
You can also switch to the “Notebook and Dashboard” view, where you can create dashboards/visualizations for the transformations you have performed on the dataset.

Fig (8)
To execute the code in a cell, you must first choose a compute. Clicking Connect gives you the option of a serverless SQL warehouse for SQL queries and serverless compute for running Python code.
That was a brief overview of the Databricks Free Edition UI and the basic spaces we need to operate in.
CHAPTER 2
Let’s demonstrate a small project using a .csv file to understand how Databricks works.
To practically understand how the Databricks platform is used, let’s walk through a small healthcare project using a .csv file. We’ll use Python + Apache Spark, commonly known as PySpark, to load, process, and analyze the data inside a Databricks notebook.
This project shows how to:
- Read .csv files using PySpark
- Clean and transform healthcare data
- Join multiple datasets (patients, conditions, encounters)
- Perform real-world analytics
- Save the result using Delta format
Here is the sample description of the datasets we manually uploaded to the volume using the Data Ingestion as per the fig (5).
1. patients.csv
This file contains demographic and personal details of each patient.
It includes the following information:
- ID: Unique identifier for each patient
- BIRTHDATE: The patient’s date of birth
- GENDER: Gender of the patient (Male or Female)
- ADDRESS, CITY, STATE: Location details
- PASSPORT, DRIVERS, ZIP, MARITAL: Additional identity or demographic information
- DEATHDATE: Date of death, if applicable
This dataset gives us insights into who the patients are and allows grouping by age, gender, and location.
2. encounters.csv
This file records each visit or interaction a patient has with a hospital or clinic.
Key columns include:
- ID: Unique identifier for each encounter
- PATIENT_ID: Reference to the patient who had the encounter
- ENCOUNTERCLASS: Type of encounter (e.g., inpatient, outpatient)
- START, STOP: Start and end dates of the visit
- PAYER, PAYER_COVERAGE: Insurance provider and their coverage amount
- TOTAL_CLAIM_COST: Total cost charged for the visit
- BASE_ENCOUNTER_COST: Base cost before insurance or adjustments
This dataset helps track patient visits, costs, and insurance activity.
3. conditions.csv
This file contains diagnosed health conditions for each patient.
It includes:
- ID: Unique ID for each condition record
- PATIENT_ID: Reference to the patient diagnosed
- DESCRIPTION: Name of the diagnosed condition (e.g., Hypertension, Asthma)
- START, STOP: Dates when the condition started and ended
- CODE: Medical code (e.g., ICD) representing the condition
The above are the descriptions of the datasets we will be using in our project.
Step 1. Loading the CSV files into Databricks using PySpark:
Before we load the data, we have to import the necessary PySpark functions and types:

After successfully importing, it’s now time to load the data.

The option (“header”, True) ensures that the first row is treated as column headers.

The above is a preview of the loaded datasets.
Step 2. Data Cleaning and Preprocessing:
Handling Missing values (NULL)-

Columns that are mostly null or less relevant should be dropped.

Columns dropped
If we find nulls in important columns like PASSPORT, we can replace them with a default value:

Renaming Columns for Consistency

Step 3. Joining Datasets for deeper insights-
Now that our data is cleaned and standardized, we can start joining the datasets to uncover deeper relationships between patients, their medical conditions, and their encounter history.
We will create three main joined DataFrames:
- df_joined_oep → Encounters + Patients
- df_joined_ec → Encounters + Conditions
- df_joined_cp → Conditions + Patients

The above are the join operations performed on the datasets.
These joined datasets now allow us to perform powerful groupings, aggregations, and even readmission analysis.
Step 4. Calculating Age and Grouping Patients-
Age is an important factor in healthcare analytics. Databricks makes it easy to calculate age from the BIRTHDATE column using PySpark functions. Once we calculate the age, we can group patients into meaningful age ranges for deeper analysis.


Group age by range:

The above program will add a column named AGE_GROUP to df_joined_oep.

Preview of the above code
Grouping Patient Visits by AGE_GROUP

We can also add a visualization by clicking the ‘+’ symbol in the output console and choosing the kind of visualization you want.

Similar analysis and exploration can be performed according to your requirements.
After completing the required EDA (Exploratory Data Analysis), the final step is to save the data files in Delta format.
Step 5. Saving the Final DataFrames in Delta Format
After cleaning, joining, and analyzing the data, it’s good practice to save the final datasets. Databricks supports many formats, but Delta is preferred because it’s fast, reliable, and supports versioning and ACID transactions.

Before saving, we make sure each DataFrame has no duplicate rows using dropDuplicates().
This prevents storage bloat and ensures data accuracy.
.write initiates the saving process on the DataFrame.
.format("delta") tells Spark to save the data using the Delta Lake format, a highly efficient and reliable format built by Databricks.
.mode("overwrite") allows the system to replace existing data at that location, if it exists.
.save("<path>") defines the storage location in Databricks’ file system (e.g., /Volumes/dpk/…) where the Delta files will be stored.
AS SHOWN IN FIG (7), WE CAN ALSO BUILD A DASHBOARD VIEW IN THE FREE EDITION
We can build the dashboard by importing the graphs and tables from the notebook.
Sample is as shown below: 
Chapter 3
CONCLUSION:
In this blog, we explored how to use the Databricks Free Edition along with Python and Apache Spark (PySpark) to build a real-world data analysis project using healthcare data in .csv format.
Databricks supports many other data formats too, including:
- JSON – for semi-structured records
- Parquet – for efficient columnar storage
- Avro, ORC, Delta, and even database connections
- Integration with Azure, AWS S3, Google Cloud Storage, and more
All of this is available in a free cloud notebook environment — with no local setup or configuration required.
This was just one example of what’s possible. Databricks offers much more: machine learning, SQL analytics, streaming data, Unity Catalog, and powerful collaboration features.
Machine Learning – Databricks can build and train models to make predictions from data (like predicting disease risk or customer churn).
SQL Analytics – Databricks can run SQL queries to analyze big data and even build dashboards, much like using Excel or a database.
Streaming Data – Databricks can process real-time data as it flows in (like live sensor data).
Unity Catalog – Helps organize and secure data, like folders and permissions for big datasets.
There’s still a lot more to explore in Databricks. This is just the beginning.