To create a user-friendly dashboard in Databricks, we first need to bring the dataset from its source (e.g., Oracle, SQL Server, S3, or Azure) into the Databricks environment.
In this blog, I’ll walk through connecting a healthcare dataset from an Oracle database to Databricks, then organizing it inside catalogs and schemas, and preparing it for dashboard building—all step by step with screenshots.
Let’s start exploring a new path of Databricks.

Fig (1)
In Fig (1), click Catalog in the left-hand sidebar; this shows all the catalogs, schemas, and tables available in the workspace. At the top of the Catalog pane, click the (+) icon and select “Create a connection” from the dropdown menu.

Fig (2)
After selecting Create a connection, a guided Set up connection wizard opens Fig (2). Step 1 is “Connection basics”, where we enter the required connection details:
- Connection name – a unique name for the connection.
- Connection type – select the connection type from the dropdown list.
After the details are entered, click Next.
Fig (3)
After completing the basic connection details, the next step in the setup is “Authentication” Fig (3).
This is where we will provide Databricks with the information it needs to securely connect to the database.
The fields include:
- Host, port, and user
- Authentication password and Encryption protocol
Once all the authentication details are filled in, click “Create Connection”. This saves the connection and allows Databricks to communicate with the database.

Fig (4)
After successfully setting up the connection and authentication, the next step is to define how Databricks will organize and expose the data internally. This is done in the “Catalog Basics” section Fig (4).
Here’s what each field means:
- Catalog Name – the name of the catalog that will appear inside Databricks to represent the external database system.
- Connection – this field is automatically filled based on the connection we created in Step 1.
- Service Name – identifies the database that Databricks should connect to.
- Validation – Databricks provides a built-in option to verify that the connection and service details are correct. Clicking “Test connection” makes Databricks attempt to connect to the database service using the credentials and host information provided earlier. If the validation succeeds, as shown in Fig (5) and Fig (6), we know the setup is working properly.

Fig (5)

Fig (6)
After the test connection is successfully established, click Close and move on.

Fig (7)
Once all the fields in Catalog basics are filled in and the connection is validated, click Create Catalog Fig (4). A notification confirms the successful creation of the catalog Fig (7).

Fig (8)
The next step is the “Access” section Fig (8), where we control who can access the catalog and workspace and what permissions they will have. The permissions are as follows:
- Which workspaces can use the catalog.
- Who owns the catalog.
- Which users or groups can use this catalog. By default, all account users are granted access.

Fig (9)
Now comes the last step of the setup, the “Metadata” section Fig (9). Here we can add extra information to our catalog so it becomes easier to discover, search, classify, and govern. This step is optional but ensures better organization and governance.
This includes:
- Tags – help teams quickly find the right datasets and enable policy-based access control.
- Key: choose from the predefined governed tags (created by an admin).
- Value: pick the value for that tag.
- Multiple tags can be added.
If no tags are configured, this list may be empty. This means:
- The workspace does not currently have any predefined or governed tags set up.
- Because of this, the dropdown that normally shows available tag keys and values is blank.
- Tags must first be created (usually by admins in the Governance → Tags section) before they appear here.
After all the details are filled in, click Save. Our catalog is now successfully created and ready to be used in Databricks.

Fig (10)
Once the catalog is created and the connection to the database has been validated, the next step is to query the healthcare dataset directly from Databricks.
Note: Databricks applies a default display limit of 51 schemas, so to work around it we use the SQL Editor. The same can also be done using PySpark.
To do so, open the SQL Editor from the left-hand sidebar as shown in Fig (10). In the left pane we can view all the catalogs available in Databricks, including the new catalog (i.e., hcd_Databricks_catalog).
Inside this new catalog, Databricks exposes the schemas and tables just like normal Databricks objects.
- The healthcare dataset is now available in Databricks’ ‘hcd_Databricks_catalog’.
- We can directly query, preview, and analyze the healthcare dataset here.
- This makes it easy to understand the dataset before transforming it into a new catalog where only the tables required for the dashboard are stored.
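For example, a quick exploratory query against the new catalog might look like the sketch below. The schema and table names here are assumptions for illustration; substitute the ones exposed in your own catalog.

```sql
-- Preview the first rows of a table in the new foreign catalog
-- (schema and table names are illustrative placeholders)
SELECT *
FROM hcd_databricks_catalog.hcd_schema.patients
LIMIT 10;

-- A quick row count to sanity-check what was exposed
SELECT COUNT(*) AS total_rows
FROM hcd_databricks_catalog.hcd_schema.patients;
```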

Fig (11)
After setting up the connection and creating the catalog, the next step is to load the required tables by reading from the source database and writing the data into a Databricks Delta table, as shown in Fig (11).
Move to the Workspace in the left-hand sidebar, create a notebook, and write the code there.
The first line of code reads the table from the source database catalog inside Databricks.
The second line writes this data into a Delta table stored inside Databricks.
Note: reading data directly from the source database works, but it is not ideal for transformations or dashboards; it makes external queries slower and more costly and creates a dependency on the source database.
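Expressed in SQL, the same read-and-write step can be a single CREATE TABLE AS SELECT statement. This is only a sketch; the catalog, schema, and table names below are assumptions for illustration.

```sql
-- Read from the foreign (Oracle-backed) catalog and materialize the data
-- as a managed Delta table inside a Databricks catalog.
-- All names are illustrative placeholders; substitute your own.
CREATE OR REPLACE TABLE hcd_operation_base.default.patients
AS SELECT * FROM hcd_databricks_catalog.hcd_schema.patients;
```

Materializing the data this way means later transformations and dashboard queries run against the Delta copy instead of repeatedly hitting the source database.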
Once the write operation is complete, Databricks allows us to preview the new table. At this point, the table is fully ready to be used for:
- Data Modeling
- Cleaning and transformation
- Dashboard building
- Analytics and reporting

Fig (12)
We can also refer to another example, shown in Fig (12), similar to the one in Fig (11).
Here the data is read from the source database catalog stored inside Databricks, named ‘hcd_konark_oracle_db_catalog’, and then written into a Delta table inside another catalog, named ‘hcd_operation_base’.
Fig (13)
Until now we have covered all the prerequisites for creating a dashboard. Now let’s move forward to creating a user-friendly dashboard in Databricks. Click “Dashboards” in the left-hand sidebar, then click “Create Dashboard” in the top-right corner. A new blank page appears Fig (13).

Fig (14)
After navigating to the Dashboards workspace, we see a blank dashboard interface prompting us to add data. Databricks allows us to create datasets in multiple ways:
- Create from SQL – write our own SQL query to fetch the required data.
- Add data source – select an existing table from the catalog.
- Upload file – upload CSV, Excel, or other files.
Since my data was sourced from Oracle and already stored in a Databricks catalog, I used “Create from SQL” to build custom datasets Fig (14).

Fig (15)

Fig (16)
Fig (15) and Fig (16) show sample queries written for the healthcare dataset. Each metric in the dashboard was built from an independent SQL query created as a dataset. Once executed, the query displays a result table containing the values. Each dataset added in the left pane becomes a building block for visualizations later. This separates the different clinical KPIs and keeps the dashboard modular, which means we can easily add, remove, or modify individual visuals without affecting the entire dashboard. In this way, I created multiple datasets—Clinical Query 1 through Clinical Query 22—each representing a unique clinical performance metric.
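One such dataset query might compute a simple clinical KPI, for example average length of stay per department. This is a sketch, not the exact query from Fig (15) or Fig (16); the table and column names are assumptions.

```sql
-- Average length of stay per department (illustrative schema)
SELECT
  department,
  ROUND(AVG(DATEDIFF(discharge_date, admission_date)), 1) AS avg_length_of_stay_days
FROM hcd_operation_base.default.admissions
GROUP BY department
ORDER BY avg_length_of_stay_days DESC;
```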
DASHBOARD

Fig (17)

Fig (18)
In Fig (17) and Fig (18), after preparing all the datasets, I switched to the Dashboard view and created visualizations using the drag-and-drop interface. Once all the widgets were added, the dashboard displayed a complete clinical overview.
Now, let’s get started with the incremental load.
What is incremental load?
Imagine you have a big notebook where you write customer orders every day.
- Full load = you rewrite the whole notebook again every day.
- Incremental load = you add, update, or delete only the records changed since the last load, rather than reloading the entire data.
So technically, Incremental load is a process of updating a data warehouse where only new or modified records since the last load are extracted, transformed, and loaded into the target system.
Instead of processing the entire dataset again, it processes only the delta.
Here we will perform the incremental load dynamically by scheduling a job and adding tasks to it; the SQL scripts used apply the upsert (update and insert) method, which is crucial for incremental load.
So we have to write the SQL scripts first.

Fig (19)
Before creating the scripts, we should store them in an appropriate location where we can find them easily while creating tasks and jobs. So in Fig (19), under the workspace we created a folder named ‘HCD INCREMENTAL SCRIPTS’.

Fig (20)
Inside the folder we then created a file, saved with the .sql extension to make it a SQL script Fig (20).

Fig (21)
Fig (21) shows a sample SQL script using the upsert method.
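The upsert pattern in Databricks SQL is typically a MERGE INTO statement, as in the sketch below. The table names and the patient_id key are assumptions for illustration, not the exact script from Fig (21).

```sql
-- Upsert: update matching rows, insert new ones.
-- Because only changed or new rows are touched, repeated runs behave
-- as an incremental load rather than a full reload.
MERGE INTO hcd_operation_base.default.patients AS target
USING hcd_databricks_catalog.hcd_schema.patients AS source
ON target.patient_id = source.patient_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```

One script like this per table, saved in the scripts folder, becomes one task in the job we create next.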

Fig (22)
Next, go to the navigation bar and select Jobs & Pipelines, where you get the options Ingestion pipeline, ETL pipeline, and Job Fig (22). Since our work is only to update and insert new data into the tables, we proceed with creating a job.

Fig (23)
Fig (23) shows the UI of the job-creation process. To start the job, let’s begin by adding tasks.
Click ‘Add another task type’, since the SQL file type is not among the options shown above.

Fig (24)
In this menu Fig (24) we find the SQL file option, along with many other options that the user can select according to the requirement.

Fig (25)
In Fig (25), after clicking SQL file, we give the task a name; a point to remember is that task names must be unique within a job.
Then set the path by selecting Workspace → HCD INCREMENTAL LOAD → SQL scripts.
Finally, set the default SQL warehouse.

Fig (26)
In Fig (26) we can add notifications, sent on success, failure, etc., to a specified destination such as email.
We can also set how many times the task should be retried before the failure message is sent.

Fig (27)
Fig (27) shows the notification panel, where we can set the destination (email or any other available option) to get notified on start, success, failure, duration warning, and streaming backlog.

Fig (28)
To add another task to the job, click Add task just below the first task and again select SQL file, as shown in Fig (28).

Fig (29)
Similarly, fill in the next task with the same details as the previous one. Here comes the main step: the current task should depend on the previous one so that they are connected and run in sequence.
The ‘Run if dependencies’ condition then has to be chosen carefully from the options:
- All succeeded – all of the current task’s dependencies must have executed successfully.
- At least one succeeded – the task runs if at least one of the previous tasks succeeded; if all fail, the current task is not executed.
- None failed – the task runs if none of the dependencies failed and at least one executed and succeeded; skipped dependencies are fine.
- All done – the most flexible option: whether the previous tasks fail, succeed, or are skipped, the current one executes.
- At least one failed – the task runs only if at least one of the dependent tasks failed; if all succeed, it is not executed.
- All failed – the task runs only if all the dependent tasks fail; if even one succeeds, it is not executed.
In our case I chose ‘All done’, because I want all the tables to be refreshed irrespective of the status of the previous task Fig (29).

Fig (30)
After creating the tasks, we can see them grouped in the panel Fig (30). Now, to avoid the manual run time and effort, we schedule the job so it runs automatically.

Fig (31)
Inside “Schedules and Triggers” Fig (31), we set the trigger status to ‘Active’ and the trigger type to ‘Scheduled’, with a period of one day. We can still run the job manually at any time, but this sets up the automatic run sequence.

Fig (32)
Previously we set notifications for a specific task; here we set notifications for the whole job, i.e., whether the complete job fails, succeeds, or is skipped Fig (32).

Fig (33)
Fig (33) shows the job-notification view. After setup, just click the ‘Run Now’ button as shown in Fig (29).

Fig (34)
After clicking Run Now, we can view the run as a graph or list of the tasks being executed, as shown in Fig (34).

Fig (35)
Fig (35) is the ‘Timeline view’, which shows how much time each task takes to execute. It also shows the dependencies and sequence: while a task is executing, the next task is blocked and starts only when the previous one is done.

Fig (36)
Fig (36) shows the ‘Graph view’ of the tasks, indicating whether each is executing, failed, or succeeded.

Fig (37)
Fig (37) is the ‘List view’ of the executing tasks.
HENCE: we have covered the whole path: how to connect to a database, load data, query it, create insight dashboards, perform incremental loading, and automate the incremental loads through jobs.
CONCLUSIONS
This blog demonstrates a complete, end-to-end data journey — from a traditional Oracle database to a modern, interactive Databricks dashboard. By establishing a secure connection, organizing the data into well-structured catalogs and schemas, and converting tables into Delta format, the healthcare dataset was made highly accessible, scalable, and optimized for analytics.
The use of incremental loading with upsert logic ensures that the system processes only new or updated records instead of reloading the entire dataset every time. This makes the solution more efficient, cost-effective, and production-ready. Automating the process using Databricks Jobs further eliminates manual effort and guarantees consistent data refresh at scheduled intervals.
Finally, by transforming the prepared data into well-defined datasets and visualizing them through Databricks dashboards, complex healthcare data is converted into clear, actionable insights. This approach not only enhances reporting efficiency but also supports faster, more informed decision-making.
Overall, this workflow provides a scalable, automated, and enterprise-ready framework for modern data engineering and analytics in the healthcare domain.