Unity Catalog is Databricks' centralized governance solution for managing all data and AI assets, such as tables, files, and dashboards. It mainly helps to:
- Organize data, like folders for data tables.
- Secure data, by controlling who can access what.
- Track data, by recording where the data came from and how it is used.
Databricks has evolved rapidly to handle both big data engineering and AI workloads. At the heart of this evolution lies how data is stored, cataloged, and governed. Over time, three key components have shaped Databricks’ approach:
- DBFS (Databricks File System) – foundational file storage.
- Hive Metastore – legacy metadata management.
- Unity Catalog – modern, enterprise-grade governance.
What is DBFS?
DBFS (Databricks File System) is a distributed file system mounted on top of cloud storage (AWS S3, Azure Blob, GCP Storage). It allows storing raw files (CSV, JSON, Parquet, etc.), notebooks, and libraries.
- Easy to use, cluster-integrated.
- Lacks governance and metadata management.
- Use case: Staging, experimentation, file-level data storage.
What is Hive Metastore?
Before Unity Catalog, Hive Metastore (HMS) was the default metadata store in Databricks. It maintained information about tables, schemas, and data types, enabling SQL queries across structured data.
- Provided a shared metadata store for SQL & Spark.
- Allowed creation of tables, schemas, and queries on top of DBFS or external storage.
- Limited access control (mostly database/table level).
- Tied to a single workspace, not cross-workspace.
- No data lineage or advanced governance.
- Use case: Basic metadata management for analytics.
- It's workspace-scoped, i.e., limited to a single workspace.
What is Unity Catalog?
Unity Catalog is Databricks' unified governance layer for data and AI assets. It replaces the Hive Metastore by providing a single, centralized place to manage:
- Access control (who can read/write data)
- Data organization (how data is structured)
- Data lineage (where data came from, how it was transformed, where it’s used)
- Works for structured, unstructured, and ML/AI assets.
- Needs configuration and migration from legacy Hive Metastore.
- Use case: Enterprise-scale governance and compliance.
- Unity Catalog governance extends across all compute types (clusters, SQL warehouses, ML runtimes)
In simple words, it can be described as:
Unity Catalog is like a security guard + librarian in your Databricks workspace.
- Security guard → decides who can open which doors (who can access which data).
- Librarian → organizes data into shelves and sections so you can easily find and use it.
Why Unity Catalog Matters
In traditional Databricks setups (before Unity Catalog):
- Permissions were workspace-specific → no global consistency.
- Governance was fragmented between tables, files, ML models.
- Auditing and lineage tracking were limited.
Unity Catalog addresses these gaps by providing:
- Centralized administration of access policies: security rules are set once in Unity Catalog and applied everywhere, across all workspaces and compute types.
- Consistent governance across multiple workspaces and clouds, so that a single governance policy applies everywhere.
- Fine-grained control (down to the row/column level if required), meaning admins can grant exactly the right level of access, all the way down to specific rows and columns.
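To make the access-policy idea concrete, here is a hedged sketch of the SQL grants an admin might run. The catalog, schema, table, and group names (trial_catalog, raw_schema, `analysts`) are assumptions for illustration; inside a Unity Catalog-enabled notebook, each statement would be executed with spark.sql(...) or in a %sql cell.

```python
# Hypothetical grants for a group of analysts (names are illustrative).
grant_statements = [
    # Let the group discover and use the catalog and schema.
    "GRANT USE CATALOG ON CATALOG trial_catalog TO `analysts`",
    "GRANT USE SCHEMA ON SCHEMA trial_catalog.raw_schema TO `analysts`",
    # Read access on one specific table only.
    "GRANT SELECT ON TABLE trial_catalog.raw_schema.drugs_package TO `analysts`",
]
for stmt in grant_statements:
    print(stmt)  # in Databricks, replace print with spark.sql(stmt)
```

Because the rules live in Unity Catalog rather than in one workspace, the same grants apply wherever the group accesses the table.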
Architecture:

Unity Catalog has a layered architecture that manages governance top-down.
1. Metastore (Top Layer)
The regional container for governance and metadata.
Stores all metadata, permissions, and audit logs.
Each organization can have multiple metastores, but a workspace can be linked to only one metastore.
2. Catalogs
Top-level namespace inside the metastore.
Logical grouping of schemas for departments or domains.
For example, finance_catalog, marketing_catalog, etc.
3. Schemas
Like a database inside a catalog.
Organize tables, views, and functions within a catalog.
For example, inside finance_catalog we can have transactions_schema and reporting_schema.
4. Data Assets (Tables, Views, Functions, ML Models)
Tables → Store structured data.
Views → Saved Queries on top of tables.
Functions → UDFs for transformations.
ML Models / Files → Also governed under Unity Catalog.
5. Security & Access Control Layer
Identity Integration → Works with identity providers (Azure AD, Okta, SCIM).
Permissions → Role-based, can be set at catalog, schema, table, column level.
Auditing → Every access is logged.
6. Data Lineage Layer
Tracks how data flows from source → table → transformation → downstream dashboards/ML.
Provides end-to-end visibility for compliance and debugging.
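The metastore → catalog → schema → asset hierarchy above surfaces in code as a three-level name. A tiny helper makes the pattern explicit (the object names here reuse this post's examples and are otherwise hypothetical):

```python
def uc_name(catalog: str, schema: str, obj: str) -> str:
    """Return the fully qualified Unity Catalog name: catalog.schema.object."""
    return f"{catalog}.{schema}.{obj}"

table = uc_name("finance_catalog", "transactions_schema", "payments")
print(table)  # finance_catalog.transactions_schema.payments
# In a Databricks notebook you could then read it with spark.table(table).
```

Every governed asset, whether a table, view, function, or model, is addressed this way, which is what lets permissions and lineage attach at any level of the hierarchy.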
DBFS vs. Legacy Hive Metastore vs. Unity Catalog
In summary:
DBFS – a storage layer for raw files (CSV, JSON, Parquet) without governance or metadata.
Hive Metastore – It was the default metadata store before Unity Catalog. Each workspace had its own HMS, making cross-workspace sharing difficult.
Unity Catalog – a centralized, multi-cloud governance solution with fine-grained access control, lineage, audit logging, and support for all data and AI assets. Unity Catalog governance extends across all compute types (clusters, SQL warehouses, ML runtimes)
Setup of Unity catalog in Databricks:
Unity Catalog is available for 14 days in the Databricks trial and permanently in the Premium edition. Let's now go through the setup process.
- Enable Unity Catalog
This is the first step, where you turn on Unity Catalog in your Databricks workspace. It unlocks centralized governance, security, and lineage features. In current Databricks deployments it may already be enabled by default.
- Create a Metastore
The metastore acts as the central store of metadata (catalogs, schemas, tables). Each workspace links to one metastore.
- Create a Catalog
A catalog is the highest-level container for organizing data assets. Think of it as a folder that groups schemas and tables under governance.
- Create a Schema
A schema (similar to a database) holds your tables and views inside a catalog, making it easier to organize datasets logically.
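The catalog and schema steps can also be done with SQL rather than the UI. A minimal sketch, assuming the trial_catalog name used later in this walkthrough (metastore creation itself is an account-level operation done in the UI):

```python
# DDL for the catalog and a first schema; in Databricks each statement
# would run via spark.sql(stmt) or in a %sql cell.
setup_ddl = [
    "CREATE CATALOG IF NOT EXISTS trial_catalog",
    "CREATE SCHEMA IF NOT EXISTS trial_catalog.raw_schema",
]
for stmt in setup_ddl:
    print(stmt)
```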
Now let’s continue for the Unity catalog setup in the premium version of Databricks:
Fig (1)
The above fig (1) is the workspace UI of the Databricks trial version. We can check the trial status and remaining credits under 'Manage trial'.
Unity Catalog is already enabled in the trial version of Databricks. To set up the workspace with Unity Catalog, we go step by step as shown in the flowchart above.
Click on the workspace dropdown at the top-right corner.

Fig (2)
From the dropdown, choose 'Manage account' to proceed to metastore creation.

Fig (3)
The above fig (3) shows the navigation bar, which lists:
- Workspaces
- Catalog
- Usage
- User management
- Cloud resources
- Previews
- Settings
Choose the Catalog option from the menu to begin the metastore creation process.

Fig (4)
After signing up for the free trial, a metastore is already assigned to your workspace, indicating that Unity Catalog was enabled automatically.
If you want to create another metastore, click 'Create metastore'.

Fig (5)
The above is another created metastore. A point to remember: only workspaces in the same region as the metastore can be assigned to it.
There is also a checkbox that enables auto-assigning the metastore to workspaces created in the same region, here 'us-east-2'.

Fig (6)
Now, after the metastore creation, the next step is creating a catalog. Since the workspace is already assigned to the Unity Catalog metastore, all catalogs created in it are assigned to that metastore by default.
To create a new catalog, go to Catalog and create one named trial_catalog, as in this example.

Fig (7)
Inside trial_catalog, add the required schemas to store the data as Delta tables.

Fig (8)
The above fig (8) shows the schemas required for the demo project:
- raw_schema (raw ingested data)
- processed_schema (cleaned/transformed data)
- analytics_schema (tables for BI/dashboards)
Although this is a 14-day trial, the notebook still runs on serverless compute, and the workspace has been successfully attached to Unity Catalog.
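The three demo schemas can also be created programmatically. A sketch, assuming the trial_catalog from this walkthrough and running each statement via spark.sql(...) in the notebook:

```python
# Build one CREATE SCHEMA statement per demo schema.
demo_schemas = ["raw_schema", "processed_schema", "analytics_schema"]
ddl = [f"CREATE SCHEMA IF NOT EXISTS trial_catalog.{s}" for s in demo_schemas]
for stmt in ddl:
    print(stmt)  # spark.sql(stmt) in a Unity Catalog-enabled notebook
```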
Below is a simple workflow in the unity catalog enabled notebook:
Before understanding the flow, remember that Unity Catalog is used to govern and organize data and to grant or revoke access; the following demonstrates only single-user-level work.
Sample work:
Assuming data is ingested into volumes as described in previous Databricks blogs, we proceed as follows:

Fig (9)
In the above fig (9) we converted the CSV data into DataFrames:
Drugs_package.csv –> df_drugs_package
Drugs_product.csv –> df_drugs_product
Drugs_unfinished_package.csv –> df_unfinished_package
Drugs_unfinished_products.csv –> df_unfinished_products
Now the DataFrames are loaded into the raw schema in Delta format, making them accessible in Databricks regardless of their original file format.
Note: tables stored in the Delta format are Delta tables, backed by Delta Lake, the open storage layer underlying the lakehouse. A schema can be fully or incrementally loaded with Delta tables, which we can then access, analyze, and save back in Delta format after processing. Admins and enabled users can access, view, and analyze them according to the permissions granted.
The DataFrames are loaded into raw_schema in Delta format, in the respective catalog and schema, as:
df_drugs_package -> trial_catalog.raw_schema.drugs_package
df_drugs_product -> trial_catalog.raw_schema.drugs_product
df_unfinished_package -> trial_catalog.raw_schema.drugs_unfinished_package
df_unfinished_products -> trial_catalog.raw_schema.drugs_unfinished_product
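The write shown in fig (9) can be sketched as follows. The mapping mirrors the list above; the commented lines need the live notebook DataFrames and SparkSession, so treat the exact write options as an assumption about the notebook's code.

```python
# Target table for each DataFrame, per the mapping above.
raw_targets = {
    "df_drugs_package": "trial_catalog.raw_schema.drugs_package",
    "df_drugs_product": "trial_catalog.raw_schema.drugs_product",
    "df_unfinished_package": "trial_catalog.raw_schema.drugs_unfinished_package",
    "df_unfinished_products": "trial_catalog.raw_schema.drugs_unfinished_product",
}
# Example for one DataFrame, inside Databricks:
# df_drugs_package.write.format("delta").mode("overwrite") \
#     .saveAsTable(raw_targets["df_drugs_package"])
for df_name, table in raw_targets.items():
    print(f"{df_name} -> {table}")
```

Because saveAsTable targets a three-level name, the resulting managed table is governed by Unity Catalog from the moment it is created.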

Fig (10)

Fig (11)
In fig (10) and fig (11) the data is loaded back into the notebook as DataFrames; the tables are in Delta format and are read from raw_schema.

Fig (12)
In fig (12) the data is cleaned, transformed, and other required operations are applied.

Fig (13)
In the above fig (13) the cleaned DataFrames are loaded into the processed schema inside trial_catalog as:
df_cleaned_product – trial_catalog.processed_schema.drugs_product
df_cleaned_package – trial_catalog.processed_schema.drugs_package
df_cleaned_unfinished_product – trial_catalog.processed_schema.drugs_unfinished_product
df_cleaned_unfinished_package – trial_catalog.processed_schema.drugs_unfinished_package
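After these writes, a quick sanity check is to list what landed in the processed schema. In Databricks this would be `spark.sql("SHOW TABLES IN trial_catalog.processed_schema").show()`; the expected names below simply restate the mapping above:

```python
# Table names we expect in trial_catalog.processed_schema, per the mapping.
expected_tables = sorted([
    "drugs_product",
    "drugs_package",
    "drugs_unfinished_product",
    "drugs_unfinished_package",
])
print(expected_tables)
```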
Sample Analysis of the datasets:

Fig (14)

Fig (15)

Fig (16)
The above figures (14, 15, 16) show some sample analysis work, covered in detail in earlier blogs.
Sample Dashboard Visualization:

Fig (17)
The above fig (17) is a sample dashboard created in the Databricks free trial; dashboard creation is not enabled in the Free edition.
Additional Information:

Fig (18)
In the above fig (18) we create a SQL warehouse as per the requirement. Cluster creation is only available in the paid version. A SQL warehouse is used only to run SQL queries and SQL-based analysis, whereas a cluster is used to run Python, PySpark, ML, and other workloads. The creation process for clusters and warehouses is, however, the same.

Fig (19)
As we can see in the above fig (19), we can define the warehouse size (here 2X-Small, costing 4 Databricks Units (DBUs) per hour) and set the auto-stop time, meaning the warehouse will shut down automatically after being idle for the specified period.

Fig (20)
In the notebook, before we start executing SQL queries or analysis, connect and attach the SQL warehouse compute named 'trial_dp_cluster' created by the user.
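Once the warehouse is attached, queries against the governed tables run as usual. A minimal sketch (the table name follows this walkthrough; in the notebook this would be a %sql cell or spark.sql(query)):

```python
# A simple row count over one of the processed tables from this demo.
query = (
    "SELECT COUNT(*) AS row_count "
    "FROM trial_catalog.processed_schema.drugs_product"
)
print(query)
```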
Conclusion:
Unity Catalog in Databricks provides a simplified yet powerful way to experience centralized governance. It simplifies and secures data management across the lakehouse.
Instead of managing permissions and data separately in each workspace, Unity Catalog provides:
- Centralized governance → One place to define data permissions, roles, and access policies.
- Data discovery → A common metastore with schemas, tables, and views that can be shared across teams.
- Fine-grained security → Row-level and column-level access control.
- Cross-workspace collaboration → Same data accessible across multiple Databricks workspaces without duplication.
- Audit & lineage → Full visibility of who accessed what data, and where the data came from.
So, in this blog we covered a brief overview of Unity Catalog.
In short: Unity Catalog makes Databricks more enterprise-ready, ensuring data security, compliance, and easier collaboration across teams.