Chapter 1
DATA INGESTION
Data ingestion is the process of bringing data into a system from various sources so that we can process, analyze, and store it. Day to day, businesses collect mountains of data, from sales transactions and customer feedback to social media, but this raw data is scattered everywhere like puzzle pieces in different rooms. Data ingestion brings all those pieces together into one place so that we can make sense of it.
In this blog we will dive deep into how to ingest data from Salesforce, a Customer Relationship Management (CRM) platform that is already available under the Data Ingestion section of the navigation panel.
We will sign up for the 30-day trial provided by the Salesforce platform to learn the ingestion process.

Fig ( 1 )
To ingest data from Salesforce, you first need to sign up with Salesforce; the credentials you receive there are required later when validating the source connection.
Then go to Setup and create a connection with Databricks under the Connected Apps section so that data can be ingested from Salesforce into Databricks.

Fig ( 2 )
In Setup we create a new connected app and enable it to be accessed by external clients. In Fig (2) we can see the connected app named 'Databricks_Integration', with its permitted users set to 'All users may self-authorize'. This is the connected app required to connect Databricks with Salesforce.
A consumer key and consumer secret are also obtained in this process, but they can be validated only in the premium versions of both platforms.

Fig ( 3 )
To ingest the data into the data lake, first select "Data Ingestion" inside the "Data Engineering" navigation pane. Inside Data Ingestion we find the Databricks connectors (refer to Fig (3)). From these connectors, select the "Salesforce" connector. After that we can set up the pipeline that fetches the data from Salesforce and loads it into the lakehouse (refer to Fig (4) and Fig (5)).

Fig ( 4 )
First set a name for the pipeline, then choose where its logs and metadata should be stored, select a schema for your data, and finally create a connection to Salesforce for authentication.

Fig ( 5 )
- Catalog: the catalog that organizes your data assets.
- Schema: the schema (database) inside the catalog.
- Connection: creating a connection to the source requires Salesforce credentials such as the username, password, and security token.
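Before validating the connection in Databricks, it can help to sanity-check these Salesforce credentials on their own. Below is a minimal sketch using the simple-salesforce Python library (not part of the Databricks connector; the credential values are placeholders):

```python
# pip install simple-salesforce
from simple_salesforce import Salesforce

# Placeholder credentials -- substitute the values from your Salesforce trial account.
sf = Salesforce(
    username="you@example.com",
    password="your-password",
    security_token="your-security-token",
)

# List a few Salesforce objects to confirm the credentials work.
for obj in sf.describe()["sobjects"][:5]:
    print(obj["name"])
```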

Fig ( 6 )
Once the above fields are filled in, click "Create pipeline and continue" to move to the next step.
Fig ( 7 )
This is the second step of ingesting data from Salesforce.
Here we choose which Salesforce data (tables) we want to bring into Databricks. We can import all the tables at once, or choose only a specific set of tables. Beside each table there is a checkbox, so we can import only certain tables instead of the whole database, which saves storage and speeds up processing.

Fig ( 8 )
Fig (8) shows that beside each table there is also a column selection, so we can import only certain columns instead of the whole table. This further saves storage and speeds up processing.

Fig ( 9 )
Fig (9) shows the third step of ingesting data from Salesforce.
Here we decide where in Databricks the ingested Salesforce data will be stored. Think of the data being kept in a "folder" (a schema) inside a "cabinet" (a catalog). The destination section shows all the catalogs and schemas we have access to.
- A catalog is the top-level container (like a big cabinet).
- A schema is a sub-container inside a catalog (like a folder).
We can also create a new schema here, so that each ingestion run keeps its Salesforce tables in a separate schema for easier access to the data, as sketched below.
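For reference, the destination catalog and schema can also be created from a Databricks notebook. The sketch below assumes the catalog cp used in this walkthrough; the schema name salesforce_trial is purely illustrative:

```python
# Run in a Databricks notebook, where `spark` is already available.
# Create the destination catalog and schema if they do not already exist.
spark.sql("CREATE CATALOG IF NOT EXISTS cp")
spark.sql(
    "CREATE SCHEMA IF NOT EXISTS cp.salesforce_trial "
    "COMMENT 'Destination schema for the Salesforce ingestion pipeline'"
)
```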

Fig ( 10 )
Fig (10) shows the final step of ingesting data from Salesforce.
The Schedules section decides how often Databricks should automatically pull fresh data from Salesforce. Here a schedule is set to run every 6 hours, and the job for this schedule is named Prisoft_Data_Integration. This means:
- Every 6 hours, Databricks will connect to Salesforce.
- Fetch the latest data.
- Store/update it in cp.default tables.
In our case, this is the sample data provided by the CRM environment as part of the free trial.
Setting the email recipient to "Failure" means the person will only receive an email if the pipeline fails. If "Success" were also checked, the person would receive an email every time the pipeline runs successfully.
Lastly, "Save and close" just saves the setup without running it right now, while "Save and run pipeline" saves the setup and immediately runs the ingestion job once before the schedule takes over.
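The ingestion wizard sets this schedule up for us, but a roughly equivalent schedule can be expressed with the Databricks Python SDK's Jobs API. The sketch below is illustrative only: the pipeline ID is a placeholder, and the quartz expression encodes the every-6-hours cadence described above.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up credentials from the environment / .databrickscfg

job = w.jobs.create(
    name="Prisoft_Data_Integration",
    tasks=[
        jobs.Task(
            task_key="ingest_salesforce",
            # Placeholder: the pipeline ID shown in the Pipeline Details panel.
            pipeline_task=jobs.PipelineTask(pipeline_id="<your-pipeline-id>"),
        )
    ],
    # Quartz cron: at minute 0 of every 6th hour, every day.
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 */6 * * ?",
        timezone_id="UTC",
    ),
)
print(job.job_id)
```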

Fig ( 11 )
Fig (11) above is the pipeline and jobs page for the Prisoft_Data_Integration ingestion job from Salesforce.
Label 1 shows the Main Table View:
This section lists all the tables currently being ingested from Salesforce into Databricks.
Columns here mean:
- Name → Name of the table from Salesforce (e.g., event, eventfeed).
- Catalog / Schema → where the table is stored (here: catalog cp, schema default).
- Type → Streaming table means the data is continuously updated rather than static batch loads.
- Duration → Time taken for the last update (e.g., 3s, 4s).
- Upserted / Deleted → Number of rows inserted/updated or deleted during the last run.
- Icons on the far right let you open details or preview data.
So, in simple words, this is the live status of each Salesforce table as it is being synced.
Label 2 shows the Event Log:
This is a real-time activity log showing the exact steps happening in the pipeline.
Example:
- user_action → The pipeline was manually triggered by celeasa@localhost.
- create_update → The update process was started via an API call.
- update_progress → The update went from WAITING_FOR_RESOURCES to INITIALIZING, i.e., Databricks is preparing compute resources to run it.
So, in simple words, if something fails, check here to see why.
Label 3 shows the Pipeline Details Panel:
This shows static metadata about the pipeline:
- Pipeline ID → Unique ID for this ingestion pipeline.
- Pipeline type → Ingestion pipeline (pulling from an external source).
- Connection → The saved Salesforce connection being used (prisoft_connection).
- Run as → User account executing the pipeline (celeasa@localhost).
- Tags → Labels you can add for organization (none added here).
Simply put, this is a configuration reference rather than live progress.
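The same information shown in these three panels can also be pulled programmatically with the Databricks Python SDK. A minimal sketch, assuming a placeholder pipeline ID:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
pipeline_id = "<your-pipeline-id>"  # shown in the Pipeline Details panel

# Static configuration, similar to the Pipeline Details panel (Label 3).
info = w.pipelines.get(pipeline_id=pipeline_id)
print(info.name, info.state)

# Recent event log entries, similar to the Event Log (Label 2).
for event in w.pipelines.list_pipeline_events(pipeline_id=pipeline_id, max_results=10):
    print(event.timestamp, event.event_type, event.message)
```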

Fig ( 12 )
This page (the graph view of the "Prisoft_Data_Integration" pipeline) helps manage Salesforce table refreshes in Databricks. In Databricks Free Edition, this page has a "Select Tables for Refresh" button that acts like a control panel for bringing in the latest Salesforce data. On the left, we have the original Salesforce object views, and on the right, their corresponding Delta tables in Databricks. When we pick a table and refresh it, Databricks uses its built-in serverless engine (no cluster setup needed) to pull the newest records from Salesforce and store them in the lakehouse.
The green check marks and timestamps show exactly when each table was last updated, which is perfect for confirming that the latest Salesforce changes have made it into Databricks. We can choose which tables to update, view the latest sync status, and add new ones.
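The same selective refresh can be triggered programmatically with the Databricks Python SDK. A minimal sketch, where the pipeline ID is a placeholder and the table names (taken from Fig (11)) are illustrative:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Refresh only a subset of the ingested tables, similar to the
# "Select Tables for Refresh" button in the UI.
w.pipelines.start_update(
    pipeline_id="<your-pipeline-id>",
    refresh_selection=["event", "eventfeed"],
)
```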

Fig ( 13 )
After a successful data ingestion, we can go to the Catalog navigation panel and find all the ingested tables inside the catalog and schema we stored them in.
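A quick way to verify this from a notebook is to list and preview the ingested tables. The sketch below uses the cp.default destination from this walkthrough and the event table from Fig (11); adjust the names to your own catalog and schema:

```python
# Run in a Databricks notebook, where `spark` and `display` are available.
display(spark.sql("SHOW TABLES IN cp.default"))           # list the ingested tables
display(spark.read.table("cp.default.event").limit(10))   # preview one of them
```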

Fig ( 14 )
Fig (14) is a Databricks page that shows the compute configuration for a Delta Live Tables (DLT) pipeline execution. Delta Live Tables (DLT) is a feature in Databricks that helps to build automated, reliable and maintainable data pipelines without writing a lot of cluster management code.
But as we are using the Free Community Edition:
- We cannot manually create clusters.
- Databricks shows the page in Fig (14) because DLT internally needs a cluster, even if we cannot control it.
- The compute is managed by Databricks, so most fields are locked and cannot be edited.
In the Free Community Edition, when we run a pipeline:
- Databricks starts a hidden, pre-configured serverless cluster.
- The runtime is fixed to a DLT-compatible version.
- The cluster runs the pipeline, writes the Delta tables, and then terminates automatically.
So the page basically shows the back-end serverless cluster that the Databricks Community Edition uses to run the DLT pipeline. We can see its configuration but cannot change it; Databricks fully manages it.
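For context, this is roughly what a hand-written DLT table definition looks like in Python. The managed Salesforce connector generates its pipeline for us, so the sketch below is not the connector's code; the source table name is purely illustrative:

```python
import dlt
from pyspark.sql import functions as F

# A minimal DLT table: read an ingested table and add a derived column.
# "cp.default.account" is an illustrative source table name.
@dlt.table(comment="Accounts with an ingestion timestamp column")
def accounts_clean():
    return (
        spark.read.table("cp.default.account")
             .withColumn("ingested_at", F.current_timestamp())
    )
```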
In the end, data ingestion isn't just about moving data; it's about making sure that the right data arrives in the right place, ready to deliver insights. With Databricks, we can turn scattered information into a reliable, real-time fuel source for analytics, AI, and smarter decisions.