Connect to dbt Core
Note
This article covers dbt Core, a version of dbt for your local development machine that interacts with Databricks SQL warehouses and Databricks clusters within your Databricks workspaces. To use the hosted version of dbt (called dbt Cloud) instead, see Connect to dbt Cloud.
dbt (data build tool) is a development environment that enables data analysts and data engineers to transform data by simply writing select statements. dbt handles turning these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Databricks. dbt supports collaborative coding patterns and best practices such as version control, documentation, modularity, and more.
dbt does not extract or load data. dbt focuses on the transformation step only, using a “transform after load” architecture. dbt assumes that you already have a copy of your data in your database.
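For example, a dbt model is simply a select statement saved in a .sql file; when you run the model, dbt generates the DDL needed to materialize the results. The following sketch uses hypothetical table and column names to illustrate the idea:

-- models/recent_orders.sql (a hypothetical model file)
-- dbt compiles this select and wraps it in DDL such as:
--   create or replace view <schema>.recent_orders as (...)
select
  order_id,
  customer_id,
  order_date
from orders
where order_date >= '2021-01-01'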
This article focuses on using dbt Core. dbt Core enables you to write dbt code in the text editor or IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The dbt CLI is free to use and open source.
A hosted version of dbt called dbt Cloud is also available. dbt Cloud comes equipped with turnkey support for scheduling jobs, CI/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). For more information, see Connect to dbt Cloud. The dbt Cloud Developer plan provides one free developer seat; Team and Enterprise paid plans are also available. For more information, see dbt Pricing on the dbt website.
Because dbt Core and dbt Cloud can use hosted git repositories (for example, on GitHub, GitLab, or Bitbucket), you can use dbt Core to create a dbt project and then make it available to your dbt Cloud users. For more information, see Creating a dbt project and Using an existing project on the dbt website.
For a general overview of dbt, watch the following YouTube video (26 minutes).
Requirements
Before you install dbt Core, you must install the following on your local development machine:

Python (this article's examples use version 3.8.6)
pipenv
Step 1: Create and activate a Python virtual environment
In this step, you use pipenv to create a Python virtual environment. We recommend using a Python virtual environment, as it isolates package versions and code dependencies to that specific environment, regardless of the package versions and code dependencies within other environments. This helps reduce unexpected package version mismatches and code dependency collisions.
From your terminal, switch to an empty directory, creating that directory first if necessary. This procedure creates an empty directory named dbt_demo in the root of your user home directory.

On Unix, Linux, or macOS:

mkdir ~/dbt_demo
cd ~/dbt_demo

On Windows:

mkdir %USERPROFILE%\dbt_demo
cd %USERPROFILE%\dbt_demo
In this empty directory, create a file named Pipfile with the following content. This Pipfile instructs pipenv to use Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number.

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
dbt-databricks = "*"

[requires]
python_version = "3.8.6"
Note

The preceding line dbt-databricks = "*" instructs pipenv to use the latest version of the dbt-databricks package. In production scenarios, you should replace * with the specific version of the package that you want to use. See dbt-databricks Release history on the Python Package Index (PyPI) website.

Create a Python virtual environment in this directory by running pipenv and specifying the Python version to use. This command specifies Python version 3.8.6. If you use a different version, replace 3.8.6 with your version number:

pipenv --python 3.8.6
Install the dbt Databricks adapter by running pipenv with the install option. This installs the packages in your Pipfile, which includes the dbt Databricks adapter package, dbt-databricks, from PyPI. The dbt Databricks adapter package automatically installs dbt Core and other dependencies.

Important

If your local development machine uses any of the following operating systems, you must complete additional steps first: CentOS, macOS, Ubuntu, Debian, and Windows. See the "Does my operating system have prerequisites" section of Use pip to install dbt on the dbt Labs website.

pipenv install
Activate this virtual environment by running pipenv shell. To confirm the activation, the terminal displays (dbt_demo) before the prompt. The virtual environment begins using the specified version of Python and isolates all package versions and code dependencies within this new environment.

pipenv shell
Note

To deactivate this virtual environment, run exit. (dbt_demo) disappears from before the prompt. If you run python --version or pip list with this virtual environment deactivated, you might see a different version of Python, a different list of available packages or package versions, or both.

Confirm that your virtual environment is running the expected version of Python by running python with the --version option.

python --version
If an unexpected version of Python displays, make sure you have activated your virtual environment by running pipenv shell.

Confirm that your virtual environment is running the expected versions of dbt and the dbt Databricks adapter by running dbt with the --version option.

dbt --version

If an unexpected version of dbt or the dbt Databricks adapter displays, make sure you have activated your virtual environment by running pipenv shell. If an unexpected version still displays, try installing dbt or the dbt Databricks adapter again after you activate your virtual environment.
Step 2: Create a dbt project and specify and test connection settings
In this step, you create a dbt project, which is a collection of related directories and files that are required to use dbt. You then configure your connection profiles, which contain connection settings to a Databricks cluster, a SQL warehouse, or both. To increase security, dbt projects and profiles are stored in separate locations by default.
Tip
You can connect to an existing cluster or SQL warehouse, or you can create a new one.
An existing cluster or SQL warehouse can be efficient for multiple dbt projects, for using dbt in a team, or for development use cases.
A new cluster or SQL warehouse allows you to run a single dbt project in isolation for production use cases, and to leverage automatic termination to save costs when that dbt project is not running.

Use Databricks to create a new cluster or SQL warehouse, and then reference the newly created or existing cluster or SQL warehouse from your dbt profile.
With the virtual environment still activated, run the dbt init command with a name for your project. This procedure creates a project named my_dbt_demo.
dbt init my_dbt_demo
When you are prompted to choose a databricks or spark database, enter the number that corresponds to databricks.

When prompted for a host value, do the following:

For a cluster, enter the Server Hostname value from the Advanced Options, JDBC/ODBC tab for your Databricks cluster.
For a SQL warehouse, enter the Server Hostname value from the Connection Details tab for your SQL warehouse.
When prompted for an http_path value, do the following:

For a cluster, enter the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your Databricks cluster.
For a SQL warehouse, enter the HTTP Path value from the Connection Details tab for your SQL warehouse.

When prompted for a token, enter the value of your Databricks personal access token.

Note

As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends that you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.
When prompted for the desired Unity Catalog option value, enter the number that corresponds with use Unity Catalog or not use Unity Catalog.

If you chose to use Unity Catalog, enter the desired value for catalog when prompted.

Enter the desired values for schema and threads when prompted.

dbt writes your entries to a profiles.yml file. The location of this file is listed in the output of the dbt init command. You can also list this location later by running the dbt debug --config-dir command. You can open this file now to examine and verify its contents.
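For reference, a profiles.yml for this tutorial might look similar to the following sketch, which follows the dbt-databricks profile format. The host, http_path, and token values are placeholders; substitute the values for your own cluster or SQL warehouse, and include the catalog entry only if you chose to use Unity Catalog:

my_dbt_demo:
  outputs:
    dev:
      type: databricks
      catalog: main                                    # placeholder; omit if not using Unity Catalog
      schema: default
      host: dbc-a1b2345c-d6e7.cloud.databricks.com     # placeholder
      http_path: /sql/1.0/warehouses/1abc2d3456e7f890  # placeholder
      token: dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX      # placeholder personal access token
      threads: 1
  target: dev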
Confirm that the connection details are correct by running the dbt debug command.

Important
Before you begin, verify that your cluster or SQL warehouse is running.
You should see output similar to the following:
dbt debug
...
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]
Required dependencies:
  - git [OK found]
Connection:
  ...
  Connection test: OK connection ok
Step 3: Create and run models
In this step, you use your favorite text editor to create models, which are select statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample diamonds table from the Sample datasets, as described in the Create a table section of Tutorial: Query data with notebooks. This procedure assumes this table has already been created in your workspace's default database.
In the project's models directory, create a file named diamonds_four_cs.sql with the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the diamonds table. The config block instructs dbt to create a table in the database based on this statement.

{{ config(
  materialized='table',
  file_format='delta'
) }}

select carat, cut, color, clarity
from diamonds
Tip

For additional config options such as using the Delta file format and the merge incremental strategy, see Apache Spark configurations on the dbt website and the "Model Configuration" and "Incremental Models" sections of the Usage Notes in the dbt-labs/dbt-spark repository in GitHub.
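For instance, an incremental model that combines the Delta file format with the merge incremental strategy might be configured as in the following sketch; the source table (new_orders) and unique_key column are illustrative only:

{{ config(
  materialized='incremental',
  file_format='delta',
  incremental_strategy='merge',
  unique_key='order_id'
) }}

-- On incremental runs, dbt merges rows from this select into the existing
-- table, matching existing rows on the unique_key column.
select * from new_orders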
In the project's models directory, create a second file named diamonds_list_colors.sql with the following SQL statement. This statement selects unique values from the color column in the diamonds_four_cs table, sorting the results in alphabetical order first to last. Because there is no config block, this model instructs dbt to create a view in the database based on this statement.

select distinct color
from {{ ref('diamonds_four_cs') }}
sort by color asc
In the project's models directory, create a third file named diamonds_prices.sql with the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement.

select color, avg(price) as price
from diamonds
group by color
order by price desc
With the virtual environment activated, run the dbt run command with the paths to the three preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named diamonds_four_cs and two views named diamonds_list_colors and diamonds_prices. dbt gets these view and table names from their related .sql file names.
file names.dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql
...
... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
... |
... | Finished running 1 table model, 2 view models ...
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
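Note that the --model argument is optional: if you omit it, dbt runs every model in the project.

dbt run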
Run the following SQL code to list information about the new views and to select all rows from the table and views.
If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.
SHOW VIEWS IN default;

+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | diamonds_list_colors | false       |
+-----------+----------------------+-------------+
| default   | diamonds_prices      | false       |
+-----------+----------------------+-------------+

SELECT * FROM diamonds_four_cs;

+-------+---------+-------+---------+
| carat | cut     | color | clarity |
+=======+=========+=======+=========+
| 0.23  | Ideal   | E     | SI2     |
+-------+---------+-------+---------+
| 0.21  | Premium | E     | SI1     |
+-------+---------+-------+---------+
...

SELECT * FROM diamonds_list_colors;

+-------+
| color |
+=======+
| D     |
+-------+
| E     |
+-------+
...

SELECT * FROM diamonds_prices;

+-------+---------+
| color | price   |
+=======+=========+
| J     | 5323.82 |
+-------+---------+
| I     | 5091.87 |
+-------+---------+
...
Step 4: Create and run more complex models
In this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models.
Run the following SQL code to create the necessary data tables.
If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.
The tables and views in this step start with zzz_ to help identify them as part of this example. You do not need to follow this pattern for your own tables and views.

DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;

CREATE TABLE zzz_game_opponents (
  game_id INT,
  home_team_id INT,
  visitor_team_id INT
) USING DELTA;

INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
INSERT INTO zzz_game_opponents VALUES (6, 3, 2);

-- Result:
-- +---------+--------------+-----------------+
-- | game_id | home_team_id | visitor_team_id |
-- +=========+==============+=================+
-- | 1       | 1            | 2               |
-- +---------+--------------+-----------------+
-- | 2       | 1            | 3               |
-- +---------+--------------+-----------------+
-- | 3       | 2            | 1               |
-- +---------+--------------+-----------------+
-- | 4       | 2            | 3               |
-- +---------+--------------+-----------------+
-- | 5       | 3            | 1               |
-- +---------+--------------+-----------------+
-- | 6       | 3            | 2               |
-- +---------+--------------+-----------------+

CREATE TABLE zzz_game_scores (
  game_id INT,
  home_team_score INT,
  visitor_team_score INT
) USING DELTA;

INSERT INTO zzz_game_scores VALUES (1, 4, 2);
INSERT INTO zzz_game_scores VALUES (2, 0, 1);
INSERT INTO zzz_game_scores VALUES (3, 1, 2);
INSERT INTO zzz_game_scores VALUES (4, 3, 2);
INSERT INTO zzz_game_scores VALUES (5, 3, 0);
INSERT INTO zzz_game_scores VALUES (6, 3, 1);

-- Result:
-- +---------+-----------------+--------------------+
-- | game_id | home_team_score | visitor_team_score |
-- +=========+=================+====================+
-- | 1       | 4               | 2                  |
-- +---------+-----------------+--------------------+
-- | 2       | 0               | 1                  |
-- +---------+-----------------+--------------------+
-- | 3       | 1               | 2                  |
-- +---------+-----------------+--------------------+
-- | 4       | 3               | 2                  |
-- +---------+-----------------+--------------------+
-- | 5       | 3               | 0                  |
-- +---------+-----------------+--------------------+
-- | 6       | 3               | 1                  |
-- +---------+-----------------+--------------------+

CREATE TABLE zzz_games (
  game_id INT,
  game_date DATE
) USING DELTA;

INSERT INTO zzz_games VALUES (1, '2020-12-12');
INSERT INTO zzz_games VALUES (2, '2021-01-09');
INSERT INTO zzz_games VALUES (3, '2020-12-19');
INSERT INTO zzz_games VALUES (4, '2021-01-16');
INSERT INTO zzz_games VALUES (5, '2021-01-23');
INSERT INTO zzz_games VALUES (6, '2021-02-06');

-- Result:
-- +---------+------------+
-- | game_id | game_date  |
-- +=========+============+
-- | 1       | 2020-12-12 |
-- +---------+------------+
-- | 2       | 2021-01-09 |
-- +---------+------------+
-- | 3       | 2020-12-19 |
-- +---------+------------+
-- | 4       | 2021-01-16 |
-- +---------+------------+
-- | 5       | 2021-01-23 |
-- +---------+------------+
-- | 6       | 2021-02-06 |
-- +---------+------------+

CREATE TABLE zzz_teams (
  team_id INT,
  team_city VARCHAR(15)
) USING DELTA;

INSERT INTO zzz_teams VALUES (1, "San Francisco");
INSERT INTO zzz_teams VALUES (2, "Seattle");
INSERT INTO zzz_teams VALUES (3, "Amsterdam");

-- Result:
-- +---------+---------------+
-- | team_id | team_city     |
-- +=========+===============+
-- | 1       | San Francisco |
-- +---------+---------------+
-- | 2       | Seattle       |
-- +---------+---------------+
-- | 3       | Amsterdam     |
-- +---------+---------------+
In the project's models directory, create a file named zzz_game_details.sql with the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. The config block instructs dbt to create a table in the database based on this statement.

-- Create a table that provides full details for each game, including
-- the game ID, the home and visiting teams' city names and scores,
-- the game winner's city name, and the game date.

{{ config(
  materialized='table',
  file_format='delta'
) }}

-- Step 4 of 4: Replace the visitor team IDs with their city names.
select
  game_id,
  home,
  t.team_city as visitor,
  home_score,
  visitor_score,
  -- Step 3 of 4: Display the city name for each game's winner.
  case
    when home_score > visitor_score
      then home
    when visitor_score > home_score
      then t.team_city
  end as winner,
  game_date as date
from (
  -- Step 2 of 4: Replace the home team IDs with their actual city names.
  select
    game_id,
    t.team_city as home,
    home_score,
    visitor_team_id,
    visitor_score,
    game_date
  from (
    -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
    select
      g.game_id,
      go.home_team_id,
      gs.home_team_score as home_score,
      go.visitor_team_id,
      gs.visitor_team_score as visitor_score,
      g.game_date
    from
      zzz_games as g,
      zzz_game_opponents as go,
      zzz_game_scores as gs
    where
      g.game_id = go.game_id
      and g.game_id = gs.game_id
  ) as all_ids,
  zzz_teams as t
  where all_ids.home_team_id = t.team_id
) as visitor_ids,
zzz_teams as t
where visitor_ids.visitor_team_id = t.team_id
order by game_date desc
In the project's models directory, create a file named zzz_win_loss_records.sql with the following SQL statement. This statement creates a view that lists team win-loss records for the season.

-- Create a view that summarizes the season's win and loss records by team.

-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
  winner as team,
  count(winner) as wins,
  -- Each team played in 4 games.
  (4 - count(winner)) as losses
from (
  -- Step 1 of 2: Determine the winner and loser for each game.
  select
    game_id,
    winner,
    case
      when home = winner
        then visitor
      else home
    end as loser
  from {{ ref('zzz_game_details') }}
)
group by winner
order by wins desc
With the virtual environment activated, run the dbt run command with the paths to the two preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named zzz_game_details and one view named zzz_win_loss_records. dbt gets these view and table names from their related .sql file names.
file names.dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
...
... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
... |
... | Finished running 1 table model, 1 view model ...
Completed successfully
Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
Run the following SQL code to list information about the new view and to select all rows from the table and view.
If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.
SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';

+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | zzz_win_loss_records | false       |
+-----------+----------------------+-------------+

SELECT * FROM zzz_game_details;

+---------+---------------+---------------+------------+---------------+---------------+------------+
| game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
+=========+===============+===============+============+===============+===============+============+
| 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
+---------+---------------+---------------+------------+---------------+---------------+------------+

SELECT * FROM zzz_win_loss_records;

+---------------+------+--------+
| team          | wins | losses |
+===============+======+========+
| Amsterdam     | 3    | 1      |
+---------------+------+--------+
| San Francisco | 2    | 2      |
+---------------+------+--------+
| Seattle       | 1    | 3      |
+---------------+------+--------+
Step 5: Create and run tests
In this step, you create tests, which are assertions you make about your models. When you run these tests, dbt tells you if each test in your project passes or fails.

There are two types of tests. Schema tests, applied in YAML, return the number of records that do not pass an assertion. When this number is zero, all records pass, and therefore the tests pass. Data tests are specific queries that must return zero records to pass.
In the project's models directory, create a file named schema.yml with the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination.

version: 2

models:
  - name: zzz_game_details
    columns:
      - name: game_id
        tests:
          - unique
          - not_null
      - name: home
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: visitor
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: home_score
        tests:
          - not_null
      - name: visitor_score
        tests:
          - not_null
      - name: winner
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: date
        tests:
          - not_null
  - name: zzz_win_loss_records
    columns:
      - name: team
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('zzz_game_details')
              field: home
      - name: wins
        tests:
          - not_null
      - name: losses
        tests:
          - not_null
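Behind the scenes, dbt compiles each schema test into a query that counts the records failing the assertion. For example, the not_null test on game_id compiles to a query similar to the following sketch (the exact SQL that dbt generates differs by dbt version):

select count(*)
from default.zzz_game_details
where game_id is null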
In the project's tests directory, create a file named zzz_game_details_check_dates.sql with the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season.

-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.

select date
from {{ ref('zzz_game_details') }}
where date < '2020-12-12'
  or date > '2021-02-06'
In the project's tests directory, create a file named zzz_game_details_check_scores.sql with the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied.

-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.

select home_score, visitor_score
from {{ ref('zzz_game_details') }}
where home_score < 0
  or visitor_score < 0
  or home_score = visitor_score
In the project's tests directory, create a file named zzz_win_loss_records_check_records.sql with the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed.

-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.

select wins, losses
from {{ ref('zzz_win_loss_records') }}
where wins < 0 or wins > 4
  or losses < 0 or losses > 4
  or (wins + losses) > 4
With the virtual environment activated, run the dbt test command.

dbt test --models zzz_game_details zzz_win_loss_records
...
... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
...
... |
... | Finished running 19 tests ...
Completed successfully
Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19
Step 6: Clean up
You can delete the tables and views you created for this example by running the following SQL code.
If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.

DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;
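To confirm that the cleanup succeeded, you can list any remaining example tables and views. This optional check assumes the objects were created in the default database:

SHOW TABLES IN default LIKE 'zzz_*';
SHOW TABLES IN default LIKE 'diamonds*';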
Next steps
Learn more about dbt models.
Learn more about how to test your dbt projects.
Learn how to use Jinja, a templating language, for programming SQL in your dbt projects.
Learn about dbt best practices.
Learn about dbt Cloud, a hosted version of dbt.
Troubleshooting
This section addresses common issues when using dbt Core with Databricks.
General troubleshooting
See Getting help on the dbt Labs website.