Usage
First, create a repository with a global configuration file that you or your organization will be using. The repository should contain a dp.yml.tmpl file looking similar to this:
templates:
  my-first-template:
    template_name: my-first-template
    template_path: https://github.com/<YOUR_USERNAME>/<YOUR_TEMPLATE>.git
vars:
  username: YOUR_USERNAME
Thanks to copier, you can leverage Jinja template syntax to create easily modifiable configuration templates. Just create a copier.yml file next to the dp.yml.tmpl one and configure the template questions (read more in the copier documentation).
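For instance, a minimal copier.yml asking for the username used above might look like this (the question name and help text are illustrative):
# copier.yml (illustrative)
username:
  type: str
  help: GitHub username to substitute in the generated config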
Then, run dp init <CONFIG_REPOSITORY_URL> to initialize dp. You can also drop the <CONFIG_REPOSITORY_URL> argument; dp will then be initialized with an empty config.
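For example (the repository URL is a placeholder for your config repository):
dp init https://github.com/<YOUR_ORG>/<YOUR_CONFIG_REPOSITORY>.git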
Project creation
You can use dp create <NEW_PROJECT_PATH> to choose one of the templates added before and create the project in the <NEW_PROJECT_PATH> directory.
You can also use dp create <NEW_PROJECT_PATH> <LINK_TO_TEMPLATE_REPOSITORY> to point directly to a template repository. If <LINK_TO_TEMPLATE_REPOSITORY> proves to be the name of a template defined in dp’s config file, dp create will choose the template by name instead of trying to download the repository.
dp template-list lists all added templates.
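For example (the path and template name are illustrative):
dp create ./my-pipeline-project                    # choose one of the configured templates
dp create ./my-pipeline-project my-first-template  # use a template by its configured name
dp template-list                                   # show templates known to dp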
Project update
To update your pipeline project, use dp update <PIPELINE_PROJECT_PATH>. It will sync your existing project with the updated template version selected by the --vcs-ref option (default: HEAD).
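For example, to sync a project with a specific tag of its template (the path and ref are illustrative):
dp update ./my-pipeline-project --vcs-ref 1.2.0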
Project configuration
dp as a tool depends on a few files in your project directory. It must be able to find a config directory there with a structure looking similar to this:
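(An illustrative layout, assuming a bigquery target type and a dev environment; the exact files depend on your target types and environments.)
config
├── base
│   ├── dbt.yml
│   └── bigquery.yml
└── dev
    └── bigquery.yml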
Whenever you call dp’s command with the --env <ENV> flag, the tool will search for dbt.yml and <TARGET_TYPE>.yml files in the base and <ENV> directories and parse important info out of them, with <ENV> settings taking precedence over those listed in base. So, for example, for the following files:
# config/base/dbt.yml
target: env_execution
target_type: bigquery
# config/base/bigquery.yml
method: oauth
project: my-gcp-project
dataset: my-dataset
threads: 1
# config/dev/bigquery.yml
dataset: dev-dataset
dp test --env dev will run the dp test command using values from those files, most notably with dataset: dev-dataset overwriting the dataset: my-dataset setting.
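The effective settings for --env dev are then equivalent to this merge result (shown here for illustration):
target: env_execution
target_type: bigquery
method: oauth
project: my-gcp-project
dataset: dev-dataset
threads: 1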
dp synthesizes dbt’s profiles.yml out of those settings, among other things. However, right now it only creates a local or env_execution profile, so if you want to use different settings across different environments, you should rather use {{ env_var('VARIABLE') }} as a value and provide those settings as environment variables, e.g., by setting those in your config/<ENV>/k8s.yml file, in the envs dictionary:
# config/base/bigquery.yml
method: oauth
dataset: "{{ env_var('GCP_DATASET') }}"
project: my-gcp-project
threads: 1
# config/base/k8s.yml
# ... Kubernetes settings ...
# config/dev/k8s.yml
envs:
  GCP_DATASET: dev-dataset
# config/prod/k8s.yml
envs:
  GCP_DATASET: prod-dataset
target and target_type
The target setting in config/<ENV>/dbt.yml should be set either to local or env_execution; target_type defines which backend dbt will use and what file dp will search for; example target_types are bigquery or snowflake.
Variables
You can put a dictionary of variables to be passed to dbt in your config/<ENV>/dbt.yml file, following the convention presented in the guide at the dbt site.
E.g., if one of the fields of config/<SNOWFLAKE_ENV>/snowflake.yml looks like this:
schema: "{{ var('snowflake_schema') }}"
you should put the following in your config/<SNOWFLAKE_ENV>/dbt.yml file:
vars:
  snowflake_schema: EXAMPLE_SCHEMA
and then run dp run --env <SNOWFLAKE_ENV> (or any similar command).
You can also add “global” variables to your dp config file $HOME/.dp.yml. Be aware, however, that those variables get erased on every dp init call. It is a great idea to put commonly used variables in your organization’s dp.yml.tmpl template and make copier ask for those when initializing dp. By doing so, each member of your organization will end up with a list of user-specific variables reusable across different projects on their machine. Just remember: global-scoped variables take precedence over project-scoped ones.
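For instance, the vars section of $HOME/.dp.yml might look like this (the variable name and value are illustrative):
# $HOME/.dp.yml (fragment)
vars:
  username: jdoe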
Project compilation
dp compile prepares your project to be run on your local machine and/or deployed on a remote one.
Local run
When you get your project configured, you can run the dp run and dp test commands:
dp run runs the project on your local machine,
dp test runs tests for your project on your local machine.
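Like other dp commands, these accept the --env flag described earlier, e.g.:
dp run --env dev
dp test --env dev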
Project deployment
dp deploy will sync with your bucket provider. The provider will be chosen automatically based on the remote URL.
Usually, it is worth pointing dp deploy to a JSON or YAML file with provider-specific data like access tokens or project names. The provider-specific data should be interpreted as the **kwargs (keyword arguments) expected by a specific fsspec FileSystem implementation. One would most likely want to look at the S3FileSystem or GCSFileSystem documentation.
E.g., to connect with Google Cloud Storage, one should run:
echo '{"token": "<PATH_TO_YOUR_TOKEN>", "project_name": "<YOUR_PROJECT_NAME>"}' > gs_args.json
dp deploy --dags-path "gs://<YOUR_GS_PATH>" --blob-args gs_args.json
However, in some cases you do not need to do so, e.g., when using gcloud with properly set local credentials. In such a case, you can try to run just the dp deploy --dags-path "gs://<YOUR_GS_PATH>" command and let gcsfs search for the credentials.
Please refer to the documentation of the specific fsspec implementation for more information about the required keyword arguments.
dags-path as a config argument
You can also list your path in the config/base/airflow.yml file, as a dags_path argument:
dags_path: gs://<YOUR_GS_PATH>
# ... rest of the 'airflow.yml' file
In such a case, you do not have to provide a --dags-path flag, and you can just call dp deploy instead.
Packing and publishing
The built project can be processed into a dbt package by calling dp publish. dp publish parses manifest.json and prepares a package that lists the models outputted by transformations, saving it in the build/package directory.
Preparing dbt environment
Sometimes you would like to use standalone dbt or an application that interfaces with it (like a VS Code plugin). dp prepare-env prepares your local environment to be more conformant with standalone dbt requirements, e.g., by saving profiles.yml in the home directory.
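As a rough sketch, the synthesized profiles.yml for the bigquery example above might resemble the following; the profile name and exact layout are assumptions here, not the tool’s verbatim output:
# profiles.yml (illustrative; <PROFILE_NAME> is a placeholder)
<PROFILE_NAME>:
  target: env_execution
  outputs:
    env_execution:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: my-dataset
      threads: 1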
However, be aware that most of the time you do not need to do so, and you can comfortably use the dp run and dp test commands to interface with dbt instead.
Clean project
When finished, call dp clean to remove compilation-related directories.