Data Pipelines CLI: CLI for data platform
Introduction
Data Pipelines CLI, also called the DP tool, is a command-line tool providing an easy way to build and manage data pipelines based on dbt in an environment with GIT, Airflow, DataHub, VSCode, etc.
The tool can be used in any environment with access to a shell and Python installed.
data-pipelines-cli's main task is to cover technical complexities and provide an abstraction over all components that take part in Data Pipelines creation and execution. Thanks to the integration with a templating engine, it allows Analytics Engineers to create and configure new projects. The tool also simplifies automation as it handles deployments and publications of created transformations.
Community
Although the tool was created by GetInData and is used in their projects, it is open-sourced and everyone is welcome to use it and contribute to making it better and more powerful.
Installation
Use the package manager pip to install data-pipelines-cli:
pip install data-pipelines-cli[<flags>]
Depending on the systems that you want to integrate with, you need to provide different flags in square brackets. You can provide a comma-separated list of flags, for example:
pip install data-pipelines-cli[gcs,git,bigquery]
Depending on the data storage you have, you can use one of:
bigquery
snowflake
redshift
postgres
If you need git integration for loading packages published by other projects, or to publish them yourself, you will need:
git
If you want to deploy created artifacts (Docker images and DataHub metadata), add the following flags:
docker
datahub
These are not usually used by an individual user.
If you need Business Intelligence integration, you can use the following options:
looker
Setup an environment
This section is for Data Engineers who will be preparing and administering the whole environment. It describes the steps that should be taken to prepare the DP tool to be used in an organization to its full potential.
Create Data Pipeline project template
The first thing that you need to do is create a git repository with a project template that will later be used to create multiple projects. The template should contain the whole directory structure and files used in your projects. Additionally, it should have connection configuration for all components in your environment, CI/CD, and all other aspects specific to your company. Here you can find template examples that you can adjust to your needs: https://github.com/getindata/data-pipelines-template-example . Based on the template, Data Pipelines CLI will ask the user a series of questions to build the final project.
Thanks to copier, you can leverage Jinja template syntax to create easily modifiable configuration templates. Just create a copier.yml file and configure the template questions (read more in the copier documentation).
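For example, a minimal copier.yml asking for a project name and a target schema could look like this (the question names and defaults below are purely illustrative, not part of any official template):

_templates_suffix: ".tmpl"
project_name:
  type: str
  help: Name of your data pipelines project
  default: my_dbt_project
target_schema:
  type: str
  help: Default schema/dataset the project should write to
  default: analytics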
Create a template to setup a local environment
Working with Data Pipelines usually requires some local variables to be set in order to run and test without messing up shared environments (DEV, STAGE, PROD). To simplify working environment preparation, we also decided to use templates that ask a series of questions and generate local configuration in the home directory.
This requires a repository with a global configuration template file that you or your organization will be using. The repository should contain a dp.yml.tmpl file looking similar to this:
_templates_suffix: ".tmpl"
_envops:
  autoescape: false
  block_end_string: "%]"
  block_start_string: "[%"
  comment_end_string: "#]"
  comment_start_string: "[#"
  keep_trailing_newline: true
  variable_end_string: "]]"
  variable_start_string: "[["

templates:
  my-first-template:
    template_name: my-first-template
    template_path: https://github.com/<YOUR_USERNAME>/<YOUR_TEMPLATE>.git
vars:
  username: [[ YOUR_USERNAME ]]
The file must contain a list of available templates. The templates will be displayed and available for selection in Data Pipelines CLI. The vars section contains variables that will be passed to the project whenever it runs in the configured environment. The same rules apply to the creation of this template as to project templates.
Usage
This section is for Data Pipelines CLI's users. It presents how to use the tool and how it handles interaction with the whole data environment. The diagram below presents the typical sequence in which the different commands are executed:

Preparing working environment
The first thing that needs to be done when starting to build Data Pipelines is to prepare the working environment. This step can be done either on a local machine or on any kind of workbench (e.g. JupyterLab). You will need a link from your Data Engineer or Administrator to the template with the initial configuration. Then, run dp init <CONFIG_REPOSITORY_URL> to initialize dp. You can also drop the <CONFIG_REPOSITORY_URL> argument; dp will then get initialized with an empty config.
This step is done only the first time for each working environment you want to use.
Example:
In this example you will be asked for only one variable, username, which is used in many dp commands.
dp init https://github.com/getindata/data-pipelines-cli-init-example

Project creation
You can use dp create <NEW_PROJECT_PATH>
to choose one of the templates to create the project in the
<NEW_PROJECT_PATH>
directory.
You can also use dp create <NEW_PROJECT_PATH> <LINK_TO_TEMPLATE_REPOSITORY>
to point directly to a template
repository. If <LINK_TO_TEMPLATE_REPOSITORY>
proves to be the name of the template defined in dp’s config file,
dp create
will choose the template by the name instead of trying to download the repository.
After the template selection, you will be asked a series of predefined questions in the template. Answering them all will cause a new empty project to be generated. The project will be adjusted and personalized based on answers to the questions.
Example:
The following command starts the project creation process.
dp create our-simple-project
The first step after running this command is template selection:

We can switch between options using the up and down arrow keys and confirm the selection by pressing Enter. After that, a series of questions will be asked. Be aware that the name of the DP project should be composed of alphanumeric characters and the _ sign. After answering these questions, the tool will generate a complete project.

Adapting working environment to VSCode
VSCode is the recommended tool to work with dbt as you can add a plugin that makes the work more efficient. To configure the plugin or integrate it with some other standalone application, you will need to generate a profiles.yml file from the project. dp prepare-env prepares your local environment to be more conformant with standalone dbt requirements by saving profiles.yml in the home directory.
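For illustration, the generated profiles.yml for a BigQuery-based project might look roughly like this (the profile name and the exact fields depend on your project and config files; this is only a sketch):

my_dbt_project:
  target: local
  outputs:
    local:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: my-dataset
      threads: 1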
However, be aware that IDE usage is optional, and you can comfortably use the dp run and dp test commands to interface with dbt instead.
List all available templates
Execute dp template-list
to list all added templates.
Project update
Whenever the template changes, you can update your project using the dp update <PIPELINE_PROJECT-PATH> command.
It will sync your existing project with the updated template version selected by the --vcs-ref option (default HEAD).
It may be very useful when there are some infrastructure changes in your organization and you need to upgrade all created projects (there can be hundreds of them).
Project compilation
dp compile
prepares your project to be run on your local machine and/or deployed on a remote one.
Local run
When you get your project created, you can run the dp run and dp test commands.
dp run runs the project on your local machine, while dp test runs tests for your project on your local machine.
Both commands accept the --env parameter to select the execution environment. The default value is local.
Example:
dp run
This process will look at the contents of the models directory and create corresponding tables or views in the data storage.

Now, after all the tables and views are created, we can also check if the models work as intended by running the tests.
dp test

dbt sources and automatic models creation
With the help of dbt-codegen and
dbt-profiler, one can easily generate source.yml
, source’s base
model SQLs, and model-related YAMLs. dp offers a convenient CLI wrapper around those functionalities.
First, add the dbt-codegen package to your packages.yml
file:
packages:
- package: dbt-codegen
version: 0.5.0 # or newer
Then, run dp generate source-yaml YOUR_DATASET_NAME to generate a source.yml file in the models/source directory.
You can list more than one dataset, separated by spaces (see the example below). After that, you are free to modify this file.
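For example, to generate sources for two datasets at once (the dataset names are illustrative):
dp generate source-yaml raw_sales raw_marketing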
When you want to generate SQLs for your sources, run dp generate source-sql
. It will save those SQLs in the directory
models/staging/YOUR_DATASET_NAME
.
Finally, when you have all your models prepared (in the form of SQLs), run dp generate model-yaml MODELS_DIR
to
generate YAML files describing them (once again, you are not only free to modify them but also encouraged to do so!).
E.g., given a directory structure similar to the following (file and directory names are illustrative):
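models/
├── staging/
│   ├── my_source/
│   │   ├── stg_table_a.sql
│   │   └── stg_table_b.sql
│   └── intermediate/
│       └── int_orders_joined.sql
└── presentation/
    └── presentation.sql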
dp generate model-yaml models/
will create models/staging/my_source/my_source.yml
,
models/staging/intermediate/intermediate.yml
, and models/presentation/presentation.yml
. Beware, however, this
command WILL NOT WORK if you do not have those models created in your data warehouse already. So remember to run
dp run
(or a similar command) beforehand.
If you add the dbt-profiler package to your packages.yml
file too, you can call
dp generate model-yaml --with-meta MODELS_DIR
. dbt-profiler will add a lot of profiling metadata to
descriptions of your models.
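For example, your packages.yml could then contain both packages (the dbt-profiler entry below follows the same convention as the one above and is illustrative; check dbt Hub for the exact package name and a version compatible with your dbt):

packages:
  - package: dbt-codegen
    version: 0.5.0 # or newer
  - package: dbt-profiler
    version: 0.8.1 # example only, check dbt Hub for the latest version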
Project deployment
dp deploy
executes the deployment of a project. Depending on the configuration, the command may execute different steps described in this section. Please be aware that this command is meant for the CI/CD process and should usually not be run manually.
Blob storage synchronization
The main action of the dp deploy
command is synchronization with your bucket provider. The provider will be chosen automatically based on the remote URL.
Usually, it is worth pointing dp deploy
to a JSON or YAML file with provider-specific data like access tokens or project
names. The provider-specific data should be interpreted as the **kwargs
(keyword arguments) expected by a specific
fsspec’s FileSystem implementation. One would most likely want to
look at the S3FileSystem or
GCSFileSystem documentation.
E.g., to connect with Google Cloud Storage, one should run:
echo '{"token": "<PATH_TO_YOUR_TOKEN>", "project_name": "<YOUR_PROJECT_NAME>"}' > gs_args.json
dp deploy --dags-path "gs://<YOUR_GS_PATH>" --blob-args gs_args.json
However, in some cases, you do not need to do so, e.g. when using gcloud with properly set local credentials. In such
a case, you can try to run just the dp deploy --dags-path "gs://<YOUR_GS_PATH>"
command and let gcsfs
search for
the credentials.
Please refer to the documentation of the specific fsspec
’s implementation for more information about the required
keyword arguments.
You can also provide your path in the config/base/airflow.yml
file, as a dags_path
argument:
dags_path: gs://<YOUR_GS_PATH>
# ... rest of the 'airflow.yml' file
In such a case, you do not have to provide a --dags-path
flag, and you can just call dp deploy
instead.
Docker image
dp deploy
command builds a Docker image with dbt and the project and sends it to a Docker registry. The Docker registry may be configured via environment variables (e.g. DOCKER_AUTH_CONFIG) and the image repository can be configured in the execution_env.yml file. Use the --docker-push flag to enable pushing the Docker image during deployment.
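The relevant fragment of execution_env.yml might look similar to this (a sketch; the exact keys come from your project template, and <IMAGE_TAG> is the placeholder dp replaces with the Git commit SHA or the --docker-tag value):

image:
  repository: gcr.io/<YOUR_PROJECT>/<YOUR_IMAGE_REPOSITORY>
  tag: <IMAGE_TAG>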
DataHub synchronization
The deployment also sends metadata to DataHub based on the recipe defined in the datahub.yml file. Use the --datahub-ingest flag to enable DataHub synchronization.
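A datahub.yml recipe typically points at your DataHub ingestion endpoint, e.g. (an illustrative sketch only; the exact structure comes from your project template and DataHub's recipe format):

sink:
  type: datahub-rest
  config:
    server: "https://<YOUR_DATAHUB_GMS_HOST>:8080"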
Packing and publishing
Sometimes there is a need to reuse data created in other projects and/or by a different team. The built project can be converted to a dbt package by calling dp publish. dp publish parses manifest.json and prepares a package from the presentation layer. It lists models created by the transformations, as they are usually the final product of a project. The models are prepared in the form of dbt sources. Created metadata files are saved in the build/package directory and sent to a git repository configured in the publish.yml file.
The publication repository is usually private to a company and appropriate permissions are required. We recommend key-based communication. You can use the --key-path parameter to point to the key file with push permissions.
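For example (the key path is illustrative):
dp publish --env base --key-path ~/.ssh/deploy_key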
Using published sources
Published packages can be used as standard dbt packages by adding them to packages.yml in the following form:
packages:
- git: "https://{{env_var('DBT_GIT_USER_NAME', '')}}:{{env_var('DBT_GIT_SECRET_TOKEN', '')}}@gitlab.com/<path to your repository>"
subdirectory: "<upstream project name>"
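After running dbt deps, models from the published package can be referenced like any other dbt source, for example (the source and table names are illustrative):
select * from {{ source('upstream_project_name', 'some_published_model') }}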
Dependencies metadata
The created metadata files contain extra information about the project name (which can also be the Airflow DAG name).
"source_meta": {
"dag": "<project name>"
}
This way explicit dependencies can be created in the execution environment. For more information, see the dbt-airflow-factory documentation: https://dbt-airflow-factory.readthedocs.io/en/latest/features.html#source-dependencies
Clean project
If needed, call dp clean to remove compilation-related directories.
Load seed
One can use dp seed
to load seeds from the project. Use --env
to choose a different environment.
Serve documentation
dbt creates quite good documentation and sometimes it is useful to expose it to your coworkers on a custom port. To do that, you can run the dp docs-serve --port <port> command.
Project configuration
dp as a tool depends on a few files in your project directory. It must be able to find a config directory with a structure looking similar to this (an illustrative layout; the exact files depend on your template and the environments you define):
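config/
├── base/
│   ├── dbt.yml
│   ├── bigquery.yml
│   ├── execution_env.yml
│   ├── airflow.yml
│   ├── k8s.yml
│   ├── datahub.yml
│   └── publish.yml
├── dev/
│   └── bigquery.yml
└── prod/
    └── k8s.yml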
Whenever you call dp’s command with the --env <ENV>
flag, the tool will search for dbt.yml
and
<TARGET_TYPE>.yml
files in the base and <ENV> directories and parse important info out of them, with <ENV>
settings taking precedence over those listed in base
. So, for example, for the following files:
# config/base/dbt.yml
target: env_execution
target_type: bigquery
# config/base/bigquery.yml
method: oauth
project: my-gcp-project
dataset: my-dataset
threads: 1
# config/dev/bigquery.yml
dataset: dev-dataset
dp test --env dev
will run dp test
command using values from those files, most notably with dataset: dev-dataset
overwriting
dataset: my-dataset
setting.
dp synthesizes dbt’s profiles.yml
out of those settings among other things. However, right now it only creates
local
or env_execution
profile, so if you want to use different settings amongst different environments, you
should rather use {{ env_var('VARIABLE') }}
as a value and provide those settings as environment variables. E.g., by
setting those in your config/<ENV>/k8s.yml
file, in envs
dictionary:
# config/base/bigquery.yml
method: oauth
dataset: "{{ env_var('GCP_DATASET') }}"
project: my-gcp-project
threads: 1
# config/base/execution_env.yml
# ... General config for execution env ...
# config/base/k8s.yml
# ... Kubernetes settings ...
# config/dev/k8s.yml
envs:
GCP_DATASET: dev-dataset
# config/prod/k8s.yml
envs:
GCP_DATASET: prod-dataset
dbt configuration
The main configuration is in config/<ENV>/dbt.yml
file. At the moment it allows setting two values:
* target
- should be set either to local
or env_execution
depending on where the tool is used. local means running locally, while env_execution means execution by the scheduler in the dev or prod environment.
* target_type
- defines which backend dbt will use and what file dp will search for additional configuration
(example: bigquery
or snowflake
).
Additionally, a backend configuration file should be provided, with a name depending on the selected target_type (<target_type>.yml). For example, if target_type is set to bigquery, dp will look for a bigquery.yml file. This file should contain all the configuration that will be used to build profiles.yml. An example file for the production environment:
method: service-account
keyfile: "{{ env_var('GCP_KEY_PATH') }}"
project: gid-dataops-labs
dataset: presentation
threads: 1
timeout_seconds: 300
priority: interactive
location: europe-central2
retries: 1
Variables
You can put a dictionary of variables to be passed to dbt
in your config/<ENV>/dbt.yml
file, following the convention
presented in the guide at the dbt site.
E.g., if one of the fields of config/<SNOWFLAKE_ENV>/snowflake.yml
looks like this:
schema: "{{ var('snowflake_schema') }}"
you should put the following in your config/<SNOWFLAKE_ENV>/dbt.yml
file:
vars:
snowflake_schema: EXAMPLE_SCHEMA
and then run your dp run --env <SNOWFLAKE_ENV>
(or any similar command).
You can also add “global” variables to your dp config file $HOME/.dp.yml
. Be aware, however, that those variables
get erased on every dp init
call. It is a great idea to put commonly used variables in your organization’s
dp.yml.jinja
template and make copier ask for those when initializing dp. By doing so, each member of your
organization will end up with a list of user-specific variables reusable across different projects on their machine.
Just remember, global-scoped variables take precedence over project-scoped ones.
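For example, a global vars section in $HOME/.dp.yml could look like this (the variable is illustrative):

vars:
  username: jdoe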
Airflow configuration
Airflow-related configuration is stored in the config/<ENV>/airflow.yml file and is strongly connected to the Airflow plugin: dbt-airflow-factory. More information about this configuration can be found here.
One important setting used by dp in this file is dags_path. It sets the URL of the blob storage responsible for storing project DAGs along with other artifacts.
Execution environment configuration
All configuration describing how dbt is executed on the Airflow side is kept in execution_env.yml and <env type>.yml. More information about these settings can be found here.
Publication configuration
The config/<ENV>/publish.yml file contains configuration for creating dbt packages for downstream projects and publishing them to a git repository acting as a package registry.
| Parameter | Data type | Description |
|---|---|---|
| repository | string | HTTPS link to the repository that works as the package registry. |
| branch | string | Branch of the selected repository where packages are published. |
| username | string | User name that will be presented as the package publisher in Git. |
| email | string | Email of the package publisher. |
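An illustrative publish.yml built from the parameters above (all values are placeholders):

repository: https://gitlab.com/<YOUR_GROUP>/<YOUR_PACKAGES_REPOSITORY>.git
branch: main
username: data-pipelines-bot
email: data-pipelines-bot@example.com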
Data governance configuration
dp can send dbt metadata to DataHub. All related configuration is stored in the config/<ENV>/datahub.yml file.
More information about it can be found here and here.
Business Intelligence configuration
BI configuration is divided into two levels:
- General: config/<ENV>/bi.yml file
- BI tool related: e.g. config/<ENV>/looker.yml
config/<ENV>/bi.yml contains basic configuration for the BI integration:
| Parameter | Data type | Description |
|---|---|---|
| is_bi_enabled | bool | Flag for enabling/disabling the BI option in dp. |
| bi_target | string | BI tool you want to work with (currently only Looker is supported). |
| is_bi_compile | bool | Whether to generate BI code in the compile phase. |
| is_bi_deploy | bool | Whether to deploy and push the BI code. |
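An illustrative config/<ENV>/bi.yml using the parameters above (values are examples only):

is_bi_enabled: true
bi_target: looker
is_bi_compile: true
is_bi_deploy: false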
config/<ENV>/looker.yml contains more detailed configuration related to the BI tool:
| Parameter | Data type | Description |
|---|---|---|
| looker_repository | string | Git repository used by the Looker project you want to integrate. |
| looker_repository_username | string | Git config: username for operating with the repository. |
| looker_repository_email | string | Git config: user email for operating with the repository. |
| looker_project_id | string | Looker project ID. |
| looker_webhook_secret | string | Looker project webhook secret used for deployment. |
| looker_repository_branch | string | Looker repository branch to which new code is deployed. |
| looker_instance_url | string | URL of your Looker instance. |
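An illustrative config/<ENV>/looker.yml (all values are placeholders):

looker_repository: git@github.com:<YOUR_ORG>/<YOUR_LOOKER_PROJECT>.git
looker_repository_username: data-pipelines-bot
looker_repository_email: data-pipelines-bot@example.com
looker_project_id: <YOUR_LOOKER_PROJECT_ID>
looker_webhook_secret: <YOUR_WEBHOOK_SECRET>
looker_repository_branch: main
looker_instance_url: https://<YOUR_INSTANCE>.looker.com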
Integration with environment
Data Pipelines CLI provides an abstraction over multiple other components that take part in Data Pipeline processes. The following picture presents the whole environment handled by our tool.

dbt
dbt is currently the main tool that DP integrates with. The purpose of the DP tool is to cover dbt technicalities, including configuration, and generate it on the fly whenever needed. At the same time, it gives more control over dbt process management by chaining commands, interpolating configuration, and providing easy environment portability.
Copier
DP heavily uses Copier as its templating tool. It makes it possible to easily create new projects that are configured automatically after answering a series of questions. It is also used to configure the working environment with the required environment variables.
Docker
Docker images are one of the artifacts produced when building and publishing Data Pipelines. Each created image contains dbt with its transformations and scripts to run. Created images are environment agnostic and can be deployed in any external configuration. Images are pushed to the selected Container Registry, whose configuration should be taken from the environment (there should be a Docker client configured).
Git
The Data Pipelines CLI can also publish created dbt packages for downstream usage into a configured GIT repository. It uses key-based authentication where the key is provided as the --key-path parameter.
Airflow
DP doesn't communicate directly with Airflow; rather, it sends artifacts to object storage managed by Airflow, and the dbt-airflow-factory library handles the rest. Created projects keep the DAG and configuration required for execution on the Airflow side.
Object storage
Configuration, the Airflow DAG, and the dbt manifest.json file are stored in object storage for Airflow to pick up and execute. DP uses fsspec, which gives a good abstraction over different object storage providers. Currently, the tool has been tested with GCS and S3.
DataHub
The Data Pipelines CLI is able to send data to DataHub based on a recipe in the configuration. The tool uses the DataHub CLI under the hood.
Visual Studio Code
VS Code is one of the tools we recommend for working with dbt. The DP tool simplifies integration of the created project with the VS Code plugin for dbt management.
Airbyte
Under development
Looker
dp can generate LookML code for your models and views, and publish and deploy your Looker project.
CLI Commands Reference
If you are looking for extensive information on a specific CLI command, this part of the documentation is for you.
dp
dp [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
clean
Delete local working directories
dp clean [OPTIONS]
compile
Create local working directories and build artifacts
dp compile [OPTIONS]
Options
- --env <env>
Required Name of the environment
- Default
local
- --docker-build
Whether to build a Docker image
- --docker-tag <docker_tag>
Image tag of a Docker image to create
- --docker-args <docker_args>
Args required to build project in json format
create
Create a new project using a template
dp create [OPTIONS] PROJECT_PATH [TEMPLATE_PATH]...
Options
- --vcs-ref <vcs_ref>
Git reference to checkout
Arguments
- PROJECT_PATH
Required argument
- TEMPLATE_PATH
Optional argument(s)
deploy
Push and deploy the project to the remote machine
dp deploy [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
base
- --dags-path <dags_path>
Remote storage URI
- --blob-args <blob_args>
Path to JSON or YAML file with arguments that should be passed to your Bucket/blob provider
- --docker-push
Whether to push image to the Docker repository
- --datahub-ingest
Whether to ingest DataHub metadata
- --bi-git-key-path <bi_git_key_path>
Path to the key with write access to repo
- --auth-token <auth_token>
Authorization OIDC ID token for a service account to communicate with cloud services
docs-serve
Generate and serve dbt documentation.
dp docs-serve [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
local
- --port <port>
Port to be used by the ‘dbt docs serve’ command
- Default
9328
generate
Generate additional dbt files
dp generate [OPTIONS] COMMAND [ARGS]...
model-yaml
Generate schema YAML using codegen or dbt-profiler
dp generate model-yaml [OPTIONS] [MODEL_PATH]...
Options
- --env <env>
Name of the environment
- Default
local
- --with-meta
Whether to generate dbt-profiler metadata
- --overwrite
Whether to overwrite existing YAML files
Arguments
- MODEL_PATH
Optional argument(s)
source-sql
Generate SQLs that represent tables in a given dataset
dp generate source-sql [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
local
- --source-yaml-path <source_yaml_path>
Required Path to the ‘source.yml’ schema file
- Default
<project directory>/models/source/source.yml
- --staging-path <staging_path>
Required Path to the ‘staging’ directory
- Default
<project directory>/models/staging
- --overwrite
Whether to overwrite existing SQL files
source-yaml
Generate source YAML using codegen
dp generate source-yaml [OPTIONS] [SCHEMA_NAME]...
Options
- --env <env>
Name of the environment
- Default
local
- --source-path <source_path>
Required Path to the ‘source’ directory
- Default
<project directory>/models/source
- --overwrite
Whether to overwrite an existing YAML file
Arguments
- SCHEMA_NAME
Optional argument(s)
init
Configure the tool for the first time
dp init [OPTIONS] [CONFIG_PATH]...
Arguments
- CONFIG_PATH
Optional argument(s)
prepare-env
Prepare local environment for apps interfacing with dbt
dp prepare-env [OPTIONS]
Options
- --env <env>
Name of the environment
publish
Create a dbt package out of the project
dp publish [OPTIONS]
Options
- --key-path <key_path>
Required Path to the key with write access to repo with published packages
- --env <env>
Required Name of the environment
- Default
base
run
Run the project on the local machine
dp run [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
local
seed
Run ‘dbt seed’
dp seed [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
local
template-list
Print a list of all templates saved in the config file
dp template-list [OPTIONS]
test
Run tests of the project on the local machine
dp test [OPTIONS]
Options
- --env <env>
Name of the environment
- Default
local
update
Update project from its template
dp update [OPTIONS] [PROJECT_PATH]...
Options
- --vcs-ref <vcs_ref>
Git reference to checkout
Arguments
- PROJECT_PATH
Optional argument(s)
API Reference
If you are looking for information on a specific function, class, or method, this part of the documentation is for you.
data_pipelines_cli package
data-pipelines-cli (dp) is a CLI tool designed for data platform.
dp helps data analysts to create, maintain and make full use of their data pipelines.
Subpackages
data_pipelines_cli.cli_commands package
- generate_models_or_sources_from_single_table(env: str, macro_name: str, macro_args: Dict[str, Any], profiles_path: pathlib.Path) Dict[str, Any] [source]
- compile_project(env: str, docker_tag: Optional[str] = None, docker_build: bool = False, docker_build_args: Optional[Dict[str, str]] = None) None [source]
Create local working directories and build artifacts.
- Parameters
env (str) – Name of the environment
docker_tag (Optional[str]) – Image tag of a Docker image to create
docker_build (bool) – Whether to build a Docker image
bi_build – Whether to generate BI code
- Raises
- create(project_path: str, template_path: Optional[str], vcs_ref: str) None [source]
Create a new project using a template.
- Parameters
project_path (str) – Path to a directory to create
template_path (Optional[str]) – Path or URI to the repository of the project template
- Raises
DataPipelinesError – no template found in .dp.yml config file
- class DeployCommand(env: str, docker_push: bool, dags_path: Optional[str], provider_kwargs_dict: Optional[Dict[str, Any]], datahub_ingest: bool, bi_git_key_path: str, auth_token: Optional[str])[source]
Bases:
object
A class used to push and deploy the project to the remote machine.
- auth_token: Optional[str]
Authorization OIDC ID token for a service account to communicate with the Airbyte instance
- bi_git_key_path: str
- blob_address_path: str
URI of the cloud storage to send build artifacts to
- datahub_ingest: bool
Whether to ingest DataHub metadata
- deploy() None [source]
Push and deploy the project to the remote machine.
- Raises
DependencyNotInstalledError – DataHub or Docker not installed
DataPipelinesError – Error while pushing Docker image
- docker_args: Optional[data_pipelines_cli.data_structures.DockerArgs]
Arguments required by the Docker to make a push to the repository. If set to None,
deploy()
will not make a push
- env: str
- provider_kwargs_dict: Dict[str, Any]
Dictionary of arguments required by a specific cloud storage provider, e.g. path to a token, username, password, etc.
- init(config_path: Optional[str]) None [source]
Configure the tool for the first time.
- Parameters
config_path (Optional[str]) – URI of the repository with a template of the config file
- Raises
DataPipelinesError – user does not want to overwrite the existing config file
- prepare_env(env: str) None [source]
Prepare local environment for use with dbt-related applications.
Prepare local environment for use with applications expecting a “traditional” dbt structure, such as plugins to VS Code. If in doubt, use
dp run
anddp test
instead.- Parameters
env (str) – Name of the environment
- create_package() pathlib.Path [source]
Create a dbt package out of the built project.
- Raises
DataPipelinesError – There is no model in ‘manifest.json’ file.
Submodules
data_pipelines_cli.airbyte_utils module
- class AirbyteFactory(airbyte_config_path: pathlib.Path, auth_token: Optional[str])[source]
Bases:
object
A class used to create and update Airbyte connections defined in config yaml file
- airbyte_config_path: pathlib.Path
Path to config yaml file containing connections definitions
- auth_token: Optional[str]
Authorization OIDC ID token for a service account to communicate with the Airbyte instance
data_pipelines_cli.bi_utils module
- bi(env: str, bi_action: data_pipelines_cli.bi_utils.BiAction, key_path: Optional[str] = None) None [source]
Generate and deploy BI codes using dbt compiled data.
- Parameters
env (str) – Name of the environment
bi_action – Action to be run [COMPILE, DEPLOY]
key_path – Path to the key with write access to git repository
- Raises
NotSuppertedBIError – Not supported bi in bi.yml configuration
data_pipelines_cli.cli module
data_pipelines_cli.cli_configs module
data_pipelines_cli.cli_constants module
- DEFAULT_GLOBAL_CONFIG: data_pipelines_cli.data_structures.DataPipelinesConfig = {'templates': {}, 'vars': {}}
Content of the config file created by dp init command if no template path is provided
- IMAGE_TAG_TO_REPLACE: str = '<IMAGE_TAG>'
- PROFILE_NAME_ENV_EXECUTION = 'env_execution'
Name of the dbt target to use for a remote machine
- PROFILE_NAME_LOCAL_ENVIRONMENT = 'local'
Name of the environment and dbt target to use for a local machine
data_pipelines_cli.cli_utils module
- echo_error(text: str, **kwargs: Any) None [source]
Print an error message to stderr using click-specific print function.
- Parameters
text (str) – Message to print
kwargs –
- echo_info(text: str, **kwargs: Any) None [source]
Print a message to stdout using click-specific print function.
- Parameters
text (str) – Message to print
kwargs –
- echo_suberror(text: str, **kwargs: Any) None [source]
Print a suberror message to stderr using click-specific print function.
- Parameters
text (str) – Message to print
kwargs –
- echo_subinfo(text: str, **kwargs: Any) None [source]
Print a subinfo message to stdout using click-specific print function.
- Parameters
text (str) – Message to print
kwargs –
- echo_warning(text: str, **kwargs: Any) None [source]
Print a warning message to stderr using click-specific print function.
- Parameters
text (str) – Message to print
kwargs –
- get_argument_or_environment_variable(argument: Optional[str], argument_name: str, environment_variable_name: str) str [source]
Given argument is not
None
, return its value. Otherwise, search for environment_variable_name amongst environment variables and return it. If such a variable is not set, raiseDataPipelinesError
.- Parameters
argument (Optional[str]) – Optional value passed to the CLI as the argument_name
argument_name (str) – Name of the CLI’s argument
environment_variable_name (str) – Name of the environment variable to search for
- Returns
Value of the argument or specified environment variable
- Raises
DataPipelinesError – argument is
None
and environment_variable_name is not set
- subprocess_run(args: List[str], capture_output: bool = False) subprocess.CompletedProcess[bytes] [source]
Run subprocess and return its state if completed with a success. If not, raise
SubprocessNonZeroExitError
.- Parameters
args (List[str]) – List of strings representing subprocess and its arguments
capture_output (bool) – Whether to capture output of subprocess.
- Returns
State of the completed process
- Return type
subprocess.CompletedProcess[bytes]
- Raises
SubprocessNonZeroExitError – subprocess exited with non-zero exit code
data_pipelines_cli.config_generation module
- class DbtProfile(**kwargs)[source]
Bases:
dict
POD representing dbt’s profiles.yml file.
- outputs: Dict[str, Dict[str, Any]]
Dictionary of a warehouse data and credentials, referenced by target name
- target: str
Name of the target for dbt to run
- copy_config_dir_to_build_dir() None [source]
Recursively copy config directory to build/dag/config working directory.
- copy_dag_dir_to_build_dir() None [source]
Recursively copy dag directory to build/dag working directory.
- generate_profiles_dict(env: str, copy_config_dir: bool) Dict[str, data_pipelines_cli.config_generation.DbtProfile] [source]
Generate and save
profiles.yml
file atbuild/profiles/local
orbuild/profiles/env_execution
, depending on env argument.- Parameters
env (str) – Name of the environment
copy_config_dir (bool) – Whether to copy
config
directory tobuild
working directory
- Returns
Dictionary representing data to be saved in
profiles.yml
- Return type
Dict[str, DbtProfile]
- generate_profiles_yml(env: str, copy_config_dir: bool = True) pathlib.Path [source]
Generate and save
profiles.yml
file atbuild/profiles/local
orbuild/profiles/env_execution
, depending on env argument.- Parameters
env (str) – Name of the environment
copy_config_dir (bool) – Whether to copy
config
directory tobuild
working directory
- Returns
Path to
build/profiles/{env}
- Return type
pathlib.Path
- get_profiles_dir_build_path(env: str) pathlib.Path [source]
Returns path to
build/profiles/<profile_name>/
, depending on env argument.- Parameters
env (str) – Name of the environment
- Returns
- Return type
pathlib.Path
- read_dictionary_from_config_directory(config_path: Union[str, os.PathLike[str]], env: str, file_name: str) Dict[str, Any] [source]
Read dictionaries out of file_name in both base and env directories, and compile them into one. Values from env directory get precedence over base ones.
- Parameters
config_path (Union[str, os.PathLike[str]]) – Path to the config directory
env (str) – Name of the environment
file_name (str) – Name of the YAML file to parse dictionary from
- Returns
Compiled dictionary
- Return type
Dict[str, Any]
data_pipelines_cli.data_structures module
- class DataPipelinesConfig(**kwargs)[source]
Bases:
dict
POD representing .dp.yml config file.
- templates: Dict[str, data_pipelines_cli.data_structures.TemplateConfig]
Dictionary of saved templates to use in dp create command
- vars: Dict[str, str]
Variables to be passed to dbt as –vars argument
- class DbtModel(**kwargs)[source]
Bases:
dict
POD representing a single model from ‘schema.yml’ file.
- columns: List[data_pipelines_cli.data_structures.DbtTableColumn]
- description: str
- identifier: str
- meta: Dict[str, Any]
- name: str
- tags: List[str]
- tests: List[str]
- class DbtSource(**kwargs)[source]
Bases:
dict
POD representing a single source from ‘schema.yml’ file.
- database: str
- description: str
- meta: Dict[str, Any]
- name: str
- schema: str
- tables: List[data_pipelines_cli.data_structures.DbtModel]
- tags: List[str]
- class DbtTableColumn(**kwargs)[source]
Bases:
dict
POD representing a single column from ‘schema.yml’ file.
- description: str
- meta: Dict[str, Any]
- name: str
- quote: bool
- tags: List[str]
- tests: List[str]
- class DockerArgs(env: str, image_tag: Optional[str], build_args: Dict[str, str])[source]
Bases:
object
Arguments required by the Docker to make a push to the repository.
- Raises
DataPipelinesError – repository variable not set or git hash not found
- build_args: Dict[str, str]
- docker_build_tag() str [source]
Prepare a tag for Docker Python API build command.
- Returns
Tag for Docker Python API build command
- Return type
str
- image_tag: str
An image tag
- repository: str
URI of the Docker images repository
- class TemplateConfig(**kwargs)[source]
Bases:
dict
POD representing value referenced in the templates section of the .dp.yml config file.
- template_name: str
Name of the template
- template_path: str
Local path or Git URI to the template repository
- read_env_config() data_pipelines_cli.data_structures.DataPipelinesConfig [source]
Parse .dp.yml config file, if it exists. Otherwise, raises
NoConfigFileError
.- Returns
POD representing .dp.yml config file, if it exists
- Return type
- Raises
NoConfigFileError – .dp.yml file not found
data_pipelines_cli.dbt_utils module
- read_dbt_vars_from_configs(env: str) Dict[str, Any] [source]
Read vars field from dp configuration file (
$HOME/.dp.yml
), basedbt.yml
config (config/base/dbt.yml
) and environment-specific config (config/{env}/dbt.yml
) and compile into one dictionary.- Parameters
env (str) – Name of the environment
- Returns
Dictionary with vars and their keys
- Return type
Dict[str, Any]
- run_dbt_command(command: Tuple[str, ...], env: str, profiles_path: pathlib.Path, log_format_json: bool = False, capture_output: bool = False) subprocess.CompletedProcess[bytes] [source]
Run dbt subprocess in a context of specified env.
- Parameters
command (Tuple[str, ...]) – Tuple representing dbt command and its optional arguments
env (str) – Name of the environment
profiles_path (pathlib.Path) – Path to the directory containing profiles.yml file
log_format_json (bool) – Whether to run dbt command with –log-format=json flag
capture_output (bool) – Whether to capture stdout of subprocess.
- Returns
State of the completed process
- Return type
subprocess.CompletedProcess[bytes]
- Raises
SubprocessNotFound – dbt not installed
SubprocessNonZeroExitError – dbt exited with error
data_pipelines_cli.docker_response_reader module
- class DockerReadResponse(msg: str, is_error: bool)[source]
Bases:
object
POD representing Docker response processed by
DockerResponseReader
.- is_error: bool
Whether response is error or not
- msg: str
Read and processed message
- class DockerResponseReader(logs_generator: Iterable[Union[str, Dict[str, Union[str, Dict[str, str]]]]])[source]
Bases:
object
Read and process Docker response.
Docker response turns into processed strings instead of plain dictionaries.
- cached_read_response: Optional[List[data_pipelines_cli.docker_response_reader.DockerReadResponse]]
Internal cache of already processed response
- click_echo_ok_responses() None [source]
Read, process and print positive Docker updates.
- Raises
DockerErrorResponseError – Came across error update in Docker response.
- logs_generator: Iterable[Union[str, Dict[str, Union[str, Dict[str, str]]]]]
Iterable representing Docker response
- read_response() List[data_pipelines_cli.docker_response_reader.DockerReadResponse] [source]
Read and process Docker response.
- Returns
List of processed lines of response
- Return type
List[DockerReadResponse]
data_pipelines_cli.errors module
- exception AirflowDagsPathKeyError[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if there is no
dags_path
in airflow.yml file.- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception DataPipelinesError(message: str, submessage: Optional[str] = None)[source]
Bases:
Exception
Base class for all exceptions in data_pipelines_cli module
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception DependencyNotInstalledError(program_name: str)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if certain dependency is not installed
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception DockerErrorResponseError(error_msg: str)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if there is an error response from Docker client.
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception DockerNotInstalledError[source]
Bases:
data_pipelines_cli.errors.DependencyNotInstalledError
Exception raised if ‘docker’ is not installed
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception JinjaVarKeyError(key: str)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception NoConfigFileError[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if .dp.yml does not exist
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception NotAProjectDirectoryError(project_path: str)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if .copier-answers.yml file does not exist in given dir
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception NotSuppertedBIError[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if there is no
target_id
in bi.yml- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception SubprocessNonZeroExitError(subprocess_name: str, exit_code: int, subprocess_output: Optional[str] = None)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if subprocess exits with non-zero exit code
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
- exception SubprocessNotFound(subprocess_name: str)[source]
Bases:
data_pipelines_cli.errors.DataPipelinesError
Exception raised if subprocess cannot be found
- message: str
explanation of the error
- submessage: Optional[str]
additional information about the error
data_pipelines_cli.filesystem_utils module
- class LocalRemoteSync(local_path: Union[str, os.PathLike[str]], remote_path: str, remote_kwargs: Dict[str, str])[source]
Bases:
object
Synchronizes local directory with a cloud storage’s one.
- local_fs: fsspec.spec.AbstractFileSystem
FS representing local directory
- local_path_str: str
Path to local directory
- remote_path_str: str
Path/URI of the cloud storage directory
data_pipelines_cli.io_utils module
- git_revision_hash() Optional[str] [source]
Get current Git revision hash, if Git is installed and any revision exists.
- Returns
Git revision hash, if possible.
- Return type
Optional[str]
- replace(filename: Union[str, os.PathLike[str]], pattern: str, replacement: str) None [source]
Perform the pure-Python equivalent of in-place sed substitution: e.g.,
sed -i -e 's/'${pattern}'/'${replacement}' "${filename}"
.Beware however, it uses Python regex dialect instead of sed’s one. It can introduce regex-related bugs.
data_pipelines_cli.jinja module
- replace_vars_with_values(templated_dictionary: Dict[str, Any], dbt_vars: Dict[str, Any]) Dict[str, Any] [source]
Replace variables in given dictionary using Jinja template in its values.
- Parameters
templated_dictionary (Dict[str, Any]) – Dictionary with Jinja-templated values
dbt_vars (Dict[str, Any]) – Variables to replace
- Returns
Dictionary with replaced variables
- Return type
Dict[str, Any]
- Raises
JinjaVarKeyError – Variable referenced in Jinja template does not exist
data_pipelines_cli.looker_utils module
data_pipelines_cli.vcs_utils module
Utilities related to VCS.
- add_suffix_to_git_template_path(template_path: str) str [source]
Add
.git
suffix to template_path, if necessary.Check if template_path starts with Git-specific prefix (e.g. git://), or http:// or https:// protocol. If so, then add
.git
suffix if not present. Does nothing otherwise (as template_path probably points to a local directory).- Parameters
template_path (str) – Path or URI to Git-based repository
- Returns
template_path with
.git
as suffix, if necessary- Return type
str
Changelog
Unreleased
0.24.1 - 2023-03-15
0.24.0 - 2022-12-16
Airbyte integration
dp deploy is able to add / update connections on an Airbyte instance
dp deploy is able to create a DAG at the beginning of dbt builds that will execute ingestion tasks
dp deploy accepts an additional attribute auth-token that can be used to authorize access to cloud services
Bump packages
0.23.0 - 2022-10-19
0.22.1 - 2022-10-11
Looker integration
dp compile is able to generate a LookML project for Looker
dp deploy is able to publish LookML code in Looker's repository and deploy the project.
0.22.0 - 2022-08-22
dp compile
default environment has been set to local
GitPython is not required anymore
Installation documentation upgrade
0.21.0 - 2022-07-19
Documentation improvements
0.20.1 - 2022-06-17
Fixed
dp seed, dp run and dp test no longer fail when we are not using a git repository.
0.20.0 - 2022-05-04
--docker-args
has been added todp compile
0.19.0 - 2022-04-25
Added
dp seed
command acting as a wrapper fordbt seed
.
0.18.0 - 2022-04-19
Added
dp docs-serve
command acting as a wrapper fordbt docs serve
.
0.17.0 - 2022-04-11
Added
pip install data-pipelines-cli[ADAPTER_PROVIDER]
installs adapter alongside dbt-core, e.g.pip install data-pipelines-cli[bigquery]
.
Changed
dp compile
accepts additional command line argument--docker-tag
, allowing for custom Docker tag instead of relying on Git commit SHA. Moreover, if--docker-tag
is not provided, dp searches for tag inbuild/dag/config/<ENV>/execution_env.yml
. If it is present instead of<IMAGE_TAG>
to be replaced, dp chooses it over Git commit SHA.
0.16.0 - 2022-03-24
Added
dp generate source-yaml
anddp generate model-yaml
commands that automatically generate YAML schema files for project’s sources or models, respectively (using dbt-codegen or dbt-profiler under the hood).dp generate source-sql
command that generates SQL representing sources listed insource.yml
(or a similar file) (again, with the help of dbt-codegen).
0.15.2 - 2022-02-28
Changed
Bumped
dbt
to 1.0.3.
0.15.1 - 2022-02-28
Fixed
Pinned
MarkupSafe==2.0.1
to ensure that Jinja works.
0.15.0 - 2022-02-11
Migration to dbt 1.0.1
0.14.0 - 2022-02-02
0.13.0 - 2022-02-01
0.12.0 - 2022-01-31
dp publish
will push generated sources to external git repo
0.11.0 - 2022-01-18
Added
dp update
commanddp publish
command for creation of dbt package out of the project.
Changed
Docker response in
deploy
andcompile
gets printed as processed strings instead of plain dictionaries.dp compile
parses content ofdatahub.yml
and replaces Jinja variables in the form ofvar
orenv_var
.dags_path
is read from an envedairflow.yml
file.
0.10.0 - 2022-01-12
Changed
Run
dbt deps
at the end ofdp prepare-env
.
Fixed
dp run
anddp test
are no longer pointing toprofiles.yml
instead of the directory containing it.
0.9.0 - 2022-01-03
Added
--env
flag todp deploy
.
Changed
Docker repository URI gets read out of
build/config/{env}/k8s.yml
.
Removed
--docker-repository-uri
and--datahub-gms-uri
fromdp compile
anddp deploy
commands.dp compile
no longer replaces<INGEST_ENDPOINT>
indatahub.yml
, or<DOCKER_REPOSITORY_URL>
ink8s.yml
0.8.0 - 2021-12-31
Changed
dp init
anddp create
automatically adds.git
suffix to given template paths, if necessary.When reading dbt variables, global-scoped variables take precedence over project-scoped ones (it was another way around before).
Address argument for
dp deploy
is no longer mandatory. It should be either placed inairflow.yml
file as value ofdags_path
key, or provided with--dags-path
flag.
0.7.0 - 2021-12-29
Added
Add documentation in the style of Read the Docs.
Exception classes in
errors.py
, deriving fromDataPipelinesError
base exception class.Unit tests to massively improve code coverage.
--version
flag to dp command.Add
dp prepare-env
command that prepares local environment for standalone dbt (right now, it only generates and savesprofiles.yml
in$HOME/.dbt
).
Changed
dp compile
:--env
option has a default value:base
,--datahub
is changed to--datahub-gms-uri
,--repository
is changed to--docker-repository-uri
.
dp deploy
’s--docker-push
is not a flag anymore and requires a Docker repository URI parameter;--repository
got removed then.dp run
anddp test
rundp compile
before actual dbt command.Functions raise exceptions instead of exiting using
sys.exit(1)
;cli.cli()
entrypoint is expecting exception and exits only there.dp deploy
raises an exception if there is no Docker image to push orbuild/config/dag
directory does not exist.Rename
gcp
togcs
in requirements (now one should runpip install data-pipelines-cli[gcs]
).
0.6.0 - 2021-12-16
Modified
dp saves generated
profiles.yml
in eitherbuild/local
orbuild/env_execution
directories. dbt gets executed withenv_execution
as the target.
0.5.1 - 2021-12-14
Fixed
_dbt_compile
is no longer removing replaced<IMAGE_TAG>
.
0.5.0 - 2021-12-14
Added
echo_warning
function prints warning messages in yellow/orange color.
Modified
Docker image gets built at the end of
compile
command.dbt-related commands do not fail if no
$HOME/.dp.yml
exists (e.g.,dp run
).
Removed
Dropped
dbt-airflow-manifest-parser
dependency.
0.4.0 - 2021-12-13
Added
dp run
anddp test
commands.dp clean
command for removingbuild
andtarget
directories.File synchronization tests for Google Cloud Storage using
gcp-storage-emulator
.Read vars from config files (
$HOME/.dp.yml
,config/$ENV/dbt.yml
) and pass todbt
.
Modified
profiles.yml
gets generated and saved inbuild
directory indp compile
, instead of relying on a local one in the main project directory.dp dbt <command>
generatesprofiles.yml
inbuild
directory by default.dp init
is expectingconfig_path
argument to download config template with the help of thecopier
and save it in$HOME/.dp.yml
.dp template list
is renamed asdp template-list
.dp create
allows for providing extra argument calledtemplate-path
, being either name of one of templates defined in.dp.yml
config file or direct link to Git repository.
Removed
Support for manually created
profiles.yml
in main project directory.dp template new
command.username
field from$HOME/.dp.yml
file.
0.3.0 - 2021-12-06
Run
dbt deps
alongside rest ofdbt
commands indp compile
0.2.0 - 2021-12-03
Add support for GCP and S3 syncing in
dp deploy
0.1.2 - 2021-12-02
Fix: do not use styled
click.secho
for Docker push response, as it may not be astr
0.1.1 - 2021-12-01
Fix Docker SDK for Python’s bug related to tagging, which prevented Docker from pushing images.
0.1.0 - 2021-12-01
Added
Draft of
dp init
,dp create
,dp template new
,dp template list
anddp dbt
Draft of
dp compile
anddp deploy