TLDR

  • Add submodule using:
git submodule add git@gitlab.company.com:path/repository.git desired_name
  • Adjust .gitmodules in the root of the repository so that the url is a path relative to the parent repository, e.g.:
[submodule "dags/dag_name/repository"]
path = dags/dag_name/repository
url = ../repository.git
  • In the CI/CD pipeline, set a variable so that the submodule is actually downloaded when the Airflow repository is checked out:
variables:
  GIT_SUBMODULE_STRATEGY: normal

Story

At Kiwi.com, I have been taking care of a couple of feed generators. Simply put, these are repositories with a Python script that downloads data from BigQuery and runs an optimization procedure in order to create a feed (a file containing, for example, the content that is most likely to be sold). At first, the script was run only occasionally, but soon we needed it to run every day. The easiest way to do this is to run it via Airflow, which had already been adopted company-wide. The most common way to run a script daily via PythonOperator is to define the function directly in the DAG file. Another way is to define it as a plugin. However, the repository with the script is quite big and complex, and there was an additional requirement that it must still be runnable locally. Ideally, it should stay in a separate repository and not be merged into the Airflow repository. Thus, I figured that we could import it as a submodule. That way, it stays in a separate repository, and within the DAG we simply import the main function, which is then triggered daily via PythonOperator.

The steps to do this are quite simple. First, add the submodule in the DAG folder using:

git submodule add git@gitlab.company.com:path/repository.git desired_name

This checks out the submodule in the DAG folder, stores its Git data under .git/modules/, and adds an entry to .gitmodules. The .gitmodules file will look like this:

[submodule "dags/dag_name/repository"]
path = dags/dag_name/repository
url = git@gitlab.company.com:path/repository.git

The second step is to modify this .gitmodules file. If we kept the absolute URL, the Airflow pipeline would not be able to download the submodule, because it would try to access the repository as if from outside the company. Instead, define the url as a path relative to the parent repository, like this:

[submodule "dags/dag_name/repository"]
path = dags/dag_name/repository
url = ../repository.git

This way, Git resolves the URL relative to the Airflow repository's own remote, so the request stays within the company's GitLab and the submodule can be downloaded.
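To see why the relative URL helps, here is a rough Python sketch of how such a URL is resolved against the parent repository's remote. Git does this internally; the function name resolve_submodule_url, the parent repository name airflow.git, and the TOKEN placeholder are hypothetical and used only for illustration, and the logic is simplified compared to what Git really does.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def resolve_submodule_url(parent_url: str, relative_url: str) -> str:
    """Simplified illustration of resolving a relative submodule URL
    against the parent repository's remote URL."""
    if "://" in parent_url:
        # URL-style remote, e.g. https://host/path/airflow.git
        parts = urlsplit(parent_url)
        new_path = posixpath.normpath(posixpath.join(parts.path, relative_url))
        # Scheme and host (including any credentials) are inherited from the parent.
        return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))
    # scp-style remote, e.g. git@host:path/airflow.git
    host, _, path = parent_url.partition(":")
    new_path = posixpath.normpath(posixpath.join(path, relative_url))
    return f"{host}:{new_path}"


# Cloned over SSH on a developer machine (hypothetical parent repository name):
print(resolve_submodule_url("git@gitlab.company.com:path/airflow.git", "../repository.git"))
# git@gitlab.company.com:path/repository.git

# Cloned over HTTPS with a token, as a CI runner might do (TOKEN is a placeholder):
print(resolve_submodule_url(
    "https://gitlab-ci-token:TOKEN@gitlab.company.com/path/airflow.git",
    "../repository.git",
))
# https://gitlab-ci-token:TOKEN@gitlab.company.com/path/repository.git
```

The point of the sketch is that the submodule inherits the protocol, host, and credentials of whatever URL the parent repository was cloned from, which is presumably why the relative form works in the pipeline while the hard-coded SSH URL does not.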

Finally, we need to modify the CI/CD pipeline, because by default it does not download submodules. We need to set a variable to make it do so:

variables:
  GIT_SUBMODULE_STRATEGY: normal

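When the submodule is checked out next to the DAG file, I use:

import sys
from pathlib import Path

# The submodule lives in a "repository" folder next to this DAG file.
additional_path = str((Path(__file__).parent / "repository").absolute())
# sys.path holds strings, so compare the string form to avoid duplicates.
if additional_path not in sys.path:
    sys.path.append(additional_path)
print(sys.path)  # optional, for debugging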

in order to change the Python path, so that I am able to simply import the generate_feed function e.g. like this:

from generate_feed import generate_feed
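If several DAGs rely on submodules, the path handling and the import can be wrapped in one small helper. The following is only a sketch under assumptions: the helper name import_from_submodule is hypothetical, and it assumes the submodule folder sits next to the DAG file and exposes a module called generate_feed with a function of the same name, as in the snippets above.

```python
import importlib
import sys
from pathlib import Path


def import_from_submodule(dag_file: str, submodule_dir: str, module_name: str, attr: str):
    """Make a submodule checked out next to a DAG file importable
    and return one attribute (e.g. the main function) from it."""
    submodule_path = str(Path(dag_file).resolve().parent / submodule_dir)
    # sys.path contains strings, so the membership check uses the string form.
    if submodule_path not in sys.path:
        sys.path.append(submodule_path)
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Inside a DAG file this could be used as (names taken from the example above):
# generate_feed = import_from_submodule(__file__, "repository", "generate_feed", "generate_feed")
```

The returned function can then be passed to PythonOperator as its python_callable, just like a function defined directly in the DAG file.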