TLDR
- Add submodule using:
git submodule add git@gitlab.company.com:path/repository.git desired_name
- Adjust .gitmodules in the root of the repository so that the url uses a relative path to the submodule's repository, e.g.:
[submodule "dags/dag_name/repository"]
    path = dags/dag_name/repository
    url = ../repository.git
- In the CI/CD pipeline set a variable so that Airflow actually downloads the submodule:
variables:
  GIT_SUBMODULE_STRATEGY: normal
Story
At Kiwi.com, I have been taking care of a couple of feed generators. Simply put, these are repositories with a Python script that downloads data from BigQuery and runs an optimization procedure to create a feed (a file with, e.g., the content that is most likely to be sold). At first, the script was run only once in a while, but soon we needed it to run every day. The easiest way to do this is to run it via Airflow, which has already been adopted company-wide. The most common way to run a script daily via PythonOperator is to define the function directly in the DAG. Another way is to define it as a plugin. However, the repository with the script is quite big and complex, and there was an additional requirement that it should still be runnable locally. Ideally, it should stay in a separate repository and not be merged into the Airflow repository. Thus, I figured that we could import it as a submodule. That way, it stays in a separate repository, and within the DAG we simply import the main function, which is then triggered daily via PythonOperator.
The steps to do this are quite simple. First, add the submodule to the DAG folder using:
git submodule add git@gitlab.company.com:path/repository.git desired_name
This adds the submodule to the DAG folder, stores its Git data under .git/modules/, and modifies .gitmodules. The .gitmodules file will look like this:
[submodule "dags/dag_name/repository"]
    path = dags/dag_name/repository
    url = git@gitlab.company.com:path/repository.git
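To check which URLs are currently recorded, the file can also be read programmatically. The following is a minimal sketch that uses Python's configparser on the example content above; list_submodules is a hypothetical helper name, and configparser only approximates Git's config syntax, so running git config -f .gitmodules --list remains the authoritative way to inspect the file.

```python
import configparser

# Example content, copied from the .gitmodules file shown above.
GITMODULES_TEXT = """\
[submodule "dags/dag_name/repository"]
    path = dags/dag_name/repository
    url = git@gitlab.company.com:path/repository.git
"""


def list_submodules(gitmodules_text: str) -> list[dict[str, str]]:
    """Return name, path and url of every submodule section (hypothetical helper)."""
    # Only "=" separates keys from values here, and "%" must not be interpolated.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(gitmodules_text)

    submodules = []
    for section in parser.sections():
        # Section names look like: submodule "dags/dag_name/repository"
        if not section.startswith("submodule "):
            continue
        name = section[len("submodule "):].strip('"')
        submodules.append(
            {
                "name": name,
                "path": parser[section]["path"],
                "url": parser[section]["url"],
            }
        )
    return submodules


submodules = list_submodules(GITMODULES_TEXT)
for submodule in submodules:
    print(submodule["path"], "->", submodule["url"])
```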
The second step is to modify this .gitmodules file.
If we kept it like this, Airflow would not be able to download the submodule.
You can think of it as the pipeline trying to reach the repository as if from outside the company.
Instead, define the url as a relative path:
[submodule "dags/dag_name/repository"]
    path = dags/dag_name/repository
    url = ../repository.git
This way, the URL is resolved relative to the parent repository on the same server, and the pipeline is able to download the submodule.
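To illustrate what "relative" means here, the sketch below mimics in a simplified way how a ../ URL is combined with the URL the parent repository was cloned from. It is not Git's actual implementation: resolve_submodule_url is a hypothetical helper, the parent repository name airflow.git is made up for the example, and the HTTPS variant assumes that the CI job clones the parent over HTTPS, which is an assumption about the pipeline setup.

```python
def resolve_submodule_url(parent_url: str, relative_url: str) -> str:
    """Simplified approximation of how a relative submodule URL is resolved.

    Every leading "../" removes the last path component of the parent URL,
    and the rest of the relative URL is appended to what remains.
    """
    base = parent_url.rstrip("/")
    rest = relative_url
    while rest.startswith("../"):
        rest = rest[len("../"):]
        base, separator, _ = base.rpartition("/")
        if not separator:
            raise ValueError(f"cannot go above the root of {parent_url!r}")
    return f"{base}/{rest}"


# Parent cloned over SSH (the repository name airflow.git is hypothetical).
ssh_url = resolve_submodule_url(
    "git@gitlab.company.com:path/airflow.git", "../repository.git"
)
print(ssh_url)  # -> git@gitlab.company.com:path/repository.git

# Parent cloned over HTTPS: the submodule follows the same protocol and host.
https_url = resolve_submodule_url(
    "https://gitlab.company.com/path/airflow.git", "../repository.git"
)
print(https_url)  # -> https://gitlab.company.com/path/repository.git
```

The point of the sketch is that the submodule inherits the host and protocol of the parent clone, so whatever access the pipeline already has to the parent repository is reused for the submodule.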
Finally, we need to modify the CI/CD pipeline because by default it does not download submodules. Setting the following variable enables it:
variables:
  GIT_SUBMODULE_STRATEGY: normal
When the submodule sits next to the DAG file, I use:
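import sys
from pathlib import Path

# Folder of the submodule, located next to this DAG file
additional_path = str((Path(__file__).parent / "repository").absolute())
# Compare strings: sys.path holds str entries, so a Path object would never match
if additional_path not in sys.path:
    sys.path.append(additional_path)
print(sys.path)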
in order to change the Python path, so that I am able to simply import the generate_feed function, e.g. like this:
from generate_feed import generate_feed
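To see the mechanism in isolation, without Airflow or the real repository, the following sketch builds a throwaway folder that stands in for the submodule, adds it to the Python path, and imports from it. The helper add_to_python_path and the body of the dummy generate_feed function are made up for illustration; only the folder name repository and the module name generate_feed follow the text above.

```python
import sys
import tempfile
from pathlib import Path


def add_to_python_path(folder: Path) -> str:
    """Append the folder to sys.path once and return the entry that was used."""
    entry = str(folder.absolute())
    if entry not in sys.path:
        sys.path.append(entry)
    return entry


# Stand-in for the submodule: a temporary "repository" folder with a dummy module.
workdir = Path(tempfile.mkdtemp())
repository = workdir / "repository"
repository.mkdir()
(repository / "generate_feed.py").write_text(
    "def generate_feed():\n"
    "    return 'feed generated'\n"
)

entry = add_to_python_path(repository)
add_to_python_path(repository)  # calling it again does not add a duplicate

# The import now works because the folder is on the Python path.
from generate_feed import generate_feed

result = generate_feed()
print(result)
```

Because the entry is compared as a string, re-parsing the DAG file does not keep growing sys.path.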