Quickstart
You can follow the written quickstart below or watch the video walkthrough on YouTube.
Before you start, make sure that you have the following:
AWS CLI installed
AWS SageMaker domain
SageMaker Execution role ARN (in the form arn:aws:iam::<ID>:role/service-role/AmazonSageMaker-ExecutionRole-<NUMBERS>). If you don’t have one, follow the [official AWS docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role).
S3 bucket to which the above role has read/write access
Docker installed
Amazon Elastic Container Registry (Amazon ECR) repository to which the above role has read access and you have write access
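As a quick sanity check before going further, the role ARN can be compared against the form quoted in the list above. This is only a minimal sketch: the helper name `looks_like_execution_role_arn` is hypothetical, and the pattern assumes a 12-digit account ID and an alphanumeric suffix, so roles that do not follow the default naming will not match even if they are valid.

```python
import re

# Assumed pattern, derived from the form quoted in the prerequisites:
# arn:aws:iam::<ID>:role/service-role/AmazonSageMaker-ExecutionRole-<NUMBERS>
_ROLE_ARN = re.compile(
    r"arn:aws:iam::\d{12}:role/service-role/"
    r"AmazonSageMaker-ExecutionRole-[0-9A-Za-z]+"
)


def looks_like_execution_role_arn(arn: str) -> bool:
    """Return True if the ARN matches the default execution-role form."""
    return _ROLE_ARN.fullmatch(arn) is not None
```

A failed check does not necessarily mean the role is unusable; it only means the ARN differs from the default form described above.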
Prepare a new virtual environment with Python >=3.8 and install the packages:
pip install "kedro>=0.18.3,<0.19" "kedro-sagemaker"
Create a new project (e.g. from a starter). Make sure you don’t name it kedro-sagemaker, because that would override the plugin’s Python module name.
kedro new --starter=spaceflights
Project Name
============
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.
[Spaceflights]: kedro_sagemaker_demo
The project name 'kedro_sagemaker_demo' has been applied to:
- The project title in /Users/marcin/Dev/tmp/kedro-sagemaker-demo/README.md
- The folder created for your project in /Users/marcin/Dev/tmp/kedro-sagemaker-demo
- The project's python package in /Users/marcin/Dev/tmp/kedro-sagemaker-demo/src/kedro_sagemaker_demo
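The naming warning above can be turned into a small check. This is a sketch under stated assumptions: the helper names are hypothetical, and the normalization (lowercasing, turning spaces and hyphens into underscores) only approximates how a project name becomes a package name; it is not Kedro's own logic.

```python
import re

# The plugin's Python module name, which a project package must not reuse.
RESERVED = {"kedro_sagemaker"}


def to_package_name(project_name: str) -> str:
    """Approximate a package name: lowercase, spaces/hyphens to underscores."""
    return re.sub(r"[\s\-]+", "_", project_name.strip().lower())


def is_safe_project_name(project_name: str) -> bool:
    """Return False if the derived package name collides with the plugin."""
    return to_package_name(project_name) not in RESERVED
```

With this sketch, `kedro_sagemaker_demo` passes, while `kedro-sagemaker` is rejected.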
Go to the project’s directory:
cd kedro-sagemaker-demo
Add kedro-sagemaker to src/requirements.txt.
(optional) Remove kedro-telemetry from src/requirements.txt or configure it appropriately (https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry).
Install the requirements:
pip install -r src/requirements.txt
Initialize the Kedro SageMaker plugin. Provide the name of the S3 bucket and the full ARN of the SageMaker Execution role (which should also have access to the S3 bucket). For DOCKER_IMAGE, use the full name of the ECR repository to which you want to push your Docker image.
# Usage: kedro sagemaker init [OPTIONS] BUCKET EXECUTION_ROLE DOCKER_IMAGE
kedro sagemaker init <bucket-name> <role-arn> <ecr-image-uri>
The init command will automatically create:
conf/base/sagemaker.yml configuration file, which controls this plugin’s behaviour
Dockerfile and .dockerignore files pre-configured to work with Amazon SageMaker
Adjust the Data Catalog. The default one stores all data locally, whereas the plugin will automatically use S3, so only the input data needs to be read locally. The final conf/base/catalog.yml should look like this:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  layer: raw
reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv
  layer: raw
shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  layer: raw
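Trimming the catalog down to its input entries can also be expressed programmatically. The sketch below works on a plain dictionary rather than on the YAML file, and it assumes that input data is exactly the set of entries with `layer: raw`; the helper name `keep_raw_inputs` and the `preprocessed_companies` entry are illustrative only.

```python
def keep_raw_inputs(catalog: dict) -> dict:
    """Return only the catalog entries whose layer is "raw"."""
    return {
        name: entry
        for name, entry in catalog.items()
        if entry.get("layer") == "raw"
    }


# Illustrative catalog; "preprocessed_companies" is a made-up non-raw entry.
full_catalog = {
    "companies": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/companies.csv",
        "layer": "raw",
    },
    "preprocessed_companies": {
        "type": "pandas.ParquetDataSet",
        "filepath": "data/02_intermediate/preprocessed_companies.pq",
        "layer": "intermediate",
    },
}

trimmed = keep_raw_inputs(full_catalog)
```

Here `trimmed` contains only the `companies` entry, which mirrors the shape of the catalog shown above.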
(optional) Log in to ECR if you have not logged in before. You can run the following snippet in the terminal (adjust the region to match your configuration).
REGION=eu-central-1; aws ecr get-login-password --region "$REGION" | docker login --username AWS --password-stdin "<AWS account ID>.dkr.ecr.${REGION}.amazonaws.com"
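The registry hostname passed to docker login follows a fixed pattern built from the numeric AWS ID and the region. A minimal sketch of assembling it is shown below; the function name `ecr_registry` is hypothetical, and the 12-digit check is a basic guard rather than a full validation.

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Build the ECR registry hostname used in the login command."""
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("expected a 12-digit numeric AWS ID")
    if not region:
        raise ValueError("region must not be empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"
```

The full image URI given to the init command is this hostname followed by a slash and the repository name.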
Run your Kedro project on AWS SageMaker Pipelines with a single command:
kedro sagemaker run --auto-build -y
This command will first build the Docker image with your project, push it to the configured ECR repository, and then run the pipeline in the AWS SageMaker Pipelines service.
Finally, you will see logs similar to the following in your terminal:
Pipeline ARN: arn:aws:sagemaker:eu-central-1:781336771001:pipeline/kedro-sagemaker-pipeline
Pipeline started successfully
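If you want to pick the region or pipeline name out of the logged ARN (for example, to locate the pipeline in the console), the ARN can be split on its colons. This is a sketch, not part of the plugin: the names `PipelineArn` and `parse_pipeline_arn` are hypothetical, and the code assumes the standard `aws` partition as in the log line above.

```python
from typing import NamedTuple


class PipelineArn(NamedTuple):
    region: str
    account_id: str
    pipeline_name: str


def parse_pipeline_arn(arn: str) -> PipelineArn:
    """Split a SageMaker pipeline ARN into region, account ID and name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sagemaker"]:
        raise ValueError(f"not a SageMaker ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "pipeline" or not name:
        raise ValueError(f"not a pipeline ARN: {arn!r}")
    return PipelineArn(region=parts[3], account_id=parts[4], pipeline_name=name)
```

Applied to the ARN from the example log, this yields the region `eu-central-1` and the pipeline name `kedro-sagemaker-pipeline`.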
Additionally, if you have the kedro-mlflow plugin installed, an additional node called start-mlflow-run will appear on the execution graph. Its job is to log the SageMaker Pipeline Execution ARN (so that you can link MLflow runs with runs in SageMaker) and to make sure that all nodes use a common MLflow run.