Amazon SageMaker & AWS LakeFormation Integration - Vend Temporary Credentials to read data from S3 in a Pandas and Spark dataframe
This example demonstrates how to read data from Amazon S3 using temporary credentials vended by LakeFormation, without the execution role itself having any direct access to S3. The vended credentials are used to read data from S3 into a Pandas dataframe and a Spark dataframe.
This solution contains a utility function that invokes LakeFormation APIs to grant permissions and vend temporary credentials for a table registered with the Glue Data Catalog. The utility function is invoked from a SageMaker Studio notebook to get the temporary credentials, which are then used to read data from S3. Note that the temporary credentials are retrieved for a separate, application-specific role that represents the application that wants to read the data. The application in this example is the SageMaker Studio notebook, and it asks LakeFormation (through the credential vending utility function) for credentials so that it can read the data. It is important to note that the application itself has NO access to read the data from S3 and depends on the temporary vended credentials to read the data, thus enforcing fine-grained access control through LakeFormation.
This solution requires that you have SageMaker Studio setup in your account and have the necessary IAM permissions to setup policies for LakeFormation and S3 access as described in the next section.
This solution requires LakeFormation, Glue and IAM role setup. Each of these is described in the sections below.
There are two IAM roles involved in this solution: a SageMaker Execution Role and an Application Role. The SageMaker Execution Role runs the notebook and has the access required to vend temporary credentials to the Application Role. The Application Role, as the name suggests, is tied to the application (in this case the SageMaker notebook); it has no access to read data from S3 on its own, but can be assigned temporary credentials by the SageMaker Execution Role that allow it to temporarily read data from S3.
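The hop from the SageMaker Execution Role to the Application Role is a standard STS assume-role call with a session tag. A minimal sketch with boto3 follows; the role ARN, session name, and tag value are illustrative placeholders, not values from this repo:

```python
def build_assume_role_request(application_role_arn: str, tag_value: str) -> dict:
    """Build the STS AssumeRole request sent by the SageMaker Execution Role.

    The LakeFormationAuthorizedCaller session tag must match the value
    allowed by the Application Role's trust policy.
    """
    return {
        "RoleArn": application_role_arn,
        "RoleSessionName": "lf-credential-vending",  # placeholder session name
        "Tags": [{"Key": "LakeFormationAuthorizedCaller", "Value": tag_value}],
    }


def assume_application_role(application_role_arn: str, tag_value: str) -> dict:
    """Assume the Application Role under the Execution Role's credentials."""
    import boto3  # deferred so the sketch is importable without AWS access

    sts = boto3.client("sts")
    response = sts.assume_role(
        **build_assume_role_request(application_role_arn, tag_value))
    # Contains AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return response["Credentials"]
```

If the session tag is missing or does not match the trust policy condition, the AssumeRole call is denied.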
- Create an inline policy with LakeFormation permissions as shown below and attach it to the SageMaker Execution Role.

  {
      "Version": "2012-10-17",
      "Statement": {
          "Effect": "Allow",
          "Action": [
              "lakeformation:GetDataAccess",
              "lakeformation:GetDataLakeSettings",
              "lakeformation:GrantPermissions",
              "glue:GetTable",
              "glue:CreateTable",
              "glue:GetTables",
              "glue:GetDatabase",
              "glue:GetDatabases",
              "glue:CreateDatabase",
              "glue:GetUserDefinedFunction",
              "glue:GetUserDefinedFunctions",
              "glue:GetPartition",
              "glue:GetPartitions"
          ],
          "Resource": "*"
      }
  }

- Add the AmazonSageMakerFullAccess permission (this might already be there).

- Add an inline policy to allow sts:AssumeRole and sts:TagSession on the Application Role that gets fine-grained access control over the table in the Glue data catalog.

  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "",
              "Effect": "Allow",
              "Action": [
                  "sts:AssumeRole",
                  "sts:TagSession"
              ],
              "Resource": [
                  "arn:aws:iam::<your-account-id>:role/<application-role>"
              ]
          }
      ]
  }
Attach the following trust policy to the Application Role so that it can be assumed with the required session tag:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "sagemaker.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
{
"Sid": "AllowAssumeRoleAndPassSessionTag",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account-id>:root"
},
"Action": [
"sts:AssumeRole",
"sts:TagSession"
],
"Condition": {
"StringLike": {
"aws:RequestTag/LakeFormationAuthorizedCaller":"<your-session-value-tag>"
}
}
},
{
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
{
"Sid": "AllowPassSessionTags",
"Effect": "Allow",
"Principal": {
"AWS": "<application-role>"
},
"Action": "sts:TagSession",
"Condition": {
"StringLike": {
"aws:RequestTag/LakeFormationAuthorizedCaller": "<your-session-value-tag>"
}
}
}
]
}
On the LakeFormation administration page make the SageMaker Execution Role a data lake admin as shown in the screenshot below.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowAssumeRoleAndPassSessionTag",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account-id>:root"
},
"Action": [
"sts:AssumeRole",
"sts:TagSession"
]
},
{
"Effect": "Allow",
"Principal": {
"Service": "lakeformation.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
{
"Sid": "AllowPassSessionTags",
"Effect": "Allow",
"Principal": {
"AWS": "<lf-role-to-be-assumed>"
},
"Action": "sts:TagSession",
"Condition": {
"StringLike": {
"aws:RequestTag/LakeFormationAuthorizedCaller":"<your-session-value-tag>"
}
}
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"glue:GetTable",
"glue:GetTables",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:CreateDatabase",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:GetPartition",
"glue:GetPartitions"
],
"Resource": "*"
}
]
}
The sagemaker-lf-credential-vending.ipynb notebook creates a Glue data catalog table and also places a file in S3 to hydrate the table. If you already have data in S3 that you would like to read as a Glue data catalog table, you can do so by updating the DATABASE_NAME, DATABASE_S3_LOCATION and TABLE_NAME variables in the sagemaker-lf-credential-vending.ipynb notebook.
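The table creation step the notebook performs can be sketched with boto3's create_database and create_table APIs. This is a hedged sketch, not the notebook's actual code; the database, table, location, and column names below are illustrative placeholders:

```python
def build_table_input(table_name: str, s3_location: str, columns: list) -> dict:
    """TableInput for a CSV-backed external table; columns is [(name, type), ...]."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_database_and_table(database_name, table_name, s3_location, columns):
    import boto3  # deferred so the sketch is importable without AWS access

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database_name})
    glue.create_table(
        DatabaseName=database_name,
        TableInput=build_table_input(table_name, s3_location, columns))
```

The S3 location in the TableInput corresponds to the DATABASE_S3_LOCATION variable the notebook lets you override.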
Once you have executed all the steps in the Setup section above, you are ready to run the code.
- Open the sagemaker-lf-credential-vending.ipynb notebook and run all cells.
  - The SageMaker notebook sagemaker-lf-credential-vending.ipynb runs with the SageMaker Execution Role. The SageMaker Execution Role is used to create a database and a Glue data table using the create_database and create_table APIs.
  - For the purpose of this sample, we upload dummy_data as a csv file to the S3 bucket corresponding to our table in the Glue data catalog.
  - More information is given in the notebook as comments.
- The notebook sets AllowFullTableExternalDataAccess to True in the settings['DataLakeSettings'] to vend temporary credentials for the Glue table.
- The notebook uses get_lf_temp_credentials, provided by the lf_vend_credentials.py module, to get the temporary credentials to read data from S3, and then read_spark_lf_data and read_pandas_lf_data from read_data.py to read data using these credentials into a Spark and a Pandas DataFrame respectively. Note that the code in this notebook refers to the data it needs to read via the Glue table name rather than by its path in S3, because LakeFormation and Glue hide those details from the application.
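The DataLakeSettings update that enables external data access can be sketched as follows. This is a minimal sketch of the get/put round trip, not the notebook's actual code:

```python
def enable_full_table_external_access(settings: dict) -> dict:
    """Return a copy of DataLakeSettings with full-table external access enabled.

    Third-party engines (and the vended-credential path used here) require
    AllowFullTableExternalDataAccess to be True.
    """
    updated = dict(settings)
    updated["AllowFullTableExternalDataAccess"] = True
    return updated


def update_data_lake_settings():
    import boto3  # deferred so the sketch is importable without AWS access

    lf = boto3.client("lakeformation")
    settings = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(
        DataLakeSettings=enable_full_table_external_access(settings))
```

Note that put_data_lake_settings replaces the whole settings document, which is why the sketch reads the current settings first and only flips the one flag.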
- Vending temporary credentials: this file uses the SageMaker Execution Role, which grants fine-grained access control to the Application Role on the list of specific columns. After granting the role fine-grained access control on the requested columns, the SageMaker Execution Role performs an assume role on the Application Role and gets temporary Glue table credentials, which contain the AccessKeyId, SecretAccessKey, and SessionToken. See get_lf_temp_credentials for more details.
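The grant-then-vend sequence rests on two LakeFormation APIs: grant_permissions (column-level grant to the Application Role) and get_temporary_glue_table_credentials (the credential vend). The helper names and exact call sequence below are an assumption for illustration; the repo's get_lf_temp_credentials may be structured differently:

```python
def build_grant_request(application_role_arn, database_name, table_name, column_names):
    """Grant SELECT on specific columns to the Application Role (fine-grained)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": application_role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database_name,
                "Name": table_name,
                "ColumnNames": column_names,
            }
        },
        "Permissions": ["SELECT"],
    }


def vend_temporary_credentials(application_role_arn, database_name, table_name,
                               column_names, table_arn):
    import boto3  # deferred so the sketch is importable without AWS access

    # Step 1: as the Execution Role (a data lake admin), grant column-level access.
    lf_admin = boto3.client("lakeformation")
    lf_admin.grant_permissions(**build_grant_request(
        application_role_arn, database_name, table_name, column_names))

    # Step 2: as the Application Role (after the assume-role hop), request
    # temporary table credentials from LakeFormation.
    lf_app = boto3.client("lakeformation")  # would use the assumed-role session here
    creds = lf_app.get_temporary_glue_table_credentials(
        TableArn=table_arn,
        Permissions=["SELECT"],
        SupportedPermissionTypes=["COLUMN_PERMISSION"],
        DurationSeconds=3600,
    )
    # creds contains AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return creds
```

The COLUMN_PERMISSION permission type is what scopes the vended credentials down to the granted columns rather than the whole table.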
- Reading data from S3 in Pandas and Spark: this file uses the temporary credentials, the S3 path, the file type, and the list of columns that the Application Role has fine-grained access to, and reads the data into a Pandas and a Spark DataFrame. The pandas_read_lf_data function subsets and returns the data in a Pandas DataFrame, and the spark_read_lf_data function returns the data in a Spark DataFrame.
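The Pandas read path can be sketched as follows, assuming a creds dict shaped like the vended credentials (AccessKeyId, SecretAccessKey, SessionToken); the helper names are illustrative, not the repo's actual functions:

```python
def s3_path_parts(s3_path: str):
    """Split 's3://bucket/key...' into (bucket, key)."""
    without_scheme = s3_path.replace("s3://", "", 1)
    bucket, _, key = without_scheme.partition("/")
    return bucket, key


def read_pandas_with_temp_creds(s3_path: str, creds: dict, columns: list):
    import boto3
    import pandas as pd  # deferred so the sketch is importable without AWS access

    bucket, key = s3_path_parts(s3_path)
    # Build an S3 client from the vended credentials, not the notebook's own role.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    # Only the columns the Application Role was granted access to are read.
    return pd.read_csv(body, usecols=columns)
```

For the Spark path, the same vended credentials would typically be passed to the s3a filesystem via the fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.session.token Hadoop configuration properties.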