- Published on
Setup Zero-ETL Integration with AWS RDS Aurora and Redshift
- Authors
- Name
- Gbenga Oni
- @gbxnga
Learn Hand-On AWS Cloud EngineeringHERE
Introduction
AWS has introduced the new Zero-ETL integration feature that facilitates the automatic replication of data from Amazon Aurora in real-time to AWS's datawarehouse service -- Redshift. This service is managed enitrely by AWS. This reduces the need to setup, manage and administer complex data pipelines. Petabytes of transaction data from Amazon Aurora can be made avaliable for real-time analytics.
In this article we will create an AWS Aurora database as source and an AWS Redshift Serverless database as destination and then we will establish a Zero-ETL integration between them, making data commited to the Aurora database available in Redshift in real-time for analysis.
Create Amazon Aurora Database
Navigate to the RDS page on your AWS console and click Create Database
Choose the Standard database creation method and the Aurora (MySQL Compatible) as engine type. Aurora PostgreSQL is not support as at the time of this writing.
Scroll down and select Aurora MySQL 3.03.1 (Compatible with MySQL 8.0.26) as engine version. It is important to note that the source DB cluster must be running Aurora MySQL version 3.03.1 and above.
Name your database and provide default login credentials - username and password.
Choose the Don't create an Aurora Replica option since this is for testing purposes.
Make sure the database is publically available(we will establish connection using an SQL workbench on local PC shortly) and choose the default VPC and security group for now.
Click on Create Database and wait for RDS to finish provisioning the instance.
Once the database is ready, note the Host URL of the RDS instance. You will find this by clicking on the cluster and scrolling down to the Endpoints section.
Create a Security Group that allows connection to port 3360
from Anywhere and attach it to the new database.
Click Modify on your the database and add the new security group. This will enable connection between your PC and the new instance.
Make sure to select Apply immediately during modification.
Data Ingestion using PHPMyAdmin
We will connect a local instance of PHPMyAdmin to the newly created Amazon Aurora database. phpMyAdmin is a free and open-source web-based application written in PHP that provides a graphical interface for managing MySQL and MariaDB databases. This will enable insertion of data directly into the database.
Create Local PHPMyAdmin instance using Docker
We will use docker to quickly spin up a phpMyAdmin container, making it available on port 8080
of localhost.
docker run --name phpmyadmin -e PMA_ARBITRARY=1 -p 8080:80 phpmyadmin/phpmyadmin
phpMyAdmin will now be available at localhost:8080
.
Provide the RDS instance hostname and admin credentials and click Login.
✅ Connection is now established between your local machine and the RDS instance.
Create Redhshift Serverless Cluster
Navigate to the Redshift section on the AWS console. Open the menu and Select Redshift Serverless to create a new serverless cluster.
Create Redshift Preview Workgroup
We need to create a Preview Workgroup, instead of the regular Redshift Serverless workgroup. This is beacuse, for now, the Zero-ETL integration is only available for testing purposes and accessible only on Preview Workgroups.
On the Redshift serverless dashboard, you will see an alert pane asking you to try the new Amazon Redshift features in preview. Click on the Create Preview Workgroup button. Please make sure you don't click on Create Workgroup button.
Name the workgroup and leave every other configurations on the form as is. Scroll down and click the Next button.
You will be prompted to create an associated namespace for the workgroup.
Leave other config as is, scroll down and click the Next button. Then click Create button to complete creation.
The new workgroup and namespace for Redshift Serverless is now being created. Wait for the process to finish.
Redshift serverless namespace and workgroup is now available.
Let's make sure we are able to query Redshift: Click on the Query Data
button on the top-right. You will be taken to the Redshift Query Console.
Select the Serverless: my-zero-etl-preview-workgroup dropdown to initiate a connection between the editor and redshift database.
Select Federated User and click on Create Connection.
✅ Connection is now established with Redshift.
Create Zero-ETL Integration
Now that we have created an Aurora database and a Redshift serverless database, lets setup a Zero-ETL integration between them, using Aurora as source and Redshift as target.
Configure Aurora DB Parameter Group for Zero-ETL
First we need to modify the DB parameter group of the Aurora cluster/instance because for Zero-ETL integration, specific values for the DB Cluster parameters are required e.g the Binary logging configuration.
A custom DB parameter group will be created and then it will be attached to the DB Cluster. A Custom parameter group with the below settings is needed:
aurora_enhanced_binlog=1
binlog_backup=0
binlog_format=ROW
binlog_replication_globaldb=0
binlog_row_image=full
binlog_row_metadata=full
Navigate to Parameter Groups on the RDS console. Select the Custom plane and click Create Parameter Group
Select aurora-mysql:8.0
as parameter group family. And choose Type DB Cluster Parameter Group. Give it a name and description then click Create.
✅ The Custom parameter group should now be created.
To edit the parameter group, select it and then click Actions dropdown on the top-right corner and select Edit.
Now search for each of the config that needs to be updated(based on the config settings listed above), for instance search for binlog_row_image
and set it to full
. Do same for the rest of the 6 parameters that need to be updated.
Once done, click Save Changes.
Now that the new RDS custom parameter group is ready, navigate to RDS dashboard and select the Aurora Database you created earlier, and click Modify. Scroll to the bottom and update the DB cluster parameter group to the new custom parameter group. In this case my-zero-etl-db-param
.
Click Continue and choose Apply Immediately on the final modification page, then click Modify Cluster.
RDS will now apply the new changes to the cluster. However the changes are asynchronously made, which means they wont be applied immediately. Rebooting the database is one way to ensure the changes are applied right away. So select the instance of the Aurora database and click Reboot.
Specify AWS Aurora RDS as integration source
We need to let Redshift namespace know that the Aurora database created earlier is the integration source. This can be done on the Resource Policy tab of the the namespace page. Go back to Redshift Serverless dashboard, click on the namespace you created earlier( in this case, I named mine my-zero-etl-namespace
), scroll down to the tabs and click on Resource Policy.
On the Resource Policy tab, scroll down to Authorized integration sources, click on Add authorized integration source button and paste the ARN of the AWS Aurora database created earlier.
You will find the ARN on the Configuration tab section of the database page.
Add the ARN and click Save Changes.
Complete Zero-ETL integration for RDS Aurora Cluster
On the RDS dashboard, navigate to Zero-ETL interations on the Menu. Scroll down and click the Create Zero-ETL integration button on the top-right corner of the integrations plane.
Give the integration a name, select the Aurora database and also the Redshift serverless database created earlier.
Leave every other thing as is, scroll down and click the Create Zero-ETL integration button.
Zero-ETL integration is now in progress.
✅ Zero-ETL integration is now in now done and Active.
Turn on case sensitivity for Amazon Redshift serverless
Part of the requirement for a target warehouse in Zero-ETL integration is to have enable_case_sensitive_identifier
turned on. For Amazon Redshift serverless, this can only be done via CLI.
Execute the following command via AWS CLI to turn on case sensitivity:
aws redshift-serverless update-workgroup \
--workgroup-name my-zero-etl-preview-workgroup \
--config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true
Create Zero-ETL Integration Database
Now we need to create a database for the integration. This is required to activate the integration and to access the data.
If you navigate to Zero-ETL Integrations page on the Radshift dashboard, you will see, as shown below, that a No Database alert is displayed.
Click on the integration for further details. On the page, click the Create Database form Integration button.
Give the database a name. I have named mine my_integration_database
in this case. Then click Create Database.
✅ Integration database is now set up.
Testing Zero-ETL Integration
Now that we have setup Zero-ETL integration with source DB and destination datawarehouse, we will populate the database with random data and confirm that they're replicated to Redshift.
Create database using PHPMyAdmin
Log back in to the Aurora databse using PhPMyAdmin. Then create a database called my_zero_etl_database
.
Select the newly created database and create a new SQL table called zeroetltable
with 4 columns.
I have added an id
column to the table and made it a PRIMARY KEY. This is important because creating a table that has no PRIMARY KEY will cause the ETL integration to break untill the issue is addressed.
You can name other columns as you choose. Once done, Click Go or execute the query(depending on the SQL workbench you are using).
If you navigate to the Redshift Query console, you will see that the database my_zero_etl_database
and the table zeroetltable
have been replicated automatically.
Now lets insert some data into the table.
Now if you go back to the Redshift Query console and refresh the database/table, you will find the data you just inserted there.
Conclusion
You have now successfully created a Zero-ETL integration between your source database and your destination datawarehouse(Redshift).