AWS DynamoDB Overview
Amazon DynamoDB is a fully managed, serverless NoSQL key-value database designed to run high-performance applications at any scale. It delivers consistent sub-10-millisecond responses, offers virtually unlimited storage, and provides reliable performance at any scale. DynamoDB comes with built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data export tools.
Redshift Overview
Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service. You can start with a few hundred gigabytes of data and later scale up to petabytes. Redshift is an OLAP (Online Analytical Processing) system: it supports complex analytical operations, focuses on decision support, and returns intuitive, easy-to-understand query results.
Resource Preparation
VPC
- VPC: CIDR block 10.10.0.0/16
- Internet gateway
- Elastic IP address
- NAT gateway: uses the elastic IP address as its public IP
- Public subnets: three, one per Availability Zone
- Private subnets: three, one per Availability Zone
- Public route table (associated with the public subnets):
  - destination 0.0.0.0/0, target: internet-gateway-id (allows communication with the outside world)
  - destination 10.10.0.0/16, target: local (internal traffic)
- Private route table (associated with the private subnets):
  - destination 10.10.0.0/16, target: local (internal traffic)
  - destination 0.0.0.0/0, target: nat-gateway-id (allows internal resources to reach the outside world)
- Web server security group:
  - allows any IP to access port 443
  - allows your own IP to access port 22, so you can SSH to the server and insert data into the database
- Glue Redshift connection security group: contains a single self-referencing rule that allows the same security group to access all TCP ports; this security group is needed when creating the Glue connection. Reference: the Glue connection security group must have a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol TCP, Port Range including all ports, and whose Source is the same security group (Group ID).
- Private Redshift security group:
  - allows 10.10.0.0/24 inside the VPC to access port 5439
  - allows the Glue connection security group to access port 5439
- Public Redshift security group:
  - allows 10.10.0.0/24 inside the VPC to access port 5439
  - allows the public IPs of the region where Kinesis Data Firehose runs to access port 5439:
    13.58.135.96/27 for US East (Ohio)
    52.70.63.192/27 for US East (N. Virginia)
    13.57.135.192/27 for US West (N. California)
    52.89.255.224/27 for US West (Oregon)
    18.253.138.96/27 for AWS GovCloud (US-East)
    52.61.204.160/27 for AWS GovCloud (US-West)
    35.183.92.128/27 for Canada (Central)
    18.162.221.32/27 for Asia Pacific (Hong Kong)
    13.232.67.32/27 for Asia Pacific (Mumbai)
    13.209.1.64/27 for Asia Pacific (Seoul)
    13.228.64.192/27 for Asia Pacific (Singapore)
    13.210.67.224/27 for Asia Pacific (Sydney)
    13.113.196.224/27 for Asia Pacific (Tokyo)
    52.81.151.32/27 for China (Beijing)
    161.189.23.64/27 for China (Ningxia)
    35.158.127.160/27 for Europe (Frankfurt)
    52.19.239.192/27 for Europe (Ireland)
    18.130.1.96/27 for Europe (London)
    35.180.1.96/27 for Europe (Paris)
    13.53.63.224/27 for Europe (Stockholm)
    15.185.91.0/27 for Middle East (Bahrain)
    18.228.1.128/27 for South America (São Paulo)
    15.161.135.128/27 for Europe (Milan)
    13.244.121.224/27 for Africa (Cape Town)
    13.208.177.192/27 for Asia Pacific (Osaka)
    108.136.221.64/27 for Asia Pacific (Jakarta)
    3.28.159.32/27 for Middle East (UAE)
    18.100.71.96/27 for Europe (Spain)
    16.62.183.32/27 for Europe (Zurich)
    18.60.192.128/27 for Asia Pacific (Hyderabad)
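The self-referencing rule above is easy to get wrong. As a quick sanity check, here is a small sketch (the helper name and its reliance on the `describe_security_groups` response shape are my own, not part of the original deployment) that verifies a security group entry contains an all-TCP self-referencing rule:

```python
def has_self_referencing_all_tcp_rule(group):
    """Check one entry of ec2.describe_security_groups()["SecurityGroups"]:
    True if the group allows all TCP ports (0-65535) from itself, as AWS Glue
    requires for its connection security group."""
    gid = group["GroupId"]
    for perm in group.get("IpPermissions", []):
        if perm.get("IpProtocol") != "tcp":
            continue
        if perm.get("FromPort") != 0 or perm.get("ToPort") != 65535:
            continue
        # the Source must be the same security group id
        if any(pair.get("GroupId") == gid for pair in perm.get("UserIdGroupPairs", [])):
            return True
    return False
```

In practice you would pass each element of the `describe_security_groups` response and flag groups where this returns False.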
Serverless file with all VPC resources
Replace custom.bucketNamePrefix with your own bucket:

```yaml
service: dynamodb-to-redshift-vpc

custom:
  bucketNamePrefix: jessica

provider:
  name: aws
  region: ${opt:region, 'ap-southeast-1'}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    VpcName:
      Type: String
      Default: test-vpc
  Resources:
    VPC:
      Type: AWS::EC2::VPC
      Properties:
        CidrBlock: 10.10.0.0/16
        EnableDnsSupport: true
        EnableDnsHostnames: true
        InstanceTenancy: default
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}
    # Internet Gateway
    InternetGateway:
      Type: AWS::EC2::InternetGateway
      Properties:
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_InternetGateway
    VPCGatewayAttachment:
      Type: AWS::EC2::VPCGatewayAttachment
      Properties:
        VpcId: !Ref VPC
        InternetGatewayId: !Ref InternetGateway
    # web server security group
    WebServerSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from public
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 443
            ToPort: 443
            CidrIp: 0.0.0.0/0
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_WebServerSecurityGroup
    # public route table
    RouteTablePublic:
      Type: AWS::EC2::RouteTable
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_RouteTablePublic
    RouteTablePublicInternetRoute:
      Type: AWS::EC2::Route
      DependsOn: VPCGatewayAttachment
      Properties:
        RouteTableId: !Ref RouteTablePublic
        DestinationCidrBlock: 0.0.0.0/0
        GatewayId: !Ref InternetGateway
    # public subnets
    SubnetAPublic:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [0, !GetAZs '']
        CidrBlock: 10.10.0.0/24
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetAPublic
    RouteTableAssociationAPublic:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetAPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetBPublic:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [1, !GetAZs '']
        CidrBlock: 10.10.32.0/24
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetBPublic
    RouteTableAssociationBPublic:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetBPublic
        RouteTableId: !Ref RouteTablePublic
    SubnetCPublic:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [2, !GetAZs '']
        CidrBlock: 10.10.64.0/24
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetCPublic
    RouteTableAssociationCPublic:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetCPublic
        RouteTableId: !Ref RouteTablePublic
    # redshift security groups
    PrivateRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_PrivateRedshiftSecurityGroup
    PublicRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from inside vpc and Kinesis Data Firehose CIDR block
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 10.10.0.0/24
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            CidrIp: 13.228.64.192/27
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_PublicRedshiftSecurityGroup
    GlueRedshiftConnectionSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow self referring for all tcp ports
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_GlueRedshiftConnectionSecurityGroup
    GlueRedshiftConnectionSecurityGroupSelfReferringInboundRule:
      Type: AWS::EC2::SecurityGroupIngress
      Properties:
        GroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        IpProtocol: tcp
        FromPort: 0
        ToPort: 65535
        SourceSecurityGroupId: !GetAtt GlueRedshiftConnectionSecurityGroup.GroupId
        SourceSecurityGroupOwnerId: !Sub ${aws:accountId}
    # nat gateway
    EIP:
      Type: AWS::EC2::EIP
      Properties:
        Domain: vpc
    NatGateway:
      Type: AWS::EC2::NatGateway
      Properties:
        AllocationId: !GetAtt EIP.AllocationId
        SubnetId: !Ref SubnetAPublic
    # private route table
    RouteTablePrivate:
      Type: AWS::EC2::RouteTable
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_RouteTablePrivate
    RouteTablePrivateRoute:
      Type: AWS::EC2::Route
      Properties:
        RouteTableId: !Ref RouteTablePrivate
        DestinationCidrBlock: 0.0.0.0/0
        NatGatewayId: !Ref NatGateway
    # private subnets
    SubnetAPrivate:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [0, !GetAZs '']
        CidrBlock: 10.10.16.0/24
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetAPrivate
    RouteTableAssociationAPrivate:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetAPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetBPrivate:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [1, !GetAZs '']
        CidrBlock: 10.10.48.0/24
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetBPrivate
    RouteTableAssociationBPrivate:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetBPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetCPrivate:
      Type: AWS::EC2::Subnet
      Properties:
        AvailabilityZone: !Select [2, !GetAZs '']
        CidrBlock: 10.10.80.0/24
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub VPC_${VpcName}_SubnetCPrivate
    RouteTableAssociationCPrivate:
      Type: AWS::EC2::SubnetRouteTableAssociation
      Properties:
        SubnetId: !Ref SubnetCPrivate
        RouteTableId: !Ref RouteTablePrivate
  Outputs:
    VPC:
      Description: VPC.
      Value: !Ref VPC
      Export:
        Name: !Sub ${self:provider.stackName}
    SubnetsPublic:
      Description: Subnets public.
      Value:
        !Join [",", [!Ref SubnetAPublic, !Ref SubnetBPublic, !Ref SubnetCPublic]]
      Export:
        Name: !Sub ${self:provider.stackName}-PublicSubnets
    SubnetsPrivate:
      Description: Subnets private.
      Value:
        !Join [",", [!Ref SubnetAPrivate, !Ref SubnetBPrivate, !Ref SubnetCPrivate]]
      Export:
        Name: !Sub ${self:provider.stackName}-PrivateSubnets
    DefaultSecurityGroup:
      Description: VPC Default Security Group
      Value: !GetAtt VPC.DefaultSecurityGroup
      Export:
        Name: !Sub ${self:provider.stackName}-DefaultSecurityGroup
    WebServerSecurityGroup:
      Description: VPC Web Server Security Group
      Value: !Ref WebServerSecurityGroup
      Export:
        Name: !Sub ${self:provider.stackName}-WebServerSecurityGroup
    PrivateRedshiftSecurityGroup:
      Description: The id of the RedshiftSecurityGroup
      Value: !Ref PrivateRedshiftSecurityGroup
      Export:
        Name: !Sub ${self:provider.stackName}-PrivateRedshiftSecurityGroup
    PublicRedshiftSecurityGroup:
      Description: The id of the RedshiftSecurityGroup
      Value: !Ref PublicRedshiftSecurityGroup
      Export:
        Name: !Sub ${self:provider.stackName}-PublicRedshiftSecurityGroup
    GlueRedshiftConnectionSecurityGroup:
      Description: The id of the self referring security group
      Value: !Ref GlueRedshiftConnectionSecurityGroup
      Export:
        Name: !Sub ${self:provider.stackName}-GlueSelfRefringSecurityGroup
```

Redshift Cluster
Private Cluster
- Subnet group: create a private cluster subnet group containing the private subnets.
- The private cluster is used to test the Glue job syncing data to Redshift. PubliclyAccessible must be set to false, otherwise the Glue job cannot connect.
- ClusterSubnetGroupName: use the private subnet group
- VpcSecurityGroupIds: use the private Redshift security group
- NodeType: dc2.large
- ClusterType: single-node
- PubliclyAccessible: false

Public Cluster
- Subnet group: create a public cluster subnet group containing the public subnets.
- The public cluster is used to test Kinesis Data Firehose syncing data to Redshift. PubliclyAccessible must be set to true, and the security group must allow the Kinesis Firehose public IPs to access port 5439, otherwise Firehose cannot connect to Redshift.
- ClusterSubnetGroupName: use the public subnet group
- VpcSecurityGroupIds: use the public Redshift security group
- NodeType: dc2.large
- ClusterType: single-node
- PubliclyAccessible: true
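The two PubliclyAccessible constraints above are easy to restate in code. A minimal sketch (the checker function and the dict shape are my own, mirroring the cluster properties used in this post, not an AWS API):

```python
def check_cluster_for_sync(cluster, sync_path):
    """Validate a dict of Redshift cluster properties for a given sync path.

    - "glue-job": the cluster must NOT be publicly accessible, otherwise the
      Glue job cannot connect through the VPC connection.
    - "firehose": the cluster MUST be publicly accessible, since Kinesis Data
      Firehose connects over the Firehose region's public IPs.

    Returns a list of problems; an empty list means the config looks right.
    """
    problems = []
    public = cluster.get("PubliclyAccessible", False)
    if sync_path == "glue-job" and public:
        problems.append("PubliclyAccessible must be false for Glue job sync")
    if sync_path == "firehose" and not public:
        problems.append("PubliclyAccessible must be true for Firehose sync")
    return problems
```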
Serverless file with all Redshift resources:
Replace custom.bucketNamePrefix with your own bucket:

```yaml
service: dynamodb-to-redshift-redshift

custom:
  bucketNamePrefix: jessica

provider:
  name: aws
  region: ${opt:region, 'ap-southeast-1'}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift
  Resources:
    PrivateClusterSubnetGroup:
      Type: AWS::Redshift::ClusterSubnetGroup
      Properties:
        Description: Private Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
        Tags:
          - Key: Name
            Value: private-subnet
    PrivateCluster:
      Type: AWS::Redshift::Cluster
      Properties:
        ClusterIdentifier: test-data-sync-redshift
        ClusterSubnetGroupName: !Ref PrivateClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: false
    PublicClusterSubnetGroup:
      Type: AWS::Redshift::ClusterSubnetGroup
      Properties:
        Description: Public Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicSubnets
        Tags:
          - Key: Name
            Value: public-subnet
    PublicCluster:
      Type: AWS::Redshift::Cluster
      Properties:
        ClusterIdentifier: test-data-sync-redshift-public
        ClusterSubnetGroupName: !Ref PublicClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PublicRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
        PubliclyAccessible: true
  Outputs:
    PrivateRedshiftEndpoint:
      Description: Redshift endpoint
      Value: !GetAtt PrivateCluster.Endpoint.Address
      Export:
        Name: !Sub ${self:provider.stackName}-PrivateRedshiftEndpoint
    PrivateRedshiftPort:
      Description: Redshift port
      Value: !GetAtt PrivateCluster.Endpoint.Port
      Export:
        Name: !Sub ${self:provider.stackName}-PrivateRedshiftPort
    PublicRedshiftEndpoint:
      Description: Public Redshift endpoint
      Value: !GetAtt PublicCluster.Endpoint.Address
      Export:
        Name: !Sub ${self:provider.stackName}-PublicRedshiftEndpoint
    PublicRedshiftPort:
      Description: Public Redshift port
      Value: !GetAtt PublicCluster.Endpoint.Port
      Export:
        Name: !Sub ${self:provider.stackName}-PublicRedshiftPort
```

Syncing with an AWS Glue ETL Job
Use Cases
One-time full-table sync. For typical time-series data, current data is written and read frequently while older data is accessed less and less, so a common pattern is to create one table per time period (e.g. per day) to allocate WCU and RCU sensibly. If, after a period ends, you need to run complex analytical operations over all the data in that period, you need to sync the whole DynamoDB table to Redshift.
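Once deployed, a one-off sync like this can be driven from a script rather than the console. A sketch assuming a boto3-style Glue client is passed in (the wrapper function is my own; in practice pass a real `boto3.client("glue")`):

```python
def run_full_table_sync(glue, job_name="dynamodb-to-redshift-glue-etl-job"):
    """Start the Glue ETL job that copies the whole DynamoDB table to
    Redshift, and return the run id. `glue` is a boto3-style Glue client,
    so this also works with a stub client in tests."""
    resp = glue.start_job_run(JobName=job_name)
    return resp["JobRunId"]
```

The run id can then be polled with `get_job_run` to wait for completion.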
Architecture

Advantages
AWS Glue Crawlers manage the schemas of the source and target tables automatically, so the Glue job script can skip the explicit mapping step, which keeps the script easy to maintain.
Resource Deployment
- DynamoDB table: the source data table
- IAM role for the Glue crawlers: the crawlers need permissions to connect to DynamoDB and Redshift to read the table schemas
- DynamoDB Glue catalog database: stores the DynamoDB table schema generated by the crawler
- Redshift Glue catalog database: stores the Redshift table schema generated by the crawler
- DynamoDB Glue crawler: reads the DynamoDB table and generates the corresponding table schema
- Redshift Glue crawler: reads the Redshift table and generates the corresponding table schema
- Glue connection: the connection the Glue job uses to reach Redshift
- IAM role for the Glue job
- S3 bucket for the Glue job
- Glue job
How to deploy:

```shell
sls deploy -c glue-etl.yml

# replace ${bucketNamePrefix} with your own glue bucket name created in glue-etl.yml
aws s3 cp dynamodb-to-redshift.py s3://com.${bucketNamePrefix}.glue-etl-temp-bucket/script/
```
Deployment file glue-etl.yml:

```yaml
service: dynamodb-to-redshift-glue-etl

custom:
  bucketNamePrefix: jessica

provider:
  name: aws
  region: ${opt:region, 'ap-southeast-1'}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    DynamodbTableName:
      Type: String
      Default: TestSyncToRedshift
    ServiceName:
      Type: String
      Default: dynamodb-to-redshift
    GlueBucketName:
      Type: String
      Default: com.${self:custom.bucketNamePrefix}.glue-etl-temp-bucket
  Resources:
    TestTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: !Sub ${DynamodbTableName}
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: pk
            AttributeType: S
          - AttributeName: sk
            AttributeType: S
        KeySchema:
          - AttributeName: pk
            KeyType: HASH
          - AttributeName: sk
            KeyType: RANGE
    CrawlerRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: CrawlerRole
        AssumeRolePolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - glue.amazonaws.com
              Action:
                - sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess
    DynamodbDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: dynamodb-database
    DynamodbCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: dynamodb-crawler
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref DynamodbDatabase
        Targets:
          DynamoDBTargets:
            - Path: !Sub ${DynamodbTableName}
        SchemaChangePolicy:
          UpdateBehavior: UPDATE_IN_DATABASE
          DeleteBehavior: LOG
        Schedule:
          ScheduleExpression: cron(0/10 * * * ? *) # run every 10 minutes
    GlueRedshiftConnection:
      Type: AWS::Glue::Connection
      Properties:
        CatalogId: !Sub ${aws:accountId}
        ConnectionInput:
          Name: ${self:service}-redshift-connection
          ConnectionType: JDBC
          MatchCriteria: []
          PhysicalConnectionRequirements:
            SecurityGroupIdList:
              - Fn::ImportValue: !Sub ${ServiceName}-vpc-GlueSelfRefringSecurityGroup
            SubnetId:
              Fn::Select:
                - 1
                - Fn::Split:
                    - ","
                    - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
          ConnectionProperties:
            JDBC_CONNECTION_URL:
              Fn::Join:
                - ""
                - - jdbc:redshift://
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftEndpoint
                  - ":"
                  - Fn::ImportValue: !Sub ${ServiceName}-redshift-PrivateRedshiftPort
                  - /dev
            JDBC_ENFORCE_SSL: false
            USERNAME: admin
            PASSWORD: Redshift_admin_2022
    RedshiftDatabase:
      Type: AWS::Glue::Database
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseInput:
          Name: redshift-database
    RedshiftCrawler:
      Type: AWS::Glue::Crawler
      Properties:
        Name: redshift-crawler
        Role: !GetAtt CrawlerRole.Arn
        DatabaseName: !Ref RedshiftDatabase
        Targets:
          JdbcTargets:
            - ConnectionName: !Ref GlueRedshiftConnection
              Path: dev/public/test_sync_to_redshift
        SchemaChangePolicy:
          UpdateBehavior: UPDATE_IN_DATABASE
          DeleteBehavior: LOG
    RedshiftGlueJobRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: RedshiftGlueJobRole
        AssumeRolePolicyDocument:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - glue.amazonaws.com
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
          - arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
          - arn:aws:iam::aws:policy/AmazonS3FullAccess
          - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
    GlueTempBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: !Sub ${GlueBucketName}
    GlueJob:
      Type: AWS::Glue::Job
      Properties:
        Name: dynamodb-to-redshift-glue-etl-job
        Role: !GetAtt RedshiftGlueJobRole.Arn
        Command:
          Name: glueetl
          ScriptLocation: !Sub s3://${GlueBucketName}/script/dynamodb-to-redshift.py
          PythonVersion: 3
        DefaultArguments:
          --TempDir: !Sub s3://${GlueBucketName}/tmp/dynamodb-to-redshift/
        WorkerType: G.1X
        NumberOfWorkers: 2
        GlueVersion: 3.0
        Connections:
          Connections:
            - !Ref GlueRedshiftConnection
```

Glue job script dynamodb-to-redshift.py:
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

params = [
    "JOB_NAME",
    "TempDir",
]
args = getResolvedOptions(sys.argv, params)
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the whole DynamoDB table via the catalog schema built by the crawler
DynamoDBtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="dynamodb-database",
    table_name="testsynctoredshift",
    transformation_ctx="DynamoDBtable_node1",
)

# Write it to Redshift via the catalog schema built by the crawler
RedshiftCluster_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame=DynamoDBtable_node1,
    database="redshift-database",
    table_name="dev_public_test_sync_to_redshift",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node2",
)

job.commit()
```

Testing

1. Insert some data into the DynamoDB table with the AWS web console first; otherwise the crawler cannot detect the table schema.
2. Run dynamodb-crawler. After it succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
3. Create the Redshift table with [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client): CREATE TABLE public.test_sync_to_redshift(pk varchar(200) NOT NULL, sk varchar(200) NOT NULL, primary key(pk, sk));
4. Run redshift-crawler. If you encounter a "no valid connection" error, update the password in the redshift-connection manually in the AWS console (it is unclear why the password is incorrect when deployed with CloudFormation). After it succeeds, you can see the database and table in the [glue console](https://ap-southeast-1.console.aws.amazon.com/glue/home?region=ap-southeast-1#catalog:tab=tables).
5. Run the Glue ETL job. After it succeeds, you can check the data in the Redshift table with [Redshift query editor v2](https://ap-southeast-1.console.aws.amazon.com/sqlworkbench/home?region=ap-southeast-1#/client). This job inserts all the data in the DynamoDB table directly into the Redshift table, and in Redshift, [primary key and foreign key constraints are informational only; they are not enforced by Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html), so if you run the job several times you will see duplicate rows with the same primary key in the Redshift table.
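Because Redshift happily accumulates those duplicates across repeated runs, one workaround is to rewrite the table keeping only distinct rows. A sketch that just builds the SQL text to run in the query editor (the statement shape is my suggestion, not from the original post):

```python
def build_dedup_sql(table="public.test_sync_to_redshift"):
    """Build SQL that removes exact duplicate rows left by repeated job runs:
    copy DISTINCT rows into a temp table, truncate the original, and copy
    them back. Redshift does not enforce PRIMARY KEY, so this must be done
    explicitly."""
    tmp = table.split(".")[-1] + "_dedup"
    return (
        f"CREATE TEMP TABLE {tmp} AS SELECT DISTINCT * FROM {table};\n"
        f"TRUNCATE {table};\n"
        f"INSERT INTO {table} SELECT * FROM {tmp};\n"
        f"DROP TABLE {tmp};"
    )
```

Note this removes rows that are identical in every column; rows with the same primary key but different attribute values need an explicit "keep the latest" rule instead.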
Syncing with an AWS Glue Streaming Job
Use Cases
Continuous incremental sync; supports insert, update, and delete operations on the table.
Architecture

Resource Deployment
- DynamoDB table
- VPC
- Redshift cluster
- Glue crawler
- Glue job
Advantages
Can sync insert, update, and delete operations on the table.
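The reason the streaming path can handle all three operations is that DynamoDB Streams records carry an event name that distinguishes them. A minimal sketch of the dispatch logic (this mapping function is illustrative, not the actual Glue streaming script):

```python
def record_to_action(record):
    """Map a DynamoDB Streams record to the operation to apply downstream.

    INSERT and MODIFY carry the new item image; REMOVE carries only the key,
    which is why this path can also propagate deletes to Redshift."""
    name = record["eventName"]
    body = record["dynamodb"]
    if name in ("INSERT", "MODIFY"):
        return ("upsert", body["NewImage"])
    if name == "REMOVE":
        return ("delete", body["Keys"])
    raise ValueError(f"unexpected event name: {name}")
```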
Disadvantages

Syncing with AWS Kinesis Data Firehose
Use Cases
Continuous incremental sync where the table only receives inserts; updates are partially supported and deletes are not supported — for example, recording per-second sensor readings or website user behavior events.

Kinesis Data Firehose integrates with Redshift through the Redshift COPY command, and Redshift does not guarantee primary key uniqueness (to Redshift, a primary key is informational only and is not enforced). If the source data of a COPY command contains multiple records with the same primary key — for example, several modifications of the same item — the Redshift table ends up with multiple rows sharing that primary key. "Partially supported updates" means: if multiple rows with the same primary key do not affect your business logic, you can still use Kinesis Data Firehose for the sync; if they do, you cannot. And because Firehose integrates with Redshift via the COPY command, which cannot delete rows, deletes are not supported at all.
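Each change shipped through Firehose simply becomes one more row appended by COPY, which is exactly what produces the duplicate primary keys described above. A sketch of the producer side, assuming newline-delimited JSON as the delivery format (the serializer is my own, not from the original post):

```python
import json

def to_firehose_record(item):
    """Serialize one item change as a newline-delimited JSON record for
    Kinesis Data Firehose. Every update of the same primary key yields a
    separate record, and Redshift COPY appends each one as a new row --
    nothing along the way deduplicates them."""
    data = json.dumps(item, separators=(",", ":"), sort_keys=True) + "\n"
    return {"Data": data.encode("utf-8")}
```

With a real client you would pass the result to `firehose.put_record(DeliveryStreamName=..., Record=to_firehose_record(item))`.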
Architecture
Resource Deployment

Reference
Setting up networking for development for AWS Glue - AWS Glue