
Thursday, August 25, 2016

Duplicate message reading from SQS - AWS !!


Recently we faced an issue reading messages from SQS in the AWS cloud, where the same message was processed multiple times. We identified this issue by using the message identifier (mid) of each message, stored in a column of an AWS Redshift table.

How our system works:

  1. Post messages to AWS SQS using one task
  2. Read a batch of messages from SQS and start processing the JSON messages
  3. Validate each message in the batch and put it in an AWS S3 bucket
  4. Load the data into the AWS Redshift database
  5. Delete the batch of messages from SQS after successfully processing them (a minimal sketch of this loop follows the list)
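
The read/process/delete loop in steps 2-5 looks roughly like the sketch below, using boto3. The queue URL, region and the validate_and_store placeholder are illustrative, not our actual code.

import boto3

# Hypothetical queue URL; the real queue name and region are environment specific.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-message-queue"

sqs = boto3.client("sqs", region_name="us-east-1")

def validate_and_store(body):
    # Placeholder for our real pipeline: validate the JSON message,
    # write it to S3 and load it into Redshift.
    return True

def process_batch():
    # Read a batch of up to 10 messages; they become "in flight" and are
    # hidden from other consumers for the queue's visibility timeout.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
    )
    entries = []
    for msg in response.get("Messages", []):
        if validate_and_store(msg["Body"]):
            entries.append({"Id": msg["MessageId"], "ReceiptHandle": msg["ReceiptHandle"]})

    # Delete the batch only after every message was processed successfully.
    if entries:
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=entries)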

As mentioned earlier, some messages were being processed multiple times, and this happened only in one environment, not in all of them. It took almost three days to find the root cause (RCA). The possible reasons for this to occur are:

  • Reading a message from SQS and not deleting it - if this were the case, messages would never be deleted from SQS and every message would be processed multiple times. But that was not happening: messages were being deleted, yet some messages were still processed multiple times.
  • Reading the same message by multiple tasks - this is what was happening in our scenario.

Task1 reads a batch of messages from SQS, and before it deletes all of these messages, another task picks up some of the messages Task1 has already read. This is due to the visibility timeout of a message in SQS. Let's see how this happens.

When a task reads a batch of messages from SQS, these messages move to the in-flight state and are not visible to other tasks, since they are still being processed. SQS has a property called the visibility timeout: a message in the in-flight state is not visible to other tasks until this timeout expires.

Before this timeout expires, we need to complete the whole process for a message: validation, loading to S3, storing in Redshift and deleting it from SQS. In our case, some messages did not complete this process (and so were not deleted) within the visibility timeout (10 seconds in this case), causing them to become visible again after the visibility timeout expired.

As I mentioned, this was happening in only one environment because only that environment had a visibility timeout of 10 seconds, while all other environments had 20 seconds. Our task was not completing within 10 seconds, causing the duplicate message processing issue. To solve this issue, we simply increased the visibility timeout to 20 seconds.
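
For reference, the queue's visibility timeout can be changed from the SQS console or through the API; a minimal boto3 sketch (the queue URL is a placeholder) would be:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Placeholder queue URL; point this at your own queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-message-queue"

# Raise the default visibility timeout from 10 to 20 seconds so in-flight
# messages stay hidden long enough for processing and deletion to finish.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"VisibilityTimeout": "20"},
)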

I hope this helps.

Happy Reading!!!





Saturday, February 20, 2016

S3 load errors in Redshift (AWS) COPY command



We have faced a lot of weird issues while loading S3 bucket files into Redshift. I will try to explain the issues we faced. Before going into that, let's see what a valid S3 file should contain:

  • The number of values in each line of the S3 file is exactly equal to the number of columns in the Redshift table
  • Each value in the S3 file is separated by a delimiter, in our case a pipe (|)
  • Each line in the S3 file corresponds to exactly one insert into Redshift
  • Empty values are passed in the S3 file for the corresponding optional fields in the table


To store S3 file content in the Redshift database, AWS provides the COPY command, which loads bulk or batches of S3 data into Redshift.
Let's assume there is a table testMessage in Redshift which has three columns: id of integer type, name of varchar(10) type and msg of varchar(10) type.

The COPY command to load the S3 file into Redshift is below:

copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_'

A sample S3 file to insert values could be:

77|chandu|chanduthedev


In this file there are three values in total, which equals the number of columns in the testMessage table, and each value is separated by the pipe (|) symbol.

Let's see another sample S3 file:
88||chanduthedev

In this file, we have an empty value for the name column of the testMessage table in Redshift. So far so good. Now let's take some S3 files which cause the Redshift COPY command to fail:

99|chanduthedev

This S3 file contains only two values, 99 and chanduthedev, and is missing the third value, which causes the S3 load COPY command to fail.

99|ch
and u|chanduthedev

In this file, the second value is ch\nand u, which contains a newline (\n) character, so it becomes two rows in the S3 file, which means two insert statements for the Redshift COPY command. The first row becomes a two-value insert, which is invalid, and the second row is another invalid two-value insert.

For these invalid S3 files you may get the error message below:

Load into table 'testMessage' failed. Check 'stl_load_errors' system table for details
and in AWS you may get the error below:
Delimiter not found
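
To see what actually went wrong, query the stl_load_errors system table mentioned in the error. A minimal sketch using psycopg2 (the connection details are placeholders for your own cluster):

import psycopg2

# Placeholder connection details; use your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

with conn.cursor() as cur:
    # Most recent load errors first; err_reason holds messages such as
    # "Delimiter not found" and raw_line shows the offending S3 line.
    cur.execute("""
        SELECT starttime, filename, line_number, err_reason, raw_line
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)

conn.close()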

Let's take another failing S3 file, which has the delimiter as part of the value for the name column:

77|chan|234|chanduthedev


In the above S3 file, there appear to be four values because of the extra pipe (|) character in the name chan|234, which causes the Redshift COPY command to treat the line as having four values, while the table has only three columns.

For S3 load failures, the most common reason is special characters or escape characters such as newlines (\n), double quotes (") and single quotes ('). Either you need to escape those special characters or remove them.
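
A minimal sketch of the "remove special characters" approach, cleaning each value before writing the pipe-delimited line (the exact set of characters to strip is illustrative):

def clean_value(value):
    # Strip characters that break the pipe-delimited format:
    # newlines, the delimiter itself and quote characters.
    for ch in ("\n", "\r", "|", '"', "'"):
        value = value.replace(ch, "")
    return value

def to_s3_line(values):
    # Build one pipe-delimited line, i.e. one row for the Redshift COPY.
    return "|".join(clean_value(str(v)) for v in values) + "\n"

# The newline inside "ch\nand u" is removed, so the record stays on one line:
print(to_s3_line([99, "ch\nand u", "chanduthedev"]), end="")   # 99|chand u|chanduthedev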

We followed the latter idea of removing special characters while processing, before storing in Redshift. But later we came to know that we can use the ESCAPE keyword in the COPY command:


copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE

Adding ESCAPE to the COPY command will solve a lot of these issues. So always check for ESCAPE.

Happy Debugging...

