Lambda trigger on file upload to specified prefix
Backstory
Another story based on an assigned task. There are hundreds of source systems publishing files to an S3 bucket, each to its own prefix. Object keys are prepared for Athena query optimization - the data is partitioned.
An SNS topic is set as the target of an S3 Event notification for the "data/" prefix. A third-party system requires a notification each time a file is uploaded by any source system.
The task is to invoke a Lambda function whenever a file is uploaded by one particular source system - let's call it SystemA. Other source systems shouldn't trigger any additional actions.
S3 data structure looks like this:
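A sketch of the layout - assuming the Hive-style partition prefixes used throughout the post; the file names are only illustrative:

```
s3://tb-athena-source/
└── data/
    ├── source_system=SystemA/
    │   └── part-0001.parquet
    ├── source_system=SystemB/
    │   └── part-0001.parquet
    └── ...
```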
Idea 1 - not good not terrible
As an S3 Event notification for every file uploaded to the "data/" prefix is already in place, I can subscribe a Lambda function to the SNS topic and verify the object key each time.
Sounds good and is simple to implement, but it is not efficient. Out of 100 uploaded files, only 1 is published by SystemA, and invoking Lambda for the other 99 is just a waste of money.
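The key-checking Lambda from idea 1 could be sketched like this - assuming the standard SNS-wrapped S3 event shape; the prefix constant and the "collect matching keys" body are just placeholders for real processing:

```python
import json
from urllib.parse import unquote_plus

# Prefix of the only source system we care about (from the task description)
TARGET_PREFIX = "data/source_system=SystemA/"

def handle_sns_event(event, context=None):
    """SNS-subscribed Lambda: act only on objects under TARGET_PREFIX."""
    matched = []
    for sns_record in event["Records"]:
        # The S3 event notification arrives as a JSON string in the SNS message body
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            # Object keys in event notifications are URL-encoded (e.g. '=' -> '%3D')
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            if key.startswith(TARGET_PREFIX):
                matched.append(key)  # real processing would go here
    return matched
```

Simple, but as noted above, this handler still gets invoked for all 100 uploads just to discard 99 of them.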
I want something better.
Idea 2 - bad one
The next idea was to create another S3 Event notification for the "data/source_system=SystemA/" prefix. Sounds simple, but it is not. Prefixes for the same event type cannot overlap, and "data/source_system=SystemA/" obviously overlaps the "data/" prefix - what a surprise :)
Idea 3 - not so bad, but still
Let's go back to idea 1 and check the SNS options. Message filtering looks promising: messages can be filtered so that Lambda is invoked only for messages that match the filter.
One more time it sounds good, but SNS message filtering is based on message attributes, and S3 Event Notification messages have no attributes, so there is nothing to filter by.
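For the record, the filtering itself would be trivial - if the notification messages did carry, say, a hypothetical source_system message attribute (they don't, which is exactly the problem), the subscription filter policy would be as simple as:

```json
{
  "source_system": ["SystemA"]
}
```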
Idea 4 - almost good
Let's replace the "data/" prefix event notification with one per source system:
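A sketch of what one such per-system entry could look like when set via `aws s3api put-bucket-notification-configuration` - the topic ARN and configuration Id are placeholders:

```json
{
  "TopicConfigurations": [
    {
      "Id": "system-a-uploads",
      "TopicArn": "arn:aws:sns:eu-west-1:123456789012:uploads-topic",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "data/source_system=SystemA/"}
          ]
        }
      }
    }
  ]
}
```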
This is almost perfect, but the hard limit on S3 event notification configurations is 100, so it only works for up to 100 source systems - not for hundreds.
Idea 5 - good one
Let's think about other services that can catch PutObject actions. A quite obvious one is CloudWatch Events and its rules.
To catch S3 object-level operations, CloudTrail must be enabled for the S3 bucket, and it comes with an additional cost of $0.10 per 100,000 data events. Not much, but it might become significant, especially when you duplicate management events in different trails.
CloudWatch rule
Let's start with a CloudWatch Rule that catches "PutObject" operations on the "tb-athena-source" S3 bucket with keys like "data/source_system=SystemA/*".
Event pattern could look like this:
{
"detail-type": [
"AWS API Call via CloudTrail"
],
"source": [
"aws.s3"
],
"detail": {
"eventSource": [
"s3.amazonaws.com"
],
"requestParameters": {
"bucketName": [
"tb-athena-source"
],
"key": [
"data/source_system=SystemA/*"
]
},
"eventName": [
"PutObject"
]
}
}
Looks good, but it doesn't work. The asterisk doesn't act as a wildcard in this case.
Another try - content-based filtering with event patterns
Event pattern can look like this:
{
"detail-type": [
"AWS API Call via CloudTrail"
],
"source": [
"aws.s3"
],
"detail": {
"eventSource": [
"s3.amazonaws.com"
],
"requestParameters": {
"bucketName": [
"tb-athena-source"
],
"key": [
{"prefix": "data/source_system=SystemA"}
]
},
"eventName": [
"PutObject"
]
}
}
However, trying to set it via the console ends up with the following error:
Event pattern contains invalid element (can only be Strings enclosed in quotes, numbers, and the unquoted keywords true, false, and null)
Maybe the CLI is better for this. Let's save the event pattern in an event.json file and try:
aws --region eu-west-1 events put-rule --name put-object-test --event-pattern file://$(pwd)/event.json
No errors, hurray :) and the event pattern is updated.
So the UI is not consistent with the CLI.
CloudTrail
Without CloudTrail configured there are no events for the CloudWatch Rule to catch, so let's create a trail. It's a rather simple process and you should end up with a similar configuration:
From now on, each "Write" operation is logged by CloudTrail, and only the ones with "data/source_system=SystemA" are caught by the CloudWatch Rule. Just upload a file to "data/source_system=SystemA/" and check the Rule or Lambda metrics.
I uploaded 198 files to s3://tb-athena-source/data/source_system=SystemA/ and each upload invoked the Lambda function - as expected. I can find an execution log for each one in CloudWatch Logs.
Uploading files to s3://tb-athena-source/data/source_system=SystemB/ does not invoke the Lambda.
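For completeness, the Lambda triggered by the rule now receives a different event shape than in idea 1: the CloudTrail "AWS API Call" envelope matched by the pattern above, with the bucket and key under detail.requestParameters. A minimal handler sketch (the return value is just a placeholder for real processing):

```python
def handle_put_object(event, context=None):
    """Lambda invoked by the CloudWatch Rule for CloudTrail PutObject events."""
    params = event["detail"]["requestParameters"]
    bucket = params["bucketName"]
    key = params["key"]
    # Thanks to the rule's prefix filter, only SystemA uploads reach this point,
    # so no extra key checking is needed here.
    return f"s3://{bucket}/{key}"
```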
We are done, task completed let's go for a beer :)
Not yet...
CloudTrail catches all write operations on the tb-athena-source S3 bucket, but we only need events for the source_system=SystemA prefix. What can we do about it?
Let's go back to the CloudTrail console. There is a Prefix column in the Data events: S3 section, so we should be able to narrow down the caught events. The UI says NO WAY.
With the CloudWatch Rule event pattern the CLI worked...
First, we need the event selectors for the created trail:
aws --region eu-west-1 cloudtrail get-event-selectors --trail-name trail > event-selectors.json
It should look like this:
{
"TrailARN": "arn:aws:cloudtrail:eu-west-1:<AccountId>:trail/trail",
"EventSelectors": [
{
"ReadWriteType": "All",
"IncludeManagementEvents": true,
"DataResources": [],
"ExcludeManagementEventSources": []
},
{
"ReadWriteType": "WriteOnly",
"IncludeManagementEvents": false,
"DataResources": [
{
"Type": "AWS::S3::Object",
"Values": [
"arn:aws:s3:::tb-athena-source/"
]
}
],
"ExcludeManagementEventSources": []
}
]
}
We just need the EventSelectors list from this file, so use your favorite editor to remove the unnecessary lines and add data/source_system=SystemA/ to the DataResources list:
[
{
"ReadWriteType": "All",
"IncludeManagementEvents": true,
"DataResources": [],
"ExcludeManagementEventSources": []
},
{
"ReadWriteType": "WriteOnly",
"IncludeManagementEvents": false,
"DataResources": [
{
"Type": "AWS::S3::Object",
"Values": [
"arn:aws:s3:::tb-athena-source/data/source_system=SystemA"
]
}
],
"ExcludeManagementEventSources": []
}
]
Now use CLI to update event selectors:
aws --region eu-west-1 cloudtrail put-event-selectors \
--trail-name trail --event-selectors file://$(pwd)/event-selectors.json
Again, no errors and trail event selectors are updated with required prefix:
This setup works as expected.
CloudTrail catches write operations only for the specified bucket and prefix, and the CloudWatch Rule invokes Lambda under the same conditions.
Conclusion
It was a quite time-consuming adventure, but it showed different options for targeted notifications.
It also showed operational discrepancies between the AWS Console and the CLI.
Be careful with CloudTrail
When you set up a CloudTrail trail, you must create a new S3 bucket (or provide an existing one) to store the log files. Next, you can select which buckets' events are logged. It's tempting to select Read and Write for "All current and future S3 buckets".
Doing so with the trail's log bucket in the same account creates a feedback loop: writing each log file to the trail's S3 bucket is itself a write operation that gets logged, inflating the volume of log files.
The simplest solution is to create a new AWS account used only for audit/trail logs.



