Have you ever tried to figure out how large a folder containing millions of objects is in S3? Have you ever wondered what percentage of your objects are encrypted, or whether any objects have public ACLs? Unless you are willing to wait a very long time for an AWS CLI command to finish, S3 Inventory reports are the answer to these questions.
This blog post covers how to set up S3 Inventory reports, query that data in Parquet, and identify where unnecessary data might be stored. It goes beyond the AWS documentation on querying S3 Inventory with Amazon Athena by adding examples, infrastructure, and a deep dive into the setup.
What are S3 Inventory Reports?
Amazon S3 Inventory is a feature of Amazon S3 that enables users to generate reports about the objects stored in their S3 bucket. These reports provide information about the objects such as object key, size, ETag, version, storage class, and metadata. S3 Inventory reports can be used for a variety of purposes, such as tracking changes to objects, identifying stale objects that can be deleted, and auditing the access permissions of objects.
S3 Inventory reports come in three output formats: CSV, Apache optimized row columnar (ORC), and Apache Parquet. CSV reports are simple comma-separated value files that can be opened in a text editor or spreadsheet application. Parquet and ORC are columnar storage formats that are optimized for big data analytics and supported by a number of data processing systems such as Athena. S3 Inventory reports can be generated on a daily or weekly basis and delivered to an S3 bucket of your choosing.
S3 Inventory can be configured through the Amazon S3 Management Console, the AWS CLI, or the Amazon S3 API. Users specify the bucket to be inventoried, the destination for the reports, and the frequency of report generation. Users can also limit the inventory to objects that share a key prefix. In addition, optional fields such as the ETag can be added to each record, and the version ID is included when the inventory covers all object versions.
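As a rough illustration of the API route, the sketch below builds the kind of configuration dictionary that boto3's `put_bucket_inventory_configuration` call accepts. The function name, the configuration ID, and the bucket names are hypothetical, and the selection of optional fields is only an assumption about what you might want to report on. The actual API call is left as a comment because it needs AWS credentials.

```python
from typing import Optional


def build_inventory_configuration(
    config_id: str,
    destination_bucket: str,
    destination_prefix: str = "inventory",
    key_prefix: Optional[str] = None,
    frequency: str = "Daily",
) -> dict:
    """Build an InventoryConfiguration dict for a Parquet inventory report."""
    if frequency not in ("Daily", "Weekly"):
        raise ValueError("frequency must be 'Daily' or 'Weekly'")

    config = {
        "Id": config_id,
        "IsEnabled": True,
        # "All" lists every object version; "Current" lists only the latest.
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": frequency},
        "Destination": {
            "S3BucketDestination": {
                # The destination is given as a bucket ARN, not a bare name.
                "Bucket": f"arn:aws:s3:::{destination_bucket}",
                "Format": "Parquet",
                "Prefix": destination_prefix,
            }
        },
        # Assumed selection of optional fields; trim or extend as needed.
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass",
            "ETag",
            "EncryptionStatus",
        ],
    }
    if key_prefix:
        # Only inventory objects whose key starts with this prefix.
        config["Filter"] = {"Prefix": key_prefix}
    return config


# Hypothetical names, for illustration only.
config = build_inventory_configuration(
    "daily-parquet", "example-inventory-destination", key_prefix="logs/"
)
print(config)

# Applying it requires boto3 and credentials, for example:
#   import boto3
#   boto3.client("s3").put_bucket_inventory_configuration(
#       Bucket="example-bucket", Id=config["Id"], InventoryConfiguration=config
#   )
```

The destination bucket also needs a bucket policy that allows S3 to deliver the reports, which this sketch does not cover.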
Why use S3 Inventory Reports?
Inventory reporting can help you answer a wide variety of questions about the contents of your S3 bucket, especially when that bucket is extremely large. Do you have a multi-petabyte data lake bucket? A logging bucket with millions of objects? Inventory reporting can give you an idea of what lies within them without hitting API limits or waiting for an extended period of time. Below are some of the questions it can help answer.
How old are objects within my bucket?
How large are the objects within my bucket?
How large are the keys/folders within my bucket?
What objects are encrypted?
What storage tiers do my objects live in?
How much data will be deleted when implementing a lifecycle policy?
What can inventory reports help you achieve?
Why should you set up S3 Inventory reporting? Once a bucket grows past millions of objects, getting an idea of where your data lives can become challenging. Here are some goals that inventory reporting can help with.
Find forgotten data that should be deleted
Identify large directories
Optimize file sizes per object
Track objects and sizes over time
What is in Inventory Reports?
The following data points are available in S3 Inventory reports.
Bucket
Key
Version
Is Latest
Deletion Markers
Size
Last Modified Date
ETag
Storage Class
Multipart Upload Flag
Replication Status
Encryption Status
Object Lock Retain Until Date
Object Lock Mode
Object Lock Legal Hold Status
Intelligent Tiering Access Tier
Bucket Key Status
Checksum Algorithm
Querying the Data
This section covers how to make this data valuable to you and how to retrieve specific information about a bucket through Amazon Athena. The queries make some assumptions about your setup, which follow from how it is deployed through our Terraform module. We will do a deep dive into the infrastructure shortly after.
Date
Depending on the time of day, the most recent inventory report was most likely delivered the previous day, so you need to specify dt relative to yesterday. You can do so with the following in your WHERE clause.
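One way to sketch this is to compute yesterday's dt value in Python and drop it into the WHERE clause. This assumes the dt partition uses the `yyyy-MM-dd-HH-mm` layout with a midnight timestamp (for example `2024-01-31-00-00`). If your table partitions differently, adjust `DT_FORMAT`. The helper names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: dt values look like "2024-01-31-00-00" (midnight timestamp).
DT_FORMAT = "%Y-%m-%d-00-00"


def inventory_dt(today: date, days_back: int = 1) -> str:
    """Return the dt partition value for the report `days_back` days before `today`."""
    return (today - timedelta(days=days_back)).strftime(DT_FORMAT)


def dt_where_clause(today: date) -> str:
    """Build a WHERE clause that pins a query to yesterday's report."""
    return f"WHERE dt = '{inventory_dt(today)}'"


print(dt_where_clause(date.today()))
```

Pinning the query to a single dt value also keeps Athena from scanning every daily report in the table.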
Query Objects
To give you an idea of what the data looks like, this query returns the first 10 rows in the table.
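A minimal sketch of such a query, built as a string in Python so the table name and dt value can be swapped in. The table name `s3_inventory` and the dt value are placeholders, not names taken from the module.

```python
def sample_rows_query(table: str, dt: str, limit: int = 10) -> str:
    """Build an Athena query returning the first `limit` rows of one report."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT *\nFROM {table}\nWHERE dt = '{dt}'\nLIMIT {limit}"


# Hypothetical table name and partition value.
print(sample_rows_query("s3_inventory", "2024-01-31-00-00"))
```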
Size of the Bucket
To calculate the total size of the bucket you can perform this query.
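As a sketch, the total is a single SUM over the size column for one report. The table name is a placeholder, and the query assumes size is stored as a numeric column, as it typically is for Parquet reports. Athena returns the result in bytes, so a small helper for a readable figure is included.

```python
def bucket_size_query(table: str, dt: str) -> str:
    """Build an Athena query that sums object sizes for one inventory report."""
    return (
        "SELECT SUM(size) AS total_bytes\n"
        f"FROM {table}\n"
        f"WHERE dt = '{dt}'"
    )


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB")
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} {units[-1]}"


# Hypothetical table name and partition value.
print(bucket_size_query("s3_inventory", "2024-01-31-00-00"))
print(human_size(3 * 1024**4))
```

If the inventory covers all object versions, the sum includes noncurrent versions too. Adding a filter on is_latest restricts it to current objects.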
Calculate Folder/Key Size
S3 does not have the concept of a "directory" or a "folder", only common key prefixes, but you can still calculate the size of a "folder" with this query. In this example we will try to figure out how large CloudTrail logs are per region under keys such as s3://example-bucket/CloudTrail/0000000000/us-east-1.
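A sketch of one way to do this: group on the key segment that follows the prefix, using Athena's `split_part` function, and sum the sizes per group. The table name and the helper names are hypothetical, and the prefix is taken from the example above.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def prefix_size_query(table: str, dt: str, prefix: str) -> str:
    """Build an Athena query that sums sizes per key segment under `prefix`."""
    parts = [p for p in prefix.split("/") if p]
    normalized = "/".join(parts) + "/" if parts else ""
    # split_part is 1-indexed, so the segment after the prefix is len(parts) + 1.
    segment = len(parts) + 1
    return (
        f"SELECT split_part(key, '/', {segment}) AS child,\n"
        "       SUM(size) AS total_bytes\n"
        f"FROM {table}\n"
        f"WHERE dt = {sql_literal(dt)}\n"
        f"  AND key LIKE {sql_literal(normalized + '%')}\n"
        "GROUP BY 1\n"
        "ORDER BY total_bytes DESC"
    )


# With this prefix, each "child" is a region such as us-east-1.
print(prefix_size_query("s3_inventory", "2024-01-31-00-00", "CloudTrail/0000000000/"))
```

Note that `_` and `%` inside the prefix act as LIKE wildcards, so a prefix containing them may match more keys than intended.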
Calculate Lifecycle Removal
This query retrieves the total number of bytes that would be removed for objects older than 90 days under the key pattern directory/test/%.
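A sketch of such a query is below. It computes the 90-day cutoff in Python and compares it against last_modified_date, which assumes that column is a timestamp in your table. If it is stored as a string or an epoch number, the comparison needs adjusting. The table name and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def lifecycle_removal_query(
    table: str, dt: str, key_pattern: str, as_of: datetime, days: int = 90
) -> str:
    """Build an Athena query summing bytes last modified more than `days` days before `as_of`."""
    cutoff = (as_of - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT SUM(size) AS bytes_to_remove\n"
        f"FROM {table}\n"
        f"WHERE dt = '{dt}'\n"
        f"  AND key LIKE '{key_pattern}'\n"
        # Assumption: last_modified_date is a timestamp column.
        f"  AND last_modified_date < TIMESTAMP '{cutoff}'"
    )


# Hypothetical table name; dt is the report from the day before `as_of`.
print(
    lifecycle_removal_query(
        "s3_inventory", "2024-03-31-00-00", "directory/test/%", datetime(2024, 4, 1)
    )
)
```

The result is only an estimate of what a lifecycle rule would remove, since the report is a snapshot and objects may change before the rule runs.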
Deployment Deep Dive
Caution
Be sure to wait a full day for inventory reports to get delivered!
To set up S3 Inventory reports on your S3 bucket, you can leverage our Terraform module on GitHub. Below is how to do so through Terragrunt.
Terraform: Inventory Reports
Quickly diving into the infrastructure, this creates the inventory configuration for every bucket you have specified in your s3_inventory_configuration variable with their respective configuration.
Terraform: Glue Table and Database
This is the infrastructure necessary to query your data with Athena.
Conclusion
Hopefully this helped you deploy S3 Inventory reports, query that data with Athena, and answer questions about your objects that you may not have known how to answer before.