S3 File Management With The Boto3 Python SDK

Modify and manipulate thousands of files in your S3 (or DigitalOcean) Bucket.

Todd Birchard

It's incredible the things human beings can adapt to in life-or-death circumstances, isn't it? In this particular case it wasn't my personal life in danger, but rather the life of this very blog. I will allow for a brief pause while the audience shares gasps of disbelief. We must stay strong and collect ourselves from such distress.

Like most things I despise, the source of this unnecessary headache was a SaaS product. I won't name any names here, but it was Cloudinary. Yep, totally them. We'd been using their (supposedly) free service for hosting our blog's images for about a month now. This may be a lazy alternative to a true CDN, sure, but there's only so much we can do when well over half of Ghost's 'officially recommended' storage adapters are deprecated or broken. That's a whole other thing.

I'll spare the details, but at some point we reached one of the 5 or 6 rate limits on our account which had conveniently gone unmentioned (official violations include storage, bandwidth, lack of galactic credits, and a refusal to give up Park Place from the previously famous McDonald's Monopoly game. Seriously though, why not ask for Broadway?). The terms were simple: pay 100 dollars of protection money to the sharks within a matter of days. Or, ya know, don't.

Weapons Of Mass Content Delivery

Hostage situations aside, the challenge was on: how could we move thousands of images to a new CDN within hours, before losing all of our data and without experiencing significant downtime? Some further complications:

  • There’s no real “export” button on Cloudinary. Yes, I know, they’ve just recently released some REST API that may or may not generate a zip file of a percentage of your files at a time. Great.
  • We’re left with 4-5 duplicates of every image. Every time a transform is applied to an image, it leaves behind unused duplicates.
  • We need to revert to the traditional YYYY/MM folder structure, which was destroyed.

This is gonna be good. You'd be surprised what can be MacGyvered out of a single Python library and a few SQL queries. Let's focus on Boto3 for now.

Boto3: It's Not Just for AWS Anymore

DigitalOcean offers a dead-simple CDN service which just so happens to be fully compatible with Boto3. Let's not linger on that fact too long before we consider the possibility that DO is just another AWS reseller. Moving on.

Initial Configuration

Setting up Boto3 is simple, as long as you can manage to find your API key and secret:

"""Initialize session client with DigitalOcean Spaces or AWS S3."""
from os import getenv
from botocore.client import Config
import boto3


session = boto3.session.Session()
client = session.client(
    's3',
    region_name='nyc3',
    endpoint_url='https://nyc3.digitaloceanspaces.com',
    aws_access_key_id=getenv('KEY'),
    aws_secret_access_key=getenv('SECRET')
)
Create boto3 session & configure an S3 client

From here forward, whenever we need to reference our 'bucket', we do so via client.
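One gotcha with the setup above: getenv() quietly hands back None when a variable is missing, and you only find out once a request fails. Here's a minimal sketch of a guard, assuming the same KEY and SECRET variable names. The helper name spaces_client_kwargs and its region argument are my own inventions for illustration, not part of Boto3:

```python
from os import environ


def spaces_client_kwargs(env=environ, region='nyc3'):
    """Build keyword arguments for session.client(), failing loudly on missing credentials.

    Hypothetical helper: the name and the 'region' default are assumptions.
    """
    # Collect the names of any variables that are unset or empty
    missing = [name for name in ('KEY', 'SECRET') if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        'region_name': region,
        'endpoint_url': f'https://{region}.digitaloceanspaces.com',
        'aws_access_key_id': env['KEY'],
        'aws_secret_access_key': env['SECRET'],
    }
```

With that in place, the client could be created as session.client('s3', **spaces_client_kwargs()).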

Fast Cut Back To Our Dramatic Storyline

In our little scenario, I took a first stab at populating our bucket as a rough pass. I created our desired folder structure and tossed everything we owned hastily into said folders, mostly by rough guesses and by gauging the publish date of posts. So we've got our desired folder structure, but the content is a mess.

CDN
├── Posts
│   ├── /2017
│   │   ├── 11
│   ├── /2018
│   │   ├── 03
│   │   ├── 04
│   │   ├── 05
│   │   ├── 06
│   │   ├── 07
│   │   ├── 08
│   │   ├── 09
│   │   ├── 10
│   │   ├── 11
│   │   └── 12
│   ├── /2019
│   │   ├── 01
│   │   └── 02
│   └── /lynx
├── /bunch
├── /of
├── /other
└── /shit
CDN File Structure

So we're dealing with a three-tiered folder hierarchy here. You're probably thinking "oh great, this is where we recap some basics about recursion for the 100th time..." but you're wrong! Boto3 deals with the pains of recursion for us if we so please. If we were to run client.list_objects_v2() on the root of our bucket, Boto3 would return the file path of every single file in that bucket regardless of where it lives (in pages of up to 1,000 keys per call).

Letting an untested script run wild and make transformations to your production data sounds like fun and games, but I'm not willing to risk losing the hundreds of god damned Lynx pictures I draw every night for a mild sense of amusement. Instead, we're going to have Boto3 loop through each folder one at a time so when our script does break, it'll happen in a predictable way that we can just pick back up. I guess that means.... we're pretty much opting into recursion. Fine, you were right.
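One detail worth knowing before leaning on list_objects_v2(): each call returns at most 1,000 keys, so anything bigger comes back in pages. Here's a minimal sketch of walking a single folder page by page. The helper name iter_keys, and the choice to pass client in as an argument, are mine and purely for illustration:

```python
def iter_keys(client, bucket, prefix=''):
    """Yield every object key under `prefix`, following pagination.

    `client` is assumed to be a Boto3 S3 client like the one configured above.
    """
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is absent from the response when nothing matches
        for item in response.get('Contents', []):
            yield item['Key']
        if not response.get('IsTruncated'):
            break
        # Ask for the next page on the following request
        kwargs['ContinuationToken'] = response['NextContinuationToken']
```

Something like `for key in iter_keys(client, 'hackers', 'posts/2018/11/')` would then visit one month at a time. Boto3 also ships a paginator for this call (client.get_paginator('list_objects_v2')) if you'd rather not manage the token yourself.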

The Art of Retrieving Objects

Running client.list_objects_v2() sure sounded straightforward when I omitted all the details, but this method can achieve some quite powerful things for its size. list_objects_v2 is essentially our bread and butter behind this script. "But why list_objects_v2 instead of list_objects," you may ask? I don't know, because AWS is a bloated shit show? Does Amazon even know? Why don't we ask their documentation?

Well that explains... Nothing.

Well, I'm sure list_objects had a vulnerability or something. Surely it's been sunsetted by now. Anything else just wouldn't make any sense.

...Oh. It's right there. Next to version 2.

That's the last time I'll mention that AWS sucks in this post... I promise.

Getting All Folders in a Subdirectory

To humor you, let's see what getting all objects in a bucket would look like:

...

def get_all_objects():
    """Recursively fetch folders & objects in directory."""
    return client.list_objects_v2(
        Bucket='hackers',
        Delimiter='',       # no delimiter: keys are not grouped into folders
        EncodingType='url',
        MaxKeys=1000,       # the most a single call will return
        Prefix='',          # empty prefix matches every key in the bucket
        FetchOwner=False
    )
Fetch all objects in an S3 directory & subdirectories.

We've passed pretty much nothing meaningful to list_objects_v2(), so it will come back to us with every file, folder, woman and child it can find in your poor bucket with great vengeance and furious anger:

oh god oh god oh god

Here, I'll even be fair and only return the file names/paths instead of each object:

Ah yes, totally reasonable for thousands of files.

Instead, we'll solve this like Gentlemen. Oh, but first, let's clean those god-awful strings being returned as keys. That simply won't do, so build yourself a function. We'll need it.

...
from urllib.parse import unquote

...


def sanitize_object_key(obj):
    """Replace character encodings with actual characters."""
    new_key = unquote(unquote(obj))
    return new_key

That's better.
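To see why unquote() gets applied twice, here's a quick standalone check. The sample key is made up for illustration; it imitates a name that was percent-encoded twice, which is the case a single unquote() leaves half-decoded:

```python
from urllib.parse import unquote


def sanitize_object_key(obj):
    """Replace character encodings with actual characters."""
    return unquote(unquote(obj))


# Hypothetical key: the space became %20, then the % itself became %25
double_encoded = 'posts/2018/11/lynx%2520drawing.jpg'

print(unquote(double_encoded))              # posts/2018/11/lynx%20drawing.jpg
print(sanitize_object_key(double_encoded))  # posts/2018/11/lynx drawing.jpg
```

The trade-off: a key that legitimately contains a percent sign followed by two hex digits would also get rewritten by the second pass, so this suits messy image names better than arbitrary data.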

...

def get_subdirectories(objects):
    """Retrieve all subdirectories within directory."""
    return [item['Key'] for item in objects['Contents']]
    

# List subdirectories of all fetched objects
objects = get_all_objects()
get_subdirectories(objects)

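Check out list_objects_v2() this time. To get folders rather than files, we restrict the listing to the directory we want with Prefix='posts/' and further specify Delimiter='/'. With a delimiter set, everything nested below a subfolder is rolled up into a single entry under the response's CommonPrefixes key instead of being listed file by file under Contents. This gives us a nice list of folders to walk through, one by one.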
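For the record, here's roughly what a folders-only call could look like. Treat it as a sketch under assumptions rather than the exact code from this project: list_folders is a name I made up, and it takes client as an argument so it can be tried against any Boto3 S3 client:

```python
def list_folders(client, bucket, prefix):
    """Return the immediate 'subfolders' of `prefix` as a list of prefixes."""
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,   # only look under this folder, e.g. 'posts/'
        Delimiter='/'    # roll deeper keys up into CommonPrefixes
    )
    # Each entry looks like {'Prefix': 'posts/2018/'}
    return [item['Prefix'] for item in response.get('CommonPrefixes', [])]
```

Calling list_folders(client, 'hackers', 'posts/') should give back the year folders, and feeding each of those back in gives the months. Pagination is left out here for brevity, which is fine for a handful of folders.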

Shit's About to go Down

We're about to get complex here and we haven't even created an entry point yet. Here's the deal below:

  • get_folders() gets us all folders within the base directory we're interested in.
  • For each folder, we loop through the contents of each folder via the get_objects_in_folder() function.
  • Because Boto3 can be janky, we need to format the strings coming back to us as "keys", also known as the "absolute paths to each object". We use the unquote feature in sanitize_object_key() quite often to fix this and return workable file paths.

from os import getenv
import json
import boto3
from botocore.client import Config
from botocore.exceptions import ClientError
from urllib.parse import unquote


# Initialize a session using DigitalOcean Spaces.
session = boto3.session.Session()
client = session.client(
    's3',
    region_name='nyc3',
    endpoint_url='https://nyc3.digitaloceanspaces.com',
    aws_access_key_id=getenv('KEY'),
    aws_secret_access_key=getenv('SECRET')
)


def sanitize_object_key(obj):
    """Replace character encodings with actual characters."""
    return unquote(unquote(obj))


def get_subdirectories(objects):
    """Retrieve all subdirectories within directory."""
    return [item['Key'] for item in objects['Contents']]


def get_all_objects():
    """Recursively fetch folders & objects in directory."""
    return client.list_objects_v2(
        Bucket='hackers',
        Delimiter='',       # no delimiter: keys are not grouped into folders
        EncodingType='url',
        MaxKeys=1000,       # the most a single call will return
        Prefix='',          # empty prefix matches every key in the bucket
        FetchOwner=False
    )

RECAP

All of this until now has been neatly assembled groundwork. Now that we have the power to quickly and predictably loop through every file we want, we can finally start to fuck some shit up.

Choose Your Own Adventure

Purge Files We Know Are Trash

This is an easy one. Surely your buckets get bloated with unused garbage over time... in my example, I somehow managed to upload a bunch of duplicate images from my Dropbox, all with the suffix (Todds-MacBook-Pro.local's conflicted copy YYYY-MM-DD). Things like that can be purged easily:

...

def purge_object(item):
    """Delete object from bucket if its key contains a banned string."""
    banned = ['Todds-iMac', 'conflicted', 'Lynx']
    if any(x in item for x in banned):
        client.delete_object(Bucket="hackers", Key=item)
        return True
    return False
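Deleting one object per request gets slow across thousands of files. S3 clients also expose delete_objects(), which accepts up to 1,000 keys per call, so a batched variant is possible. The sketch below is an assumption-laden illustration: purge_keys and its dry_run flag are names I've invented, and the dry run defaults to on so nothing is deleted until you've eyeballed the list:

```python
def purge_keys(client, bucket, keys,
               banned=('Todds-iMac', 'conflicted', 'Lynx'), dry_run=True):
    """Delete keys containing a banned substring, up to 1000 per request.

    Returns the list of matching keys. With dry_run=True nothing is deleted.
    """
    doomed = [key for key in keys if any(x in key for x in banned)]
    if dry_run:
        return doomed
    # delete_objects() takes at most 1000 keys per request
    for start in range(0, len(doomed), 1000):
        batch = doomed[start:start + 1000]
        client.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': key} for key in batch]}
        )
    return doomed
```

Run it once with the default dry_run=True, skim what comes back, then flip the flag when you're feeling brave.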

Fetch Objects from S3

If we want to apply certain image transformations, it could be a good idea to back up everything in our CDN locally. This will save all objects in our CDN to a relative path which matches the folder hierarchy of our CDN; the only catch is we need to make sure those folders exist prior to running the script:

...
from botocore.exceptions import ClientError


def fetch_object(object, bucket_name: str):
    """Download file from an S3 bucket."""
    try:
        client.download_file(
            Key=object,
            Filename=object,
            Bucket=bucket_name
        )
        print(f"Fetched object {object} from {bucket_name}")
    except ClientError as e:
        raise e
    except Exception as e:
        raise e
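About that catch: rather than creating folders by hand, the script can make them just before each download. A minimal sketch, assuming a Boto3 S3 client is passed in; fetch_object_to_path and its root argument are hypothetical names of my own:

```python
import os


def fetch_object_to_path(client, bucket_name, key, root='.'):
    """Download `key` into `root`, creating parent folders as needed."""
    destination = os.path.join(root, key)
    # Create e.g. ./posts/2018/11 before Boto3 tries to write into it
    os.makedirs(os.path.dirname(destination) or '.', exist_ok=True)
    client.download_file(Bucket=bucket_name, Key=key, Filename=destination)
    return destination
```

Since exist_ok=True makes the call harmless when the folder is already there, it's safe to run on every object.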

Upload Files to S3

After modifying our images locally, we'll need to upload the new images to our CDN:

...

def upload_object(object, new_filename: str, bucket_name: str):
    """Upload file to S3 bucket."""
    try:
        client.upload_file(
            Filename=new_filename,
            Bucket=bucket_name,
            Key=new_filename
        )
        print(f"Uploaded object {object} to {bucket_name}")
    except ClientError as e:
        raise e
    except Exception as e:
        raise e
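Since these images are meant to be served from a CDN, it may be worth knowing that upload_file() also accepts an ExtraArgs dictionary for things like the ACL and content type. The sketch below assumes you want the files publicly readable, which is my assumption and not something every bucket should do; upload_public_object is a made-up helper name:

```python
import mimetypes


def upload_public_object(client, filename, bucket_name, key=None):
    """Upload a file with a guessed Content-Type and a public-read ACL."""
    # Guess e.g. 'image/jpeg' from the file extension; None if unknown
    content_type, _ = mimetypes.guess_type(filename)
    extra_args = {'ACL': 'public-read'}  # assumption: files should be public
    if content_type:
        extra_args['ContentType'] = content_type
    client.upload_file(
        Filename=filename,
        Bucket=bucket_name,
        Key=key or filename,
        ExtraArgs=extra_args
    )
    return extra_args
```

Setting the content type explicitly helps browsers render an image instead of offering it as a download.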

Put It All Together

That should be enough to get your imagination running wild. What does all of this look like together?

from os import getenv
import json
import boto3
from botocore.client import Config
from botocore.exceptions import ClientError
from urllib.parse import unquote


# Initialize a session using DigitalOcean Spaces.
session = boto3.session.Session()
client = session.client(
    's3',
    region_name='nyc3',
    endpoint_url='https://nyc3.digitaloceanspaces.com',
    aws_access_key_id=getenv('KEY'),
    aws_secret_access_key=getenv('SECRET')
)


def sanitize_object_key(obj):
    """Replace character encodings with actual characters."""
    return unquote(unquote(obj))


def get_subdirectories(objects):
    """Retrieve all subdirectories within directory."""
    return [item['Key'] for item in objects['Contents']]


def get_all_objects():
    """Recursively fetch folders & objects in directory."""
    return client.list_objects_v2(
        Bucket='hackers',
        Delimiter='',       # no delimiter: keys are not grouped into folders
        EncodingType='url',
        MaxKeys=1000,       # the most a single call will return
        Prefix='',          # empty prefix matches every key in the bucket
        FetchOwner=False
    )
      

def fetch_object(object, bucket_name: str):
    """Download file from an S3 bucket."""
    try:
        client.download_file(
            Key=object,
            Filename=object,
            Bucket=bucket_name
        )
        print(f"Fetched object {object} from {bucket_name}")
    except ClientError as e:
        raise e
    except Exception as e:
        raise e
        
        
def upload_object(object, new_filename: str, bucket_name: str):
    """Upload file to S3 bucket."""
    try:
        client.upload_file(
            Filename=new_filename,
            Bucket=bucket_name,
            Key=new_filename
        )
        print(f"Uploaded object {object} to {bucket_name}")
    except ClientError as e:
        raise e
    except Exception as e:
        raise e
        
        
def purge_object(object):
    """Delete unwanted objects from bucket."""
    banned = ['Todds-iMac', 'conflicted', 'Lynx']
    if any(x in object for x in banned):
        client.delete_object(
            Bucket="hackers",
            Key=object
        )
        return True
    return False

Well that's a doozy.

If you feel like getting creative, there's even more you can do to optimize the assets in your bucket or CDN. For example: grabbing each image and rewriting the file in WebP format. I'll let you figure that one out on your own.

The source for this can be found on GitHub here:

Source code for tutorial found here: https://hackersandslackers.com/manage-s3-assests-with-boto3-python-sdk/ - boto3_optimize_cdn.py

Todd Birchard

Engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.