AWS S3 Excel with Python

When spreadsheets become even more annoying


Published on: March 17, 2020

Lessons from my work project

Background: you have an Excel workbook in AWS S3. And you will need a lambda function to read the workbook data and process it. Leave it to the business people who still want to use a spreadsheet as a DATABASE and making your life a little more interesting…

Do the necessary imports:

import boto3
import io
import pandas as pd # way 1
from openpyxl import load_workbook # way 2

Reading from S3 Bucket

s3 = boto3.client('s3')
s3_data = s3.get_object(Bucket=<s3 name>, Key=<s3 filename>)
contents = s3_data['Body'].read() # your Excel's essence, pretty much a stream

Way 1: With pandas

df = pd.read_excel(io.BytesIO(contents), encoding='utf-8', sheet_name="<my sheet name for workbook>")
for idx, row in df.iterrows():
   print(row['<column 1 name>'], row['<column 2 name>'], etc.)

# get a row, all columns:
df.loc[<row index>, :]

# get a cell
df.loc[<row index>, '<column name>']

Way 2: With openpyxl

# Read in data_only mode to parse Excel after all formulae evaluated
wb = load_workbook(filename=(io.BytesIO(contents)), data_only=True)
ws = wb['<my sheet name for workbook>']

# get a cell

ws[<row index>, ‘<column index>’].value


# a lil more complex, loop through and check if we have rows with strikethrough cells
for idx, row in enumerate(ws):
    for cell in row:
        if cell.value:
            value = cell.value # Note, empty cells have a value of None
            
            if cell.font.strike:
                invalidRows.add(idx)
                continue
            print(idx, cell, value)
    #       print(cell.font) # See more attributes
print(invalidRows)

wb.close()

Sources