Subscribe For Updates
Lessons from my work project
Background: you have an Excel workbook in AWS S3. And you will need a lambda function to read the workbook data and process it. Leave it to the business people who still want to use a spreadsheet as a DATABASE and making your life a little more interesting…
Do the necessary imports:
import boto3
import io
import pandas as pd # way 1
from openpyxl import load_workbook # way 2
Reading from S3 Bucket
s3 = boto3.client('s3')
s3_data = s3.get_object(Bucket=<s3 name>, Key=<s3 filename>)
contents = s3_data['Body'].read() # your Excel's essence, pretty much a stream
Way 1: With pandas
df = pd.read_excel(io.BytesIO(contents), encoding='utf-8', sheet_name="<my sheet name for workbook>")
for idx, row in df.iterrows():
print(row['<column 1 name>'], row['<column 2 name>'], etc.)
# get a row, all columns:
df.loc[<row index>, :]
# get a cell
df.loc[<row index>, '<column name>']
Way 2: With openpyxl
# Read in data_only mode to parse Excel after all formulae evaluated
wb = load_workbook(filename=(io.BytesIO(contents)), data_only=True)
ws = wb['<my sheet name for workbook>']
# get a cell
ws[<row index>, ‘<column index>’].value
# a lil more complex, loop through and check if we have rows with strikethrough cells
for idx, row in enumerate(ws):
for cell in row:
if cell.value:
value = cell.value # Note, empty cells have a value of None
if cell.font.strike:
invalidRows.add(idx)
continue
print(idx, cell, value)
# print(cell.font) # See more attributes
print(invalidRows)
wb.close()