I’ve been trying to expand my ETL repertoire by utilizing other tools. In this post, I’ll go over at a high-level of how I replaced an old SSIS processes with Python; and yes, it’s a simple process, but you have to start somewhere right?
End Goal
The original process takes an Excel file, parses it into a SQL Server database, then extracts that data to a fixed-length file. The data in this post mimics the data I used in the original process – so don’t worry, I’m still compliant. The process is culmination of years of having no business rules and “on-the-fly” processing. While the overall idea is still simple, I think an SSIS package is overly complex and bulky for what needs to be done. So, the end goal is to parse an Excel file, transform the data, and then write it to the fixed-length file – only with Python.
Source
Right off the bat, there are a few things I didn’t like about the data and I know that I need to complete the following items:
- Make sure missing values don’t come through as nan or NULL
- Change date_issued to a MMDDYYYY format
- I need to strip the dollar sign from amount_1 and make sure it’s a XX.XX format
- Format all columns to have no underscore to make it easier for coding
- Add a new formtype column: If amount_1 > 2000.00 then A else N1
- Data is left-aligned in output
- Data columns fit in their specified positions
Python Code
Instead of going through the iterations of how the script turned out and each time I found a problem, I’m going to show the final script and comment where each problem was resolved by the number. I know it’s an ugly look in the box on this post, but you can copy it to notepad to make it legible:
import pandas as pd
import sys, getopt, pyodbc
print("starting program...")
df = pd.read_excel("MOCK_DATA.xlsx", sheet_name="data")
df.columns = df.columns.str.strip().str.upper().str.replace(' ', '').str.replace('_', '') # NUMBER 4
output = open("output.txt", "w+")
counter = 0
for index, row in df.iterrows():
output.write( "%-36s%-20s%-8s%-10s%-50s%-30s%-2s%-5s%-2s%-0s" % # NUMBER 6
(str(row.ID)[:36],
str(row.VENDORNAME)[:20],
str("0" + str(row.DATEISSUED).split('/')[0])[-2:] + str("0" + str(row.DATEISSUED).split('/')[1])[-2:] + str(row.DATEISSUED).split('/')[2], # NUMBER 2
str('{:.2f}'.format(float(str(row.AMOUNT1).replace('$', ''))))[:10], # NUMBER 3
str(row.ADDRESS).replace('nan', '')[:50], # NUMBER 1
str(row.CITY)[:30],
str(row.STATE).replace('nan', '')[:2], # NUMBER 1
str(row.POSTALCODE)[:5],
str("A" if float(str(row.AMOUNT1).replace('$', '')) >= 2000.00 else "N1"), # NUMBER 5
"\n"))
counter += 1
print("wrote {0} lines...".format(counter))
output.close()
print("program completed...")
Comparison
While I unfortunately don’t have images from the SSIS package, my assumption is that you can envision one. Now let’s compare the differences in creating and managing an SSIS package and a Python script.
Initial Creation
With an SSIS package, we needed to do the following:
- Create a database table that will be imported into
- Create an SSIS package which is mapped to the Excel file
- Create connections to the Excel file, output file, and the database table
- Create a SQL script to pull from the database table
With a Python script:
- Load the Excel file into a pandas dataframe
- Write to the output while simultaneously transforming the data
Metadata Changes
Whenever we have metadata changes such as a different format for a column or a change in length of the output, the Python script is easier to make those changes. With an SSIS package, you’ll need to alter the connection manager and it can become tedious to make sure fields are appropriately set. Often, it’ll require one to go into advanced settings or it may be easier to start from scratch to get the correct output.
Conclusion
While the original SSIS process isn’t as complex as others, I think it’s a good stepping stone to open up the doors to what could be. It was a fun exercise and I’m looking forward to using more Python in the future.