Python – Replacing an SSIS Process

I’ve been trying to expand my ETL repertoire by utilizing other tools. In this post, I’ll go over at a high-level of how I replaced an old SSIS processes with Python; and yes, it’s a simple process, but you have to start somewhere right?

End Goal

The original process takes an Excel file, parses it into a SQL Server database, then extracts that data to a fixed-length file. The data in this post mimics the data I used in the original process – so don’t worry, I’m still compliant. The process is culmination of years of having no business rules and “on-the-fly” processing. While the overall idea is still simple, I think an SSIS package is overly complex and bulky for what needs to be done. So, the end goal is to parse an Excel file, transform the data, and then write it to the fixed-length file – only with Python.

Source

Right off the bat, there are a few things I didn’t like about the data and I know that I need to complete the following items:

Make sure missing values don’t come through as nan or NULL
Change date_issued to a MMDDYYYY format
I need to strip the dollar sign from amount_1 and make sure it’s a XX.XX format
Format all columns to have no underscore to make it easier for coding
Add a new formtype column: If amount_1 > 2000.00 then A else N1
Data is left-aligned in output
Data columns fit in their specified positions

Python Code

Instead of going through the iterations of how the script turned out and each time I found a problem, I’m going to show the final script and comment where each problem was resolved by the number. I know it’s an ugly look in the box on this post, but you can copy it to notepad to make it legible:

import pandas as pd
import sys, getopt, pyodbc

print("starting program...")

df = pd.read_excel("MOCK_DATA.xlsx", sheet_name="data")
df.columns = df.columns.str.strip().str.upper().str.replace(' ', '').str.replace('_', '') # NUMBER 4
output = open("output.txt", "w+")

counter = 0
for index, row in df.iterrows():
        output.write(  "%-36s%-20s%-8s%-10s%-50s%-30s%-2s%-5s%-2s%-0s" % # NUMBER 6
                        (str(row.ID)[:36],
                        str(row.VENDORNAME)[:20],
                        str("0" + str(row.DATEISSUED).split('/')[0])[-2:] + str("0" + str(row.DATEISSUED).split('/')[1])[-2:] + str(row.DATEISSUED).split('/')[2], # NUMBER 2
                        str('{:.2f}'.format(float(str(row.AMOUNT1).replace('$', ''))))[:10], # NUMBER 3
                        str(row.ADDRESS).replace('nan', '')[:50], # NUMBER 1
                        str(row.CITY)[:30],
                        str(row.STATE).replace('nan', '')[:2],    # NUMBER 1
                        str(row.POSTALCODE)[:5],
                        str("A" if float(str(row.AMOUNT1).replace('$', '')) >= 2000.00 else "N1"), # NUMBER 5
                        "\n"))
        counter += 1

print("wrote {0} lines...".format(counter))
output.close()
print("program completed...")

Comparison

While I unfortunately don’t have images from the SSIS package, my assumption is that you can envision one. Now let’s compare the differences in creating and managing an SSIS package and a Python script.

Initial Creation

With an SSIS package, we needed to do the following:

Create a database table that will be imported into
Create an SSIS package which is mapped to the Excel file
Create connections to the Excel file, output file, and the database table
Create a SQL script to pull from the database table

With a Python script:

Load the Excel file into a pandas dataframe
Write to the output while simultaneously transforming the data

Metadata Changes

Whenever we have metadata changes such as a different format for a column or a change in length of the output, the Python script is easier to make those changes. With an SSIS package, you’ll need to alter the connection manager and it can become tedious to make sure fields are appropriately set. Often, it’ll require one to go into advanced settings or it may be easier to start from scratch to get the correct output.

Conclusion

While the original SSIS process isn’t as complex as others, I think it’s a good stepping stone to open up the doors to what could be. It was a fun exercise and I’m looking forward to using more Python in the future.