I’ve always stated that if I can’t explain topics at a basic level so everyone can understand, I probably don’t know it enough. Having said that, I wanted to provide a simple, concise, and quick overview of extract, transform, and loading (ETL) data.
Regardless of your job, there will always be a need for data. There will always be a need to extract that data, alter/clean/transform that data, and it will always need to go somewhere. There may even be some challenging times where you need to use multiple data sources, combine them, and then provide it to someone. At a basic level, this is an example of how you would approach the multiple (two in this case) source problem.
Scenario
Let’s say you run two separate grocery shops called G-Store 1 (GS1) and G-Store 2 (GS2), and they have their own way of reporting individual sales. As an example, GS1 decides to report a sale for each individual item, line-by-line like so:
Because GS2 likes to group items, they report of the same order would show like so:
The goal now is to combine the two reports and provide a single standard report. How would we go about that?
Extract
The first thing to do is determining how to extract the data. There are a number of ways that data can be extracted. For instance, you may have a third-party application that has an export process, you can use business intelligence (BI) tools like Alteryx or SQL Server Integration Services, or good ole’ copy and paste from a screen.
For our scenario, we’ll assume each store has a feature in the POS that will output daily sales into Excel spreadsheets by clicking a button. Here are the reports:
GS1 Sales
GS2 Sales
Transform
Now that we have our reports, the next step is to transform this data to a standard report. The first thing is to figure out what that report needs to look like. For simplicity, we’re going to say we want a report that tells us the breakdown of each item sold, the total number of items sold, and the total amount for the purchase date.
GS1 Sales Transformed:
GS2 Sales Transformed:
Combined Report:
Other Notes
In this example, all we did was group the data in GS1 to match GS2 reports and blend the two. There are other things that can happen during this phase such as cleansing, removing data, adding data, and changing information.
Load
The loading aspect in ETL is the final destination for your task. This could be an Excel file, csv file, tab-delimited file, a database, or again – copy and paste into an email. In this example, we’d paste the data from the Combined Report into a spreadsheet and send it off.
Though this may be the destination for this current task, this doesn’t necessarily mean that this data won’t go through a different ETL process for other purposes as well.
Tips
I showed you an overly simplified version of an ETL process that can all be done via Excel. When there are gigs worth of data, Excel probably won’t be able to manage all that and you’ll need to use more powerful ETL tools. One advice for any ETL process that is being developed, try to create a standard process workflow where you can easily run incoming data through. It will save you a ton of time. Another tip would be to flush out all requirements, questions, and caveats before you write a line of code or do anything. This will also save you time in two ways – you know exactly what is needed and that you may come to find you don’t need to do anything at all.
I hope this was helpful!