Companies usually have two ways of handling their data processes: batch or real-time (or near-real-time) systems. In this blog, I’ll focus on explaining what batch processing is. So, what exactly is batch processing?
Batch processing is defined by Wikipedia as “…execution of a series of jobs in a program on a computer without manual intervention”. To oversimplify, it’s a bunch of jobs (or tasks) that automatically run one after another. Each job has a responsibility to “do something”, whether that’s moving files, importing data, exporting data, etc. Batches are often scheduled to run in the middle of the night or during non-business hours so that systems stay up and usable without disruption. When all the jobs have completed successfully, the batch is said to be complete.
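To make the “jobs that run one after another” idea concrete, here is a minimal sketch in Python. The job names and bodies are made up for illustration; the point is just the pattern: run each job in order, and the batch is complete only when every job succeeds.

```python
# Minimal sketch of a batch: jobs run one after another with no manual
# intervention. The job functions here are placeholders.
def move_files():
    return True   # pretend the files were moved

def import_data():
    return True   # pretend the data was imported

def export_data():
    return True   # pretend the data was exported

def run_batch(jobs):
    for job in jobs:
        if not job():  # each job reports success or failure
            return f"batch stopped at {job.__name__}"
    return "batch complete"

print(run_batch([move_files, import_data, export_data]))  # prints "batch complete"
```

Real batch tools add scheduling, dependencies, and notifications on top of this loop, but the core idea is the same.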
A batch system can be homegrown within a company or built using third-party programs. Examples of programs that can be used to handle batch systems are Windows Task Scheduler, VisualCron, and Control-M. If you were setting up a job to check whether a file exists, the process would look something like this:
- Create some code that will validate whether a file exists. Have it return a bit to indicate true or false.
- Create an executable to run the code, like a .cmd file.
- Create a new job/task (depending on your program) to run the executable.
- Determine what time it will run, whether it needs to run before or after any other existing job, the frequency, and what to do if it completes successfully or not.
- An example of “what to do” is whether to run another job next or to stop the entire batch process.
- Schedule the job to run.
- And you’re done!
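The first step above could be sketched like this in Python (the file path is just an example, and the exit-code wrapper is how a scheduler would typically consume the true/false “bit”):

```python
import os

def file_exists(path):
    """Return True if the file is present -- the true/false bit from step 1."""
    return os.path.isfile(path)

# Schedulers like Task Scheduler or Control-M usually branch on an exit
# code rather than a return value: 0 means success, non-zero means failure.
def exit_code_for(path):
    return 0 if file_exists(path) else 1

print(exit_code_for("input.csv"))  # "input.csv" is a made-up example path
```

A `.cmd` wrapper would then just invoke this script and pass its exit code back to the scheduler.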
To help explain further, let’s take a fake insurance company as an example.
Insurance company Johnopoly employs agents to sell insurance policies to the amazing citizens of Johnopolis. Whenever an agent needs to update information on a policy, they can enter the information into a user interface (UI) application. The application then sends the information to be loaded into a database, where the data is staged until batch processing runs to move it into the main system. Until batch runs, the data viewed on screen is potentially out of date; it isn’t until batch completes that users viewing the data (online or in the application) will see the correct data. Following the same process as before, here is how I would set it up:
- Create some code to load data from the staging area into the Policy, Contact, and Reporting systems.
- Create an executable file that will execute my code, like LoadSystem.cmd.
- Create a job/task in Control-M to run LoadSystem.cmd every weeknight at 6:00 PM.
- If it completes successfully, do nothing. If it fails, I want it to stop batch and send me an email so I can check out what’s wrong.
- Schedule it to run.
- Now I don’t have to worry about manually loading data from staging into the main systems anymore!
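The load step itself could look something like this sketch. The staged records, field names, and the single `policy_system` target are all invented for illustration; a real job would write to the Policy, Contact, and Reporting systems:

```python
# Hypothetical staged policy updates waiting for the nightly batch run.
staging = [
    {"policy_id": 1, "field": "address", "value": "12 Main St"},
    {"policy_id": 2, "field": "phone", "value": "555-0100"},
]
policy_system = {}  # stand-in for one of the main systems

def load_from_staging(staging, target):
    """Apply each staged update to the target system, then clear staging."""
    for record in staging:
        target.setdefault(record["policy_id"], {})[record["field"]] = record["value"]
    staging.clear()  # staging is emptied once the load succeeds
    return target

load_from_staging(staging, policy_system)
print(policy_system)
```

Until this job runs, anyone looking at `policy_system` sees the old values, which is exactly why the on-screen data is stale before batch completes.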
Why do you have to wait to move data into main systems?
The reason for the wait period before processing data into the main systems (like Policy, Contact, and Reporting in the example above) is that processing during business hours could lead to downtime for the users of those systems. While a system is busy processing data, it may not have enough resources left to serve users efficiently. It’s like when you have too many applications open on your computer and they’re all sucking up your CPU.
As a side note, batch processing also depends on system design, but system design itself is a topic for another day. Hopefully this helps push you in the right direction toward understanding batch.