You may ask why there is a need to generate a hash value to compare records when comparing fields would suffice. The one reason I would say is to help with performance. I’m coming from a data warehousing perspective on this because there may be a need to switch the process of where row-by-row processing happens; this is more evident when processing 100k records where type 1 or 2 attributes occur. Or if you just want to read this blog, that’s cool too.
What’s a Hash Value?
First of all, what is a hash value? Microsoft defines a hash value as ‘…a numeric value of a fixed length that uniquely identifies data’. In a nutshell, it’s a string (hash value) that defines some data (like a string letters). To generate a hash value, a hash function or algorithm is used.
A hash function or algorithm defines the process of how a hash value is created. Algorithm examples are MD2, MD5, SHA, and SHA_256.
A hash value can vary in the way it looks, depending on the algorithm you use. For example, the word “Hello World” has a value of “a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e” using the SHA-256 algorithm. The MD5 algorithm outputs “b10a8db164e0754105b7a99be72e3fe5”
The T-SQL function used in the examples below will be HASHBYTES(). It’s important to note that there are deprecated, but still working, algorithms used: MD2, MD4, MD5, SHA, and SHA1. The only recommended ones from Microsoft are SHA2_256 and SHA2_512.
Compare Examples
In the examples below, I am comparing the same customer from stage (dbo.stgTblCustomer) and target (dbo.tblCustomer) tables. Rather than updating the data per each example, I did transformations inline to help you understand what is going on. I also know the result images are really small, I apologize. I think it’s my WordPress Theme that’s making them do that.
Simple Compare
In the code above, I want to show a simple difference between the two records based on their hash value. I use the HASHBYTES() function by specifying the algorithm I want to use, then concatenating VARCHAR(100) casted fields (with pipe delimiters). It’s important to remember that the fields you are comparing need to be the same in each query. You can see the generated hash values are different because Gender and SSN values are different between the two records. You might be asking why the pipes – that’ll be explained later; for now, just go with it.
Simple Compare w/ Transformation
Now let’s transform the Gender to an ‘F’, remove dashes from the SSN, and then check the hash values again.
We can see in the example above that the hash values are now the same because the columns we are comparing are the same. While it may be a hassle to add all the columns you want to compare in the query, you only need to compare the single hash value field for differences.
Why Pipe Delimiters?
Here is your answer to why the pipe delimiters. The reason pipe delimiters are important is because it separates data from column to column and compares it with more accuracy. To better explain this, take a look at the example below.
Say the last letter in ‘Evonne‘ was transposed to the last name, so we have ‘Evonn’ (first name) ‘eGiordin’ (last name). Using the function without the pipes would compare (specifically looking at first and last name) ‘EvonneGiordin’ to ‘EvonneGiordin’. Whereas using pipes would compare ‘Evonn|eGiordin’ to ‘Evonne|Giordin’.
Hopefully that’s a good intro into using hash values to compare data. Happy coding!