How Internet Security Works: Part I |

Have you ever wondered how your communications, and general data transfer over the internet, stays secure and confidential? It is actually quite an intricate process, and involves complex mathematics, cryptographic algorithms and identity-verification infrastructure that is constantly operating in the background.

In this series of blog posts, I would like to give a high-level overview of how internet security works. I emphasize that this is only a high-level and general overview, as an in-depth dive into this is a multidisciplinary endeavor that would require hundreds of pages, at the very least (and be, in truth, outside of my competence). Given that this is a mere introduction into internet security, for the sake of readability and accessibility I have simplified some concepts and omitted others.

Before we begin, it is necessary to differentiate between data in-transit and data at rest. Data in-transit denotes data as it is transiting the wire. Data at rest denotes data as it is staying at rest, as it were, on a storage medium (e.g., a HDD or SDD). Internet security consists of both data-in-transit security and data-at-rest security. Only the former is the subject of this blog post. I will cover how the security of data at rest works in a future post.

Now, suppose you would like to connect to Google. How does your communication with Google maintain the two security properties of confidentiality and integrity? That is, how does your communication to Google manage to only be read by Google (and whoever Google chooses to share your data with), and how is it insured that the data that you send to Google, e.g., your searches, have not been tampered with? Preserving the two security properties of confidentiality and integrity is the provenance of data-in-transit security.

Confidentiality there is maintained by cryptographic encryption algorithms, i.e., algorithms specifically made to encrypt data such that only parties with the relevant secret keys can decrypt the data. The algorithms standardly used on the internet were created by high-level cryptographers and mathematicans; they are open-source algorithms, meaning that they are free to use anywhere and that their source code is freely available to the public. Moreoever, these algorithms, like the famous Rijndael algorithm of Advanced Encryption Standard (AES) fame, almost certainly underwent extensive processes of peer review (like the sifting process undertaken by the National Institute of Standards and Technology, otherwise known as NIST).

Now, these encryption algorithms, otherwise known as ciphers, are fundamentally what maintain confidentiality when, e.g., I send data to Google. My data is encrypted before leaving my endpoint device and the encrypted data, known as a ciphertext, is sent to Google. So some ciphertext C is the result of running an encryption algorithm E on data M using a secret key k; this is standardly represented as $$C = E(M, k)$$ This process is called symmetric-key encryption, i.e., encryption where all parties have the same exact key k – hence the symmetry – and which enables the recipients to decrypt C. So far so good. But, you may be wondering, how would you go about sending your k to Google without it being compromised by a third-party intercepting the data being sent over the wire? If that third party were to get its hands on k, it could decrypt C. Decryption is standardly represented as $$M = D(C, k)$$

One idea of protecting k as it is transiting the wire is to also encrypt k itself using the aforementioned encryption method. So M would be substituted for k1, leaving $$C2 = E(k1, k2)$$ But this second-order encryption only pushes the issue back one step – what if an adversary, Eve, is sniffing the wire and intercepts k2? Then Eve would be able to use k2 to decrypt C2 and hence reveal the plain text k1. And using k1, the adversary would be able to decrypt C1. So passing the encryption buck in this way, if you will, is not an adequate solution.

Another solution would be to pass k1 through an out-of-band channel, meaning, through a channel or means outside of the channel through which M is passed. So, for example, if I send Google my M for a Google search over the internet, I could send my k to Google over phone or through mail. But the issue is that this is obviously not very practical or time-efficient.

What, then, is the solution? Well, in 1976 Whitfield Diffie and Martin Hellman published a watershed paper, New Directions in Cryptography, wherein they presented a brilliant algorithm that has now come to be called the Diffie-Hellman key exchange (DH).¹ DH algorithms have been ubiquitously adopted as the de facto solution for the aforementioned key-exchange problem. On a very abstract level, DH allows for parties to come to an agreement on the value of the symmetric key k over an insecure channel, and indeed the same channel that M is sent on.

But how? The answer involves clever mathematics. Basically, DH makes use of one-way functions in mathematics. Think of one-way functions as functions that are difficult to compute in one direction, but very hard to reverse. Now, the existence of such one-way functions hasn’t technically been theoretically proven; but a few one-way functions have been conjectured to exist based on decades of scrutiny by mathematicians. For example, it is computationally feasible to calculate the product of two very large prime numbers p and q, for any randomly chosen p and q;² however, it is conjectured that it is not computationally feasible to calculate the prime factors of pq; at the very least, no one has figured out a way to do this yet. In the case of DH, it relies not on the prime factorization problem (which a famous cryptogrpahic algorithm called RSA relies on), but what has come to be called the discrete logarithm problem (DLP). The following is a brief explanation of DLP:

Pick a large prime number $p$, pick a base $g$, and choose some secret number $a$. It is computationally feasible to compute:

$$ X \equiv g^a \pmod p $$

Even when $a$ is hundreds of digits long, modern computers can compute $g^a \bmod p$ quickly by a method called modular exponentiation.

Now suppose someone knows $g$, $p$ and $X$, but not $a$.

Recovering $a$ from the equation

$$ g^a \equiv X \pmod p $$

is called taking the discrete logarithm of $X$ base $g$ modulo $p$:

$$ a = \log_{g}(X) \pmod p $$

And there is, as of yet, no computationally feasible method to calculate $a$ here. It is considered a hard problem.

Now, let’s handwave at the mathematics here. For our purposes, what we need to know is that by making use of the aforementioned DLP problem, DH allows parties to come to an agreement as to the value of $k$. Importantly, the symmetric key agreed upon in Diffie–Hellman is as follows:

$$ k = g^{ab} \pmod p $$

So DH allows data like k to be “exchanged” over an insecure channel; I place “exchanged” in quotes because technically k is never transmitted over the wire, whether in encrypted or unencrypted form. Technically, what is exchanged is public information that, in conjunction with the parties’ private information (in this case private exponenets), allow the respective parties to come to an agreement regarding the value of k. DH is an instance of what is called asymmetric cryptography because there are public and private values, in contradistrinction to symmetric cryptography, where there are only shared private values or keys.

After the DH key agreement, symmetric encryption through the shared key k, as explained above, can be used to faciliate a secure connection between the relevant parties. Modern systems use DH, or, more accurately, modern variants like Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) or Diffie-Hellman Ephemeral (DHE), to initiate a secure key agreement, and then this symmetric key will thereafter be used to encrypt all the data in the session. In practice, such a hybrid cryptographic solution consisting of both symmetric and asymmetric cryptographic algorithms is ubiquitously implemented in the modern internet.

Okay, so that is how the security property of confidentiality is maintained over the the internet. But what about data integrity? How can I be sure that when I send data to Google, or to my bank, that that data will not be tampered with and is therefore the same exact data that I sent? The answer once again involves cryptography and clever mathematics, and specifically, the concept of a hash and a message authentication code (MAC).

At the heart of verifying data integrity in digital security is the concept of hashing. Hashing is the process of converting an input of any size (like a file, a password, or a sentence) into a unique string of characters called a hash value or message digest, using a mathematical algorithm known as a hash function (e.g., SHA-256). Good hashes have a number of important properties. A hash should be deterministic, meaning that a hash function applied to the same data will always produce the same output. Moreover, a hash should be computationally infeasible to reverse without knowing the original hashed value, often called the pre-image. In addition, it should exemplify what is called an “avalanche effect,” i.e., even a change to a single bit of data should cause an avalanche of changes to the resulting hash such that it will be dramatically different. Lastly, a good hash should, regardless of the input, have a fixed-length output. For example, data that is hashed using the SHA-256 algorithm will always have a fixed-length output of 256 bits, which can be represented as 64 hexadecimal characters. The following is an example of a SHA-256 hash of the string, “Hello World!”

7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069

Now, suppose I replace the exclamation mark with a period and apply the SHA-256 hash function on that. The resultant message digest would be as follows:

f4bb1975bf1f81f76ce824f7536c1e101a8060a632a52289d530a6f600d52c92

As you can see, the hashes are dramatically different, even though both hashes are 256 bits, and represented as 64 hexadecimal characters. This is by design, and makes it easy to recognize, even to the untrained human eye, when data has been tampered with. You can now see why hashes are extremely useful in internet security.

Take the following scenario: I have a message M that I want to send to Google. I hash M with SHA-256, producing a message digest D1. I then send M and D1, along with information about which hashing algorithm I used, to Google. Google, receiving the data, hashes M with SHA-256 and produces D2. It will then check to see whether the digest it received is equal to D2. If it matches, then that proves that the data was not tampered with in transit. The problem of retaining data integrity in transit has been solved, right? Well, not so fast. All this proves is that the message that arrived at Google was not tampered with; indeed, by comparing the message digests, Google has basically proven that the message it received was not tampered with. But Google has not proven that the message originated from me. That is, it has validated data integrity, but not data authenticity. For all we know, a third party, Eve, could have intercepted the message, modified M, and replaced D with a hash of modified M. In this case, all Google would have verified is that modified M was not tampered with in-transit. But that doesn’t prove that I sent modified M. In order to solve this authentication issue, more than just a hash function is necessary. Enter message authentication codes (MACs).

A message authentication code is used in symmetric encryption to provide not only integrity, but authenticity. In other words, a MAC verifies that a message originates from the other relevant party (with a caveat I will mention later). The most widely used type of MAC is called an HMAC, i.e., a hash-based message authentication code. It is a way to provide authentication without the use of asymmetric cryptography, which relies on digital signatures. A proper implementation of HMAC assumes that a secure key-exchange or agreement has already been reached. Because only Google and I should have access to K, K can be used as part of the hash input, and this will verify authenticity. So, you can think of an HMAC as a function that basically consists of hashing the concatentation of M and K:

$$ \text{HMAC}(M) = H(M \parallel K) $$

Very importantly, this should only be taken as a heuristic, as this implementation of an HMAC would technically be considered insecure and vulnerable to length-extension attacks.³ But as a model of how to think of the concept of an HMAC, this suffices.

Okay. So, in the wild wild West of the internet today, key agreement is done by variants of the Diffie-Hellman key-agreement algorithm, message integrity is verified by hashing algorithms, and message integrity and authenticity are verified by HMACs. Now, recall tat a secure implementation of HMAC assumes that k was securely agreed to or exchanged by the relevant parties. But, how would I know, before initiating a DH key-agreement, that the recipient party is in fact Google, and not Eve? After all, if the initial DH key-agreement is done with a malicious third party like Eve, then a successful HMAC implementation would only verify authenticity post DH key-agreement, but not prior to it. If Eve were to somehow impersonate Google prior to the DH key-agreement, then the whole aformentioned security tapestry would disintegrate and the communication would be compromised.

So, how can I trust that the initial DH key-agreement is being done with Google and not Eve? The answer lies in Public Key Infrastructure (PKI) and digital certificates, which make use of asymmetric cryptography. Recall that asymmetric cryptography makes use of public-private key pairs rather than a shared secret key. Internet PKI fundamentally utilizes asymmetric cryptography, a hiercharchy of certificates, intermediary certificate authorities and root certificate authorities to verify that when you send data to a web server hosting a particular website or domain, you can trust that you are communicating with the owner of that domain and not some third party. In part II, I will provide a general overview of the internet’s Public Key Infrastructure (PKI).

DH marked the first publicaly disclosed instance of an asymmetric cryptographic algorithm, i.e., an algorithm where there are public and private values. Prior to their paper, only symmetric algorithms making use of shared private values or keys were publically disclosed. It should be noted that although DH technically falls under asymmetric cryptography, it only makes use of asymmetric cryptography for key agreement, not encryption. ↩︎
In mathematics, a calculation is considered computationally feasible if it can be computed in polynomial time. ↩︎
An HMAC technically applies two hashes: an inner and outer hash, and is represented as follows: $$ \text{HMAC}(M) = H((K \oplus \text{opad}) \parallel H((K \oplus \text{ipad}) \parallel M)) $$ opad and ipad are constants that are parameterized by the hash function’s block size (e.g., SHA-256’s block size of 64 bytes). ipad is 0x36 multiplied by the block size, whereas opad is 0x5C multiplied by the block size. ↩︎