ifni.co | CONCEPTS | REFERENCES | tweet | ABOUT |

We all know the answer to the question :

- What is the meaning of life, universe and everything ?

- 42.

... right ! Douglas Adams told us not so long ago.

How did he knew... of course it was a joke, still ... it is fun to ponder :)

Can we find some contrived way to prove that ? You are lucky I found one...

Even if you don't find the example compelling, because it is really out of ... somewhere :), still I'm using it as excuse so we can have some fun exploring the basics of Information theory...

“Information is a difference that makes a difference.” --Gregory Bateson

Wow !!! ... deep.The central tenet and goal of

We will first explore the problem of measuring the amount of information content in a message.

To shorten the article lets accept that the most basic way to encode information is bits. (That is true anyway we would not go trough the steps to show it.)

(The word bit comes from the shorthand of

A bit represent at most two possible choices

The good thing is this is very intuitive concept. If you just learned something new, then you didn't know about it beforehand, ergo your knowledge changed by that "difference", which we will call for now

1 | 000 | ||

2 | 001 | ||

3 | 010 | ||

4 | 100 | ||

5 | 011 | ||

6 | 110 | ||

7 | 101 | ||

8 | 111 | ||

Our question then is what is the information content of the things we learn ? Glad you asked :) .... Lets go with the easiest case by using

That is nice, but in reality we very rarely have such a simple needs i.e. exchange only 2 type of messages (symbols). Let say for example we want to communicate

What then will be the information content of messages using

We would expect that summing

Don't despair ;) there is very elegant solution to our problem, if we decide to represent information content as a logarithm of the possible symbols, rather than of the enumeration of all possible permutations. We can cheat this way because we are still in exploration and definition phase :). So what are the consequences if we follow this trough ...

If we accept logarithm definition instead of the enumeration, then the information content of 1 bit will be :

`log_2(2) = 1` unit

and for the
`log_2 (2*2*2) = log_2(2) + log_2(2) + log_2(2) = 3` units

We picked logarithm of base two, because it suits our needs best for binary like system ... we can pick any other we wish, but let stick with it for now.

The guiding principle is to preserve the summing-capability we just explored, so we can do easy calculation going from simple to more complex encodings.

We also made an assumption so far as we are sending messages back and forth on average the probability of getting any one symbol in a message out of

But in real life that is not always true i.e. some symbols are used more often than others and this changes the balance of the information content that can be transmitted.

Remember from a layman perspective the more surprising the message is to us the more information it contains.

What will happen if we want to encode messages using

Or what if we still use 8 symbols, but the symbol

To get another angle of how to handle those cases in our newly invented scheme, let get back to the case of 1 bit symbol. For example when flipping a unfair coin, over and over we find out that we get head(1) 9 out 10 times, and tail(0) 1 out of 10 times. Now we will take into account this unbalance by using the probabilities (in current case based on frequencies of event occurring):

` 90% * log_2(90%) + 10% * log_2(10%) ~= 0.9 * log_2(0.9) + 0.1 * log_2(0.1) = -0.4689`

the information content will then be

BTW we will also use

The formula we just used above is the following one :

`H = sum_{i=1}^n p_i * log_2(1/p_i)`
where :
`H` - is information content/Entropy
`p_i` - probability of encountering symbol "i"
`n` - number of possible messages using those symbols

Which is the same thing that we used before, just using probability.

Let's try it for the 8 bit case (the probability of a message happening is 1/8, if everyone of them have equal probability) :

`sum_{i=1}^8 1/8 * log_2(1/(1/8)) = 8 * 1/8 * log_2(8) = log_2(8) = 3` bits

So far so good, it seems like we found a way to quantify information-content, now we can further explore how to manage, manipulate and transport information. Our speculations are starting to look more and more like a Science.

The next step is to explore the process of exchanging information.

As we saw if we use n-bits to define a message in binary system there is

Striking the right balance is the major goal of Information theory. In this short article we wont go so deep, our main goal here is to get good understanding of Information content/Entropy.

OK, now that we know how to measure Information ... lets proceed to the answer of the

We will take a look at the English language.

The Latin alphabet contains 26 characters, plus space i.e. 27. Knowing this we can find the maximum entropy : (Maximum entropy happens when there is equal probability of using any character/symbol to devise a word, if we assume we have full freedom in combining characters)

`sum_{i=1}_27 1/27 * log_2(1/(1/27))`
or just :
`log_2(27) = 4.7548`
so the information content is : ` ~ 4.76` bits

That is the maximum information content we can expect, how can we find the real one ... we are not free to use all possible combinations when building words, English language imposes a constrains on us. First approximation is to use the frequency occurrence of the characters in English text. If we did that and used the probability equation for Entropy we devised earlier, we would find out that the answer is approximately

We can go to the next level i.e. using words frequencies build per word entropy then divide it by the average length of word in English (

So the maximum Entropy of English language per letter is

The drawback of this scenario is if we don't know the alphabet, or we don't know the word-codes we can not extract meaningful information from a message. So If we talk about machines exchanging information they both have to have the tables with codes, or the same algorithm have to be known on both sides (if we use algorithm to encode the messages).

We can translate this example to human parlance by comparing different languages, let speculate that Chinese hieroglyphs provide more information than English alphabet. If Chinese is more "information rich", why don't we use it then. The reason is that to use it you have to learn it first ;). Or vs. versa

So this is another problem that is tackled by Information theory, how to find the best code with the least complexity of the code. This is called

But let's go back to our original problem, English language ....

Now that we have the range of Entropy what can we do with it ? For one if we calculate this as a ratio we could find how much relative freedom of expression is allowed in the English language.

Let's do that :

`2/4.76 = 0.42` choice

`1 - 0.42 = 0.58` redundancy

`1 - 0.42 = 0.58` redundancy

For example let see how much more information we can exchange when we use binary 0|1 compared to decimal system 0-9. We know the formula already :

`log_2(2) = 1`
`log_2(10) = 3.32`

So the decimal notation is more than three times information rich.
The more recent tests shows that the Entropy of English language lower band may be 1.5 bits.. or lower the longer the average word in English language is, but I will assume that the acquisition in the last 50 years of many new words from Information age, Internet and globalization, any new word would add higher unpredictability of guessing the next character in building words.

The important thing that we learned in this short article is how to measure the Information content. If we know the frequency/probability of the symbols used to encode the possible messages generated by "any" source we just sum the products of the probability multiplied by the logarithm of probability of the possible symbols.

`H = sum_{i=1}^n p_i * log_2(1/p_i)`

We also learned that this is also the so called Entropy.
What other use this could be for us ? We now know how to find the physical limits on what is the maximum information that can be encoded given the number of symbols and their probabilities.

Based on this insight Information theory goes further to explore the capabilities of information channels and the ways to encode information, so in real life we can build reliable communication systems.

The same thing the Carno-cycle give to thermodynamics, Information theory gave to Communication.

- An Introduction to Information Theory: Symbols, Signals and Noise - John R. Pierce
- Great ideas in Information theory, Language and Cybernetics - Jagjit Singh

HOME | TOC | CONCEPTS | REFERENCES | ABOUT |