Basics

This page is a must read. If you don't read it, give up now or go and write thousands of lines buggy Python.

You have to understand data types.

Binary Representation

All numeric data, in fact all data, is represented as a series of zeros and ones. The sequence is linear and causes certain problems when it meets the real world. Normally, we count in base 10, from 1 to 10. This is actually a fallacy because we are really counting from 0 to 9 and the 10 turns up due to overflow, which will be discussed later.

Binary presents either a 0 or a 1, so doing addition you may immediately encounter overflow . Counting in binary starts at zero and the strings of 0's and 1's just get longer and longer:

O 1 1O 11 100 101 110 111 100O 10O1 1010

This may not appear obvious, but adding binary 1 + 1 you get overflow, which gives you the result 10, in binary. This is similar to doing the decimal addition of 9 + 1; it won't fit into one character so you get overflow, giving you ten [10 in decimal].

Integers

These come in two types: unsigned and signed. The former represents numbers in the range 0 to N while the latter represents numbers in the range -N to N - 1 (where N is some arbitrary number).

This all comes back to binary arithmetic:

0 is always taken.
0 to N are taken, in the unsigned case.
-N to N - 1 are taken, in the signed case because one bit is used to indicate that the integer is negative.

These three rules apply to any binary integer, irrespective of the number of bits used as long as there is a spare bit in the signed case (eg. -1 doesn't exist when you only have 1 bit to play with).

Integers come in various sizes: 8, 16, 32 and 64 bits (usually powers of 2), whether they are unsigned or unsigned .

Various architectures have used integers of strange sizes [36 bits], but most of those are now dead, or should be.

Unsigned Integers

These are quite easy as they represent values from 0 to N (where N is positive). N + 1 overflows to 0.

Signed Integers

These are more complex as they represent both positive and negative values [from -N to N - 1, where N is any positive integer]. To understand these requires knowledge of one's and two's complement representataions.

Neither of these is very difficult and the following example should suffice (we shall assume an 8 bit binary representation):

000000001 - the value 1

111111110 - the value 1 [one's complement]

111111111 - the value -1 [two's complement]

The process is quite simple.

From a binary value to one's complement the bits are flipped . Flipping turns zeroes into ones and ones into zeros. This is a common operation in machine code and some 3rd generation languages, say C.

From one's complement to two's complement 1 is added. At this point you have a signed [negative] integer. The leftmost bit being 1 indicates that the number is negative. Overflow here becomes a problem because a very large negative number can be be turned in 0 or any positive number < N or even some negative number <= -N.

An 8 bit number was chosen as an example because many machines do not natively support signed 8 bit numbers.

Sign Extension

Many machine architectures preserve the sign when the value is promoted to a larger bit value. Some do not and some of this is hidden by the the programming language being used.

Floating Point

Floating point is nasty. It was explained once by a pure maths lecturer as:

Floating point is used for measuring beer.

The goal is to represent uncountable objects [beer], to a reasonable precision, using a countable object [9mm rounds] method. Such a method falls to the manipulation of integer values, but in an interesting way.

Of course, this may sound like madness, but it is simple in principle .

The integer is split into three pieces:

The sign bit.
The mantissa, which is a collection of bits representing the value.
The exponent, which says where the decimal point goes; it's really a multiplier.

The real problems start because the representation is an approximation and is further complicated by:

overflow
underflow
roundoff
comparisions
NaN [Not A Number]

Because of the above problems all sorts of caveats should be thought about. The worst are probably roundoff and comparison. Comparison causes problems because two approximate values are being compared and heuristics are commonly used to resolve this problem, depending on the calcuation involved.

Further reading is encouraged if floating point is to be understood fully.

Strings

Strings are used to represent and manipulate text.

No 3rd generation language should be used unless it has strings as a basic type, except in special cases.

This cuts down the choice of 3rd generation languages to a few. Limbo and Python both have strings as basic types. This removes all the woe involved in string allocation, garbage collection and overflows that have provoked more than one security problem.

Limbo is a better choice because the set of operators is small and well designed. However, Python seems to to designed from the point of view that every possible string operator should be added. This style of thinking just adds complexity and complexity leads to bugs.

Pascal has strings, but they are unusable in practice. A (short) defence of such a statement is why are there numerous versions of Pascal to get strings right?

C does support strings. It views them as arrays of nul terminated 8 bit integers. If you count wrong, you lose. The use of C's sizeof operator can be beneficial but gives no guarantees. Some good C programmers write infinite string libraries. However, most don't and they assume that any given array of 8 bit integers will suffice. They don't.

Astract Data Types

These are usually built with the aforementioned data types and can be considered as one entity. Integer, floating point and string data types can be mixed to make an arbitrarily complex data type, each ocuupying its own unique memory. They can be nested and inherited depending on the programming language.

Inherited data types and inheritence should be avoided, unless wading through a gruesome mess is one of your pastimes.

A curious form of an abstract data type is called a union [in C] where two different data types occupy the same memory, but comprise the the same entity.

Enumerated Types

They are named constants, usually having a signed integer value. Use of such them is encouraged to avoid literal integers in the code. Pick a name, assign it a value, use it everywhere.

Should you need to change its value it only needs to be done in one place and the effects will be global.

Enums [the pluralised name from C] can be used as union discriminators and constants. If you are writing C the preprocessor should be avoided, in favour of enums or other constants.

Summary

Understand the basics.
Apply that knowledge to your advantage.
Spend time investing in a clean set of data structures.
It will pay off in the long term.

Your comments are welcome and if you liked this page you can donate via: