Wednesday, March 29, 2006

Data compression

Some days ago I was here at work and I was trying to access Nortel's internal network using a VPN. It was not quite working and I was trying to fix the problem when I realized the type of compression the software was using...

During my days of studies in Sweden I took a course called Data Compression where I had the chance of implementing and testing several data compression algorithms and compare their performance. What is that data compression about? This term is usually referred as the process of encoding information using fewer bits... that means we try to find a way of sending information that requires less use of resources, being able to transmit more data in the same available space.

If you use a computer frequently, you might be dealing with data compression all the time even without knowing it. One compression format that we use very often is the well known "ZIP" file format. Even all the pictures, videos and music we have stored in our hard drive use a compression technique, for example: JPEG, MPEG, MP3, WMA, GIF...

My VPN was using a LZ(Lempel-Ziv) encoding algorithm (LZW to be specific). This type of compression falls under the category of lossless data compression (there is lossy data compression as well, frequently used in Internet and telephony) and if you really want to know where it comes from and what the basis of the algorithm is, you can take a look at this.

How does LZW work? LZW is a way of compressing data that takes advantage of repetition of strings in the data. LZW manipulates three objects in both compression and decompression: the input text, the output stream of bits, and a string table. Such string table is a product of both compression and decompression, but is never passed from the encoder to the decoder.

Lempel-Ziv compressors use a dictionary of symbol sequences (string table). When an occurrence of the sequence is repeated it is replaced by a reference to its position in the dictionary.

No doubt this compression technique is much better understood with an example and I found a very good one here.

If you try to do the same thing with this example PPPQPPQQQ, the whole string will be compressed to 0 2 1 3 1 1

Originally, the size of the string PPPQPPQQQ is 9 x 8 bits (size of ASCII) = 72 bits. For the compressed string, if we define the code size as 8 bits as well, its size be 6 x 8 bits = 48 bits. The compressed ratio is 1.5 in this case.

If anyone is really interested on this, I have the files implemented in Matlab for both the encoder and the decoder (and for some other compression algorithms as well). Just give me a shout and I will send them to you...

2 Comments:

At 5:52 PM , Anonymous Anonymous said...

Siempre me ha fascinado lo de la compresión de datos (es parte de ser teleco). Así que sí, estoy muy interesado en conseguir los ficheros.

Pasado mañana vienen mis padres y estarán todo el fin de semana (coincidiendo con tu viaje a Dublin). Deben pagar muy bien en Nortel para que puedas pegarte todos esos viajazos, ¡cómo te envidio!

A ver si pillo ya un trabajo de los wenos, de muchos billetes, para poder viajar tantísimo o más que tú :P

Besos

 
At 2:12 PM , Anonymous Anonymous said...

That's an interesting observation while you were trying to fix Nortel's Internal Network problems which led to a 'flashback' and hence the blog. :-)

I am trying to write a compression algorithm, as part of a much bigger project and I have a question for you here. Do you have any idea about how should you determine the maximum size of the dictionary?

A growing or infinite dictionary would be a problem in my view. In other words, I do not know how to use an infinite dictionary.

Thanks for your time,

Sudarshan L S

 

Post a Comment

Subscribe to Post Comments [Atom]

<< Home