Digging Deep Into Compression. Because file compressibility is intimately related to the degree of repetitiousness in data (which differs from language to language, and even from author to author within a language), researchers have shown that measuring compressibility with common utilities such as WinZip or StuffIt can “discern the language of mystery texts as short as 20 characters. Furthermore, using a database of 90 texts from 11 different authors, they found their method could even pick out individual authors with a success rate of 93 percent.
Search engines, they say, could use this simple technique to categorize their quarry by semantic content and more qualitative categories such as style and readership level.” Wired
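
To make the idea concrete, here is a minimal sketch of one way such a compression test could work. It is not the researchers' actual procedure, which the quote does not spell out. It assumes Python's standard zlib module as a stand-in for WinZip or StuffIt, and the function names (compressed_size, extra_bytes, guess_label) and the tiny sample corpora are hypothetical, chosen only for illustration. The mystery snippet is appended to a reference text for each candidate label, and the label whose reference grows the least after compression is taken as the best match.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, snippet: str) -> int:
    """Bytes added to the compressed reference when snippet is appended.

    A small value means the compressor could reuse patterns it had
    already seen in the reference to encode the snippet.
    """
    return compressed_size(reference + snippet) - compressed_size(reference)


def guess_label(snippet: str, references: Mapping[str, str]) -> str:
    """Return the label whose reference text absorbs the snippet most cheaply."""
    return min(references, key=lambda label: extra_bytes(references[label], snippet))


# Hypothetical toy corpora; a real test would need far longer reference texts.
REFERENCES = {
    "english": "the cat sat on the mat and looked at the moon over the quiet town " * 3,
    "italian": "il gatto dorme sul divano mentre la pioggia cade piano sulla strada " * 3,
}

if __name__ == "__main__":
    mystery = "looked at the moon over the quiet"
    print(guess_label(mystery, REFERENCES))
```

The same function could be pointed at authors instead of languages by keying the references on author names. How well it works depends heavily on the size and quality of the reference texts, so the accuracy figures quoted above should not be expected from a toy setup like this one.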
