Hash Function

From GM-RKB
Jump to navigation Jump to search

A hash function is a software function that can maps variable length data to a fixed length key.



References

2020

  • (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/hash_function Retrieved:2020-5-5.
    • A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

      Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval, and storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space efficient form of data access which avoids the non-linear access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.

      Use of hash functions relies on statistical properties of key and function interaction: worst case behavior is intolerably bad with a vanishingly small probability, and average case behavior can be nearly optimal (minimal collisions). [1]

      Hash functions are related to (and often confused with) checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers. Although the concepts overlap to some extent, each one has its own uses and requirements and is designed and optimized differently.

2013a

2013b

  • http://en.wikipedia.org/wiki/Hash_function#Description
    • Hash functions are primarily used to generate fixed-length output data that acts as a shortened reference to the original data. This is useful when the output data is too cumbersome to use in its entirety.

      One practical use is a data structure called a hash table where the data is stored associatively. Searching for a person's name in a list is slow, but the hashed value can be used to store a reference to the original data and retrieve constant time (barring collisions). Another use is in cryptography, the science of encoding and safeguarding data. It is easy to generate hash values from input data and easy to verify that the data matches the hash, but hard to 'fake' a hash value to hide malicious data. This is the principle behind the Pretty Good Privacy algorithm for data validation.

      Hash functions are also used to accelerate table lookup or data comparison tasks such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

      A hash function should be referentially transparent (stable), i.e., if called twice on input that is "equal" (for example, strings that consist of the same sequence of characters), it should give the same result. There is a construct in many programming languages that allows the user to override equality and hash functions for an object: if two objects are equal, their hash codes must be the same. This is crucial to finding an element in a hash table quickly, because two of the same element would both hash to the same slot.

      Hash functions are destructive, akin to lossy compression, as the original data is lost when hashed. Unlike compression algorithms, where something resembling the original data can be decompressed from compressed data, the goal of a hash value is to uniquely identify a reference to the object so that it can be retrieved in its entirety. Unfortunately, all hash functions that map a larger set of data to a smaller set of data cause collisions. Such hash functions try to map the keys to the hash values as evenly as possible because collisions become more frequent as hash tables fill up. Thus, single-digit hash values are frequently restricted to 80% of the size of the table. Depending on the algorithm used, other properties may be required as well, such as double hashing and linear probing. Although the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.[2]

      Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

2010

  • http://www.partow.net/programming/hashfunctions/index.html
    • Hash functions are by definition and implementation pseudo random number generators (PRNG). From this generalization its generally accepted that the performance of hash functions and also comparisons between hash functions can be achieved by treating hash function as PRNGs.
    • Analysis techniques such a Poisson distribution can be used to analyze the collision rates of different hash functions for different groups of data. In general there is a theoretical hash function known as the perfect hash function for any group of data. The perfect hash function by definition states that no collisions will occur meaning no repeating hash values will arise from different elements of the group. In reality its very difficult to find a perfect hash function and the practical applications of perfect hashing and its variant minimal perfect hashing are quite limited. In practice it is generally recognized that a perfect hash function is the hash function that produces the least amount of collisions for a particular set of data.

  1. Knuth, D. 1973, The Art of Computer Science, Vol. 3, Sorting and Searching, p.527. Addison-Wesley, Reading, MA., United States
  2. Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching. pp. 506–542.