Home>Article>Java> Analysis of the implementation principle of HashMap in Java

Analysis of the implementation principle of HashMap in Java

不言
不言 Original
2018-09-11 13:57:06 1493browse

The content of this article is about the analysis of the implementation principle of HashMap in Java. It has certain reference value. Friends in need can refer to it. I hope it will be helpful to you.

1. HashMap overview:
HashMap is an asynchronous implementation of the Map interface based on hash tables. This implementation provides all optional mapping operations and allows null values and null keys. This class does not guarantee the order of the mapping, and in particular it does not guarantee that the order is immutable.
2. HashMap data structure:
In the Java programming language, there are two most basic structures, one is an array, and the other is a simulated pointer (reference). All data structures can be constructed using these two basic structures, and HashMap is no exception. HashMap is actually a "linked list hash" data structure, which is a combination of an array and a linked list.
As can be seen from the above figure, the bottom layer of HashMap is an array structure, and each item in the array is a linked list. When a new HashMap is created, an array will be initialized.

/** * The table, resized as necessary. Length MUST Always be a power of two. */ transient Entry[] table; static class Entry implements Map.Entry { final K key; V value; Entry next; final int hash; …… }

3. HashMap access implementation:
1) Storage:

public V put(K key, V value) { // HashMap允许存放null键和null值。 // 当key为null时,调用putForNullKey方法,将value放置在数组第一个位置。 if (key == null) return putForNullKey(value); // 根据key的keyCode重新计算hash值。 int hash = hash(key.hashCode()); // 搜索指定hash值在对应table中的索引。 int i = indexFor(hash, table.length); // 如果 i 索引处的 Entry 不为 null,通过循环不断遍历 e 元素的下一个元素。 for (Entry e = table[i]; e != null; e = e.next) { Object k; if (e.hash == hash && ((k = e.key) == key || key.equals(k))) { V oldValue = e.value; e.value = value; e.recordAccess(this); return oldValue; } } // 如果i索引处的Entry为null,表明此处还没有Entry。 modCount++; // 将key、value添加到i索引处。 addEntry(hash, key, value, i); return null; }

As can be seen from the above source code: when we put an element into the HashMap, we first recalculate the hash value based on the hashCode of the key, and get the element in the array based on the hash value. position (i.e. subscript), if there are other elements stored at this position in the array, then the elements at this position will be stored in the form of a linked list, with the newly added ones placed at the head of the chain, and the first added elements placed at the end of the chain. . If there is no element at that position in the array, the element is directly placed at that position in the array.
The addEntry(hash, key, value, i) method places the key-value pair at the i index of the array table based on the calculated hash value. addEntry is a package access method provided by HashMap. The code is as follows:

void addEntry(int hash, K key, V value, int bucketIndex) { // 获取指定 bucketIndex 索引处的 Entry Entry e = table[bucketIndex]; // 将新创建的 Entry 放入 bucketIndex 索引处,并让新的 Entry 指向原来的 Entry table[bucketIndex] = new Entry(hash, key, value, e); // 如果 Map 中的 key-value 对的数量超过了极限 if (size++ >= threshold) // 把 table 对象的长度扩充到原来的2倍。 resize(2 * table.length); }

When the system decides to store the key-value pair in HashMap, it does not consider the value in Entry at all, but only calculates and calculates based on the key. Determine the storage location of each Entry. We can completely regard the value in the Map collection as an accessory to the key. After the system determines the storage location of the key, the value can be stored there.

The hash(int h) method recalculates the hash based on the hashCode of the key. This algorithm adds high-bit calculation to prevent hash conflicts caused when the low-bit remains unchanged and the high-bit changes.

static int hash(int h) { h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4); }

We can see that to find an element in a HashMap, we need to find the position in the corresponding array based on the hash value of the key. How to calculate this position is the hash algorithm. As mentioned before, the data structure of HashMap is a combination of array and linked list, so of course we hope that the element positions in this HashMap are distributed as evenly as possible, so that there is only one element in each position. Then when we use the hash algorithm to obtain At this position, we can immediately know that the element at the corresponding position is what we want, without having to traverse the linked list, which greatly optimizes the efficiency of the query.

For any given object, as long as its hashCode() return value is the same, the hash code value calculated by the program calling the hash(int h) method is always the same. The first thing we think of is to take the hash value modulo the array length, so that the distribution of elements is relatively even. However, the consumption of "modulo" operation is still relatively large. This is done in HashMap: call the indexFor(int h, int length) method to calculate which index of the table array the object should be saved at. The code of the indexFor(int h, int length) method is as follows:

static int indexFor(int h, int length) { return h & (length-1); }

This method is very clever. It uses h & (table.length -1) to get the storage bit of the object, and the length of the underlying array of HashMap It is always 2 to the nth power. This is the speed optimization of HashMap. There is the following code in the HashMap constructor:

int capacity = 1; while (capacity < initialCapacity) capacity <<= 1;

This code ensures that the capacity of HashMap is always 2 to the nth power during initialization, that is, the length of the underlying array is always 2 to the nth power.
When length is always 2 to the nth power, the h& (length-1) operation is equivalent to taking the modulus of length, that is, h%length, but & is more efficient than %.
When the length of the array is 2 raised to the nth power, the probability of calculating the same index for different keys is small, so the data is distributed evenly on the array, which means that the probability of collision is small. Relatively speaking, the query At this time, there is no need to traverse the linked list at a certain position, so the query efficiency is higher.

As can be seen from the source code of the put method above, when the program attempts to put a key-value pair into a HashMap, the program first determines the storage location of the Entry based on the hashCode() return value of the key: If the hashCode() return values of two Entry keys are the same, then their storage locations are the same. If the keys of these two Entries return true through equals comparison, the value of the newly added Entry will overwrite the value of the original Entry in the collection, but the key will not be overwritten. If the keys of these two Entries return false through equals comparison, the newly added Entry will form an Entry chain with the original Entry in the collection, and the newly added Entry is located at the head of the Entry chain - continue to see the description of the addEntry() method for specific instructions. .
(2) Read

public V get(Object key) { if (key == null) return getForNullKey(); int hash = hash(key.hashCode()); for (Entry e = table[indexFor(hash, table.length)]; e != null; e = e.next) { Object k; if (e.hash == hash && ((k = e.key) == key || key.equals(k))) return e.value; } return null; }

有了上面存储时的hash算法作为基础,理解起来这段代码就很容易了。从上面的源代码中可以看出:从HashMap中get元素时,首先计算key的hashCode,找到数组中对应位置的某一元素,然后通过key的equals方法在对应位置的链表中找到需要的元素。

3) 归纳起来简单地说,HashMap 在底层将 key-value 当成一个整体进行处理,这个整体就是一个 Entry 对象。HashMap 底层采用一个 Entry[] 数组来保存所有的 key-value 对,当需要存储一个 Entry 对象时,会根据hash算法来决定其在数组中的存储位置,在根据equals方法决定其在该数组位置上的链表中的存储位置;当需要取出一个Entry时,也会根据hash算法找到其在数组中的存储位置,再根据equals方法从该位置上的链表中取出该Entry。
4. HashMap的resize(rehash):
当HashMap中的元素越来越多的时候,hash冲突的几率也就越来越高,因为数组的长度是固定的。所以为了提高查询的效率,就要对HashMap的数组进行扩容,数组扩容这个操作也会出现在ArrayList中,这是一个常用的操作,而在HashMap数组扩容之后,最消耗性能的点就出现了:原数组中的数据必须重新计算其在新数组中的位置,并放进去,这就是resize。

那么HashMap什么时候进行扩容呢?当HashMap中的元素个数超过数组大小*loadFactor时,就会进行数组扩容,loadFactor的默认值为0.75,这是一个折中的取值。也就是说,默认情况下,数组大小为16,那么当HashMap中元素个数超过16*0.75=12的时候,就把数组的大小扩展为 2*16=32,即扩大一倍,然后重新计算每个元素在数组中的位置,而这是一个非常消耗性能的操作,所以如果我们已经预知HashMap中元素的个数,那么预设元素的个数能够有效的提高HashMap的性能。
5. HashMap的性能参数:
HashMap 包含如下几个构造器:
HashMap():构建一个初始容量为 16,负载因子为 0.75 的 HashMap。
HashMap(int initialCapacity):构建一个初始容量为 initialCapacity,负载因子为 0.75 的 HashMap。
HashMap(int initialCapacity, float loadFactor):以指定初始容量、指定的负载因子创建一个 HashMap。
HashMap的基础构造器HashMap(int initialCapacity, float loadFactor)带有两个参数,它们是初始容量initialCapacity和加载因子loadFactor。
initialCapacity:HashMap的最大容量,即为底层数组的长度。
loadFactor:负载因子loadFactor定义为:散列表的实际元素数目(n)/ 散列表的容量(m)。
负载因子衡量的是一个散列表的空间的使用程度,负载因子越大表示散列表的装填程度越高,反之愈小。对于使用链表法的散列表来说,查找一个元素的平均时间是O(1+a),因此如果负载因子越大,对空间的利用更充分,然而后果是查找效率的降低;如果负载因子太小,那么散列表的数据将过于稀疏,对空间造成严重浪费。
HashMap的实现中,通过threshold字段来判断HashMap的最大容量:

threshold = (int)(capacity * loadFactor);

结合负载因子的定义公式可知,threshold就是在此loadFactor和capacity对应下允许的最大元素数目,超过这个数目就重新resize,以降低实际的负载因子。默认的的负载因子0.75是对空间和时间效率的一个平衡选择。当容量超出此最大容量时, resize后的HashMap容量是容量的两倍:

if (size++ >= threshold) resize(2 * table.length);

6. Fail-Fast机制:
我们知道java.util.HashMap不是线程安全的,因此如果在使用迭代器的过程中有其他线程修改了map,那么将抛出ConcurrentModificationException,这就是所谓fail-fast策略。

这一策略在源码中的实现是通过modCount域,modCount顾名思义就是修改次数,对HashMap内容的修改都将增加这个值,那么在迭代器初始化过程中会将这个值赋给迭代器的expectedModCount。

HashIterator() { expectedModCount = modCount; if (size > 0) { // advance to first entry Entry[] t = table; while (index < t.length && (next = t[index++]) == null) ; } }

在迭代过程中,判断modCount跟expectedModCount是否相等,如果不相等就表示已经有其他线程修改了Map:
注意到modCount声明为volatile,保证线程之间修改的可见性。

final Entry nextEntry() { if (modCount != expectedModCount) throw new ConcurrentModificationException();

在HashMap的API中指出:

The iterators returned by the "collection view methods" of all HashMap classes are fail-fast: after the iterator is created, if the map is structurally modified, unless it is passed through the remove method of the iterator itself, any other If the time is modified in any way, the iterator will throw ConcurrentModificationException. Therefore, in the face of concurrent modifications, the iterator will quickly fail completely without risking arbitrary non-deterministic behavior at an undetermined time in the future.

Note that the fail-fast behavior of iterators cannot be guaranteed. Generally speaking, it is impossible to make any firm guarantees when there are asynchronous concurrent modifications. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it is a mistake to write a program that relies on this exception; the correct approach is that the fail-fast behavior of iterators should be used only to detect program errors.

Related recommendations:

In-depth understanding of the implementation principle of HashMap in java (picture)

Detailed explanation of the principle and implementation of java lock-free hashmap

The above is the detailed content of Analysis of the implementation principle of HashMap in Java. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Previous article:Reflection based on Java Next article:Reflection based on Java