The old method has felt a bit suspicious to me since I first saw it.
Before, the mapping was just stored as an array for two main reasons:
- Insertions happen much more often than lookups (why?).
- Data is often looked up in insertion order, so a lasthit variable was stored to speed up the next lookup.
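For context, the old scheme can be sketched roughly like this. This is a hypothetical reconstruction from the description above, not the actual code; the class name and details (duplicate handling, wraparound) are assumptions:

```python
# Rough sketch of the old array-based mapping (hypothetical, not the real code).
class ArrayMap:
    def __init__(self):
        self.entries = []   # (key, value) pairs, in insertion order
        self.lasthit = 0    # index of the last successful lookup

    def insert(self, key, value):
        # Insertion is trivially O(1); duplicate handling omitted here.
        self.entries.append((key, value))

    def lookup(self, key):
        n = len(self.entries)
        # Start scanning at lasthit: if lookups arrive in insertion order,
        # the next key is usually only one step away.
        for step in range(n):
            i = (self.lasthit + step) % n
            if self.entries[i][0] == key:
                self.lasthit = i
                return self.entries[i][1]
        return None
```

When the in-order assumption holds, each lookup costs one step; when it breaks, the scan degrades to a full linear search.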
While both assumptions are true, I don't think they are a valid reason to use an array instead of a hash map.
Some numbers (mostly based on one file from Spring, but I also checked other files; the ratios remain similar):
- There are roughly 4 times more insertions than lookups (for Spring files, typically 2-4 million insertions and 500.000-1.000.000 lookups).
- # of lookups in 1 step: 360.000 [Edit: might have counted this incorrectly, but focus is on the other numbers anyway]
- # of lookups in <10 steps: 63.000
- # of lookups in <100 steps: 25.000
- # of lookups in <1000 steps: 40.000
- # of lookups in <10000 steps: 50.000
- # of lookups in <100000 steps: 15.000
- # of lookups in >100000 steps: 20
What I take from this is that optimizing only for the common case is the wrong approach here: even if 90% of lookups were extremely fast, that does not help when the remaining ones need more than 1000 steps on average.
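A back-of-the-envelope check on the numbers above makes the point concrete. The per-bucket step costs below are rough midpoint estimates (an assumption on my part); the lookup counts are the measured ones:

```python
# Rough weighted-average lookup cost from the measured distribution.
# Step costs per bucket are midpoint guesses (an assumption); counts are
# the ones measured for the Spring file above.
buckets = [
    (360_000, 1),        # 1 step
    (63_000, 5),         # <10 steps
    (25_000, 50),        # <100 steps
    (40_000, 500),       # <1000 steps
    (50_000, 5_000),     # <10000 steps
    (15_000, 50_000),    # <100000 steps
    (20, 100_000),       # >100000 steps (lower bound)
]
total_lookups = sum(n for n, _ in buckets)
total_steps = sum(n * s for n, s in buckets)
fast_share = buckets[0][0] / total_lookups
avg_steps = total_steps / total_lookups
print(f"{fast_share:.0%} of lookups take 1 step, "
      f"but the average is ~{avg_steps:.0f} steps")
```

Even with roughly 65% of lookups finishing in a single step, the slow tail pushes the average well past 1000 steps per lookup.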
This patch implements a simple hash map based on the approach used for Python dicts since Python 3.6 (although this implementation lacks many of the features).
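The core idea of that approach can be sketched as follows. This is a minimal illustration of the CPython 3.6 compact-dict layout, not the code in this patch: a sparse index table points into a dense, insertion-ordered entry array. Resizing, deletion, and the perturb-based probing of the real CPython implementation are omitted:

```python
# Minimal sketch of the compact-dict idea (CPython >= 3.6), not the patch
# code: a sparse index table plus a dense, insertion-ordered entry array.
class CompactMap:
    def __init__(self, capacity=64):
        # capacity must be a power of two and exceed the item count
        # (no resizing in this sketch).
        self.indices = [None] * capacity  # sparse open-addressing table
        self.entries = []                 # dense (key, value) pairs

    def _probe(self, key):
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while True:
            slot = self.indices[i]
            if slot is None or self.entries[slot][0] == key:
                return i
            i = (i * 5 + 1) & mask  # full-period probe over power-of-two table

    def insert(self, key, value):
        i = self._probe(key)
        if self.indices[i] is None:
            self.indices[i] = len(self.entries)
            self.entries.append((key, value))
        else:
            self.entries[self.indices[i]] = (key, value)  # update in place

    def lookup(self, key):
        slot = self.indices[self._probe(key)]
        return self.entries[slot][1] if slot is not None else None

    def keys(self):
        # Iteration follows insertion order "for free".
        return [k for k, _ in self.entries]
```

Insertion is an O(1) append plus one probe, lookup is an expected O(1) probe, and the dense entry array keeps insertion order as a side effect.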
Note: If someone decides to add back the lasthit approach, that is still possible because the data structure preserves insertion order (and could also be ordered in any way)!
However, I don't think that is necessary: lookup time is no longer the significant factor, since there are 4x more insertions than lookups.
Another benefit of this patch is that, imo, it reduces code complexity by getting rid of lasthit, an implementation detail that had leaked into different parts of the loading code.
For actual numbers, see comment below.