Caches

Many functions in uproot may be given a cache to avoid expensive re-reading or re-calculation of previously read or calculated quantities. These functions assume nothing about the cache other than a dict-like interface: square brackets get old values and set new ones, and in checks for existence. Therefore, an ordinary dict may be used as a “save forever” cache, though you might not have enough memory to keep all data in memory forever.
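For example, an ordinary dict works as a never-evicting cache. The following is a minimal sketch; the file name, TTree name, and branch name are hypothetical:

    import uproot

    cache = {}                                    # a plain dict: the "save forever" cache
    tree = uproot.open("data.root")["events"]     # hypothetical file and TTree
    px = tree.array("px", cache=cache)            # first call reads the file and fills the cache
    px = tree.array("px", cache=cache)            # second call is served from the dict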

The classes described on this page are drop-in replacements for dict with additional properties: least-recently-used (LRU) eviction, which drops the least recently used items to stay within a memory budget, as well as thread safety and process safety.

This interface, in which the user instantiates and passes the cache object explicitly instead of turning on an internal cache option, is intended to avoid situations in which the user can’t determine where large amounts of memory are accumulating. When uproot reads a ROOT file, it does not save anything for reuse except what it puts in user-provided caches, so the user can always inspect these objects to see what’s being saved.

The array-reading functions (in TTreeMethods and TBranchMethods) each have three cache parameters:

  • cache for fully interpreted data. Accessing the same arrays with a different interpretation or a different entry range results in a cache miss.
  • basketcache for raw basket data. Accessing the same arrays with a different interpretation or a different entry range fully utilizes this cache, since the interpretation/construction from baskets is performed after retrieving data from this cache.
  • keycache for basket TKeys. TKeys are small, but require file access, so caching them can speed up repeated access.

Passing explicit caches to basketcache and keycache will ensure that the file is read only once, while passing explicit caches to cache and keycache will ensure that the file is read and interpreted only once.
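For example, the following sketch (again with hypothetical file, TTree, and branch names) bounds all three caches with LRU eviction, so that repeated reads avoid touching the file while memory use stays within a budget:

    import uproot
    from uproot.cache.memorycache import MemoryCache

    cache       = MemoryCache(1024**3)        # 1 GB for fully interpreted arrays
    basketcache = MemoryCache(1024**3)        # 1 GB for raw basket data
    keycache    = MemoryCache(10 * 1024**2)   # 10 MB for basket TKeys

    tree = uproot.open("data.root")["events"] # hypothetical file and TTree
    arrays = tree.arrays(["px", "py", "pz"],
                         cache=cache,
                         basketcache=basketcache,
                         keycache=keycache)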

uproot.cache.MemoryCache

class uproot.cache.memorycache.MemoryCache(limitbytes, spillover=None, spill_immediately=False, items=(), **kwds)

A dict with a least-recently-used (LRU) eviction policy.

This class implements every dict method and is a subclass of dict, so it may be used anywhere dict is used. Unlike dict, the least recently used key-value pairs are removed to avoid exceeding a user-specified memory budget. The memory budget may be temporarily exceeded while an item is being set. The memory use of the key and value is computed with sys.getsizeof, and the memory use of internal data structures (a list, a dict, and some counters) is included in the memory budget. (The memory used by Numpy arrays is fully accounted for by sys.getsizeof.)

Like dict, this class is not thread-safe.

Like dict, keys may be any hashable type.

Attributes, properties, and methods:

  • numbytes (int) the number of bytes currently stored in this cache.
  • numevicted (int) the number of key-value pairs that have been evicted.
  • promote(key) declare a key to be the most recently used; raises KeyError if key is not in this cache (does not check spillover).
  • spill(key) copies a key-value pair to the spillover, if any; raises KeyError if key is not in this cache (does not check spillover).
  • spill() copies all key-value pairs to the spillover, if any.
  • do(key, function) returns the value associated with key if it exists; if the key is not yet in the cache, calls the zero-argument function, stores the result under key, and returns it.
  • all dict methods, following Python 3 conventions, in which keys, values, and items return iterators, rather than lists.
Parameters:
  • limitbytes (int) – the memory budget expressed in bytes. Note that this is a required parameter.
  • spillover (None or another dict-like object) –

    another cache to use as a backup for this cache:

    • when key-value pairs are evicted from this cache (if spill_immediately=False) or put into this cache (if spill_immediately=True), they are also inserted into the spillover;
    • when keys are not found in this cache, the spillover is checked and, if found, the key-value pair is reinstated in this cache;
    • when the user explicitly deletes a key-value pair from this cache, it is deleted from the spillover as well.

    Usually, the spillover for a MemoryCache is a DiskCache so that data that do not fit in memory migrate to disk, or so that the disk can be used as a persistency layer for data that are more quickly accessed in memory.

    The same key-value pair might exist in both this cache and in the spillover cache because reinstating a key-value pair from the spillover does not delete it from the spillover. When spill_immediately=True, every key-value pair in this cache is also in the spillover cache (assuming the spillover has a larger memory budget).

  • spill_immediately (bool) – if False (default), key-value pairs are only copied to the spillover cache when they are about to be evicted from this cache (the spillover is a backup); if True, key-value pairs are copied to the spillover cache immediately after they are inserted into this cache (the spillover is a persistency layer).
  • items (iterable of key-value 2-tuples) – ordered pairs to insert into this cache; same meaning as in dict constructor. Unlike dict, the order of these pairs is relevant: the first item in the list is considered the least recently “used”.
  • **kwds – key-value pairs to insert into this cache; same meaning as in dict constructor.
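As an illustration of the constructor, the spillover, and the do method, here is a minimal sketch (the disk-cache directory is hypothetical; DiskCache keys must be strings, and arrayread/arraywrite serialize Numpy arrays):

    import numpy
    from uproot.cache.memorycache import MemoryCache
    from uproot.cache.diskcache import DiskCache, arrayread, arraywrite

    spillover = DiskCache.create(10 * 1024**3, "/tmp/diskcache",   # 10 GB on disk (hypothetical path)
                                 read=arrayread, write=arraywrite)
    cache = MemoryCache(100 * 1024**2, spillover=spillover)        # 100 MB in memory; evicted pairs migrate to disk

    cache["muon_pt"] = numpy.arange(1000000, dtype=numpy.float64)  # ordinary dict-style assignment
    eta = cache.do("muon_eta", lambda: numpy.zeros(1000000))       # computed only if not already cached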

uproot.cache.ThreadSafeMemoryCache

class uproot.cache.memorycache.ThreadSafeMemoryCache(limitbytes, spillover=None, spill_immediately=False, items=(), **kwds)

A dict with a least-recently-used (LRU) eviction policy and thread safety.

This class is a thread-safe version of MemoryCache, protected by a global lock. Every method acquires the lock upon entry and releases it upon exit.

See MemoryCache for details on the constructor and methods.
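For example, one ThreadSafeMemoryCache can be shared by several reader threads. This is a sketch; the file, TTree, and branch names are hypothetical:

    import threading
    import uproot
    from uproot.cache.memorycache import ThreadSafeMemoryCache

    basketcache = ThreadSafeMemoryCache(1024**3)        # one cache shared by all threads

    def read(branchname):
        tree = uproot.open("data.root")["events"]       # hypothetical file and TTree
        tree.array(branchname, basketcache=basketcache)

    threads = [threading.Thread(target=read, args=(name,)) for name in ("px", "py", "pz")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()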

uproot.cache.ThreadSafeDict

class uproot.cache.memorycache.ThreadSafeDict(items=(), **kwds)

A dict with thread safety.

This class is a direct subclass of dict with a global lock. Every method acquires the lock upon entry and releases it upon exit.

uproot.cache.DiskCache

class uproot.cache.diskcache.DiskCache(*args, **kwds)

A persistent, dict-like object with a least-recently-used (LRU) eviction policy.

This class is not a subclass of dict, but it implements the major features of the dict interface:

  • square brackets get objects from the cache (__getitem__), put them in the cache (__setitem__), and delete them from the cache (__delitem__);
  • in checks for key existence (__contains__);
  • keys, values, and items return iterators over cache contents.

Unlike dict, the least recently used key-value pairs are removed to avoid exceeding a user-specified memory budget. The memory budget may be temporarily exceeded during the process of setting the item.

Unlike dict, all data is stored in a POSIX filesystem. The only state the in-memory object maintains is a read-only config, the directory name, and the read and write functions for deserializing/serializing objects.

Unlike dict, this cache is thread-safe and process-safe: several processes can read from and write to the same cache concurrently, and these threads/processes do not need to be aware of each other (so they can start and stop at will). The first thread/process calls create to make a new cache directory and the rest join the existing directory. Since the cache is on disk, it can be joined even if all processes are killed and restarted.

Do not use the DiskCache constructor: create instances using create and join only.

The cache is configured by a read-only config.json file and its changing state is tracked with a state.json file. Key lookup is performed through a shared, memory-mapped lookup.npy file. When the cache must be locked, the lock is taken on the lookup.npy file (fcntl.LOCK_EX). Read and write operations hold the lock only while hard-linking or renaming files; bulk reading and writing is performed outside the lock.

The names of the keys and their priority order are encoded in a subdirectory tree, which is updated in such a way that no directory exceeds a maximum number of subdirectories and the least and most recently used keys can be identified without traversing all of the keys.

The lookup.npy file is a binary-valued hashmap. If two keys hash to the same value, collisions are resolved via JSON files. Collisions are very expensive and should be avoided by providing enough slots in the lookup.npy file.

Unlike dict, keys must be strings.

Attributes, properties, and methods:

  • numbytes (int) the number of bytes currently stored in the cache.
  • config.limitbytes (int) the memory budget expressed in bytes.
  • config.lookupsize (int) the number of slots in the hashmap lookup.npy (increase this to reduce collisions).
  • config.maxperdir (int) the maximum number of subdirectories per directory.
  • config.delimiter (str) used to separate order prefix from keys.
  • config.numformat (str) Numpy dtype of the lookup.npy file (as a string).
  • state.numbytes (int) see numbytes above.
  • state.depth (int) current depth of the subdirectory tree.
  • state.next (int) next integer in the priority queue.
  • refresh_config() re-reads config from config.json.
  • promote(key) declare a key to be the most recently used; raises KeyError if key is not in the cache.
  • keys() locks the cache and returns an iterator over keys; cache is unlocked only when iteration finishes (so evaluate this quickly to avoid blocking the cache for all processes).
  • values() locks the cache and returns an iterator over values; same locking warning.
  • items() locks the cache and returns an iterator over key-value pairs; same locking warning.
  • destroy() deletes the directory; all subsequent actions are undefined.
  • do(key, function) returns the value associated with key if it exists; if the key is not yet in the cache, calls the zero-argument function, stores the result under key, and returns it.
static DiskCache.create(limitbytes, directory, read=anyread, write=anywrite, lookupsize=10000, maxperdir=100, delimiter='-', numformat=numpy.uint64)

Create a new disk cache.

Parameters:
  • limitbytes (int) – the memory budget expressed in bytes.
  • directory (str) – local path to the directory to create as a disk cache. If a file or directory exists at that location, it will be overwritten.
  • read (function (filename, cleanup) ⇒ data) – deserialization function, used by “get” to turn files into Python objects (such as arrays). This function must call cleanup() when reading is complete, regardless of whether an exception occurs.
  • write (function (filename, data) ⇒ None) – serialization function, used by “put” to turn Python objects (such as arrays) into files. The return value of this function is ignored.
  • lookupsize (int) – the number of slots in the hashmap lookup.npy (increase this to reduce collisions).
  • maxperdir (int) – the maximum number of subdirectories per directory.
  • delimiter (str) – used to separate order prefix from keys.
  • numformat (numpy.dtype) – type of the lookup.npy file.
Returns:

first view into the disk cache.

Return type:

DiskCache

static DiskCache.join(directory, read=anyread, write=anywrite, check=True)

Instantiate a view into an existing disk cache.

Parameters:
  • directory (str) – local path to the directory to view as a disk cache.
  • read (function (filename, cleanup) ⇒ data) – deserialization function, used by “get” to turn files into Python objects (such as arrays). This function must call cleanup() when reading is complete, regardless of whether an exception occurs.
  • write (function (filename, data) ⇒ None) – serialization function, used by “put” to turn Python objects (such as arrays) into files. The return value of this function is ignored.
  • check (bool) – if True (default), verify that the structure of the directory is a properly formatted disk cache, raising ValueError if it isn’t.
Returns:

view into the disk cache.

Return type:

DiskCache
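A sketch of creating a cache in one process and joining it from another (or from a later run); the directory path and key name are hypothetical:

    import numpy
    from uproot.cache.diskcache import DiskCache, arrayread, arraywrite

    # first process: create the cache directory (anything already at that path is overwritten)
    cache = DiskCache.create(10 * 1024**3, "/tmp/diskcache",
                             read=arrayread, write=arraywrite)
    cache["jet_pt"] = numpy.random.normal(50.0, 10.0, 1000000)

    # any other process, or a later run: join the existing cache directory
    cache2 = DiskCache.join("/tmp/diskcache", read=arrayread, write=arraywrite)
    assert "jet_pt" in cache2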

uproot.cache.arrayread

uproot.cache.diskcache.arrayread(filename, cleanup)

Sample deserialization function; reads Numpy files (*.npy) into Numpy arrays.

To be used as an argument to create or join.

Parameters:
  • filename (str) – local path to read.
  • cleanup (function () ⇒ None) – cleanup function to call after reading is complete.
Returns:

Numpy array.

Return type:

numpy.ndarray

uproot.cache.arraywrite

uproot.cache.diskcache.arraywrite(filename, obj)

Sample serialization function; writes Numpy arrays into Numpy files (*.npy).

To be used as an argument to create or join.

Parameters:
  • filename (str) – local path to overwrite.
  • obj (numpy.ndarray) – array to write.
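arrayread and arraywrite illustrate the read/write contract; a hypothetical pickle-based pair would follow the same pattern (in particular, read must call cleanup() even if deserialization fails):

    import pickle

    def pickleread(filename, cleanup):
        # deserialization: turn a file into a Python object; always call cleanup()
        try:
            with open(filename, "rb") as f:
                return pickle.load(f)
        finally:
            cleanup()

    def picklewrite(filename, obj):
        # serialization: turn a Python object into a file; the return value is ignored
        with open(filename, "wb") as f:
            pickle.dump(obj, f)

    # pass these to DiskCache.create or DiskCache.join as read=pickleread, write=picklewrite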

uproot.cache.memmapread

uproot.cache.diskcache.memmapread(filename, cleanup)

Lazy deserialization function; reads Numpy files (*.npy) as a memory-map.

To be used as an argument to create or join.

Parameters:
  • filename (str) – local path to read.
  • cleanup (function () ⇒ None) – cleanup function to call after reading is complete.
Returns:

a wrapped memory-map whose cleanup function is called when the object is destroyed (__del__).

Return type:

wrapped numpy.core.memmap
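
For example, memmapread can be passed to join so that large arrays are lazily memory-mapped rather than fully read into memory. This is a sketch; the directory path and key are hypothetical and assume the cache was filled with arraywrite:

    from uproot.cache.diskcache import DiskCache, memmapread, arraywrite

    cache = DiskCache.join("/tmp/diskcache", read=memmapread, write=arraywrite)
    pt = cache["jet_pt"]          # memory-mapped: pages are read from disk on demand
    total = pt.sum()              # the wrapper's cleanup runs when pt is garbage-collected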