# 22. Cache Strategy [common]
Date: 2023-12-29
## Status
Accepted
## Context
We are addressing the need for efficient service response times, reduced load on external services, and improved data reuse. This decision balances speed, data freshness, and system resilience.
> [!NOTE]
> A comprehensive understanding of caching’s impact on system performance and stability is key.
## Decision
Our strategy will integrate both internal and external caching methods to optimize performance and scalability.
### Strategy Components
- Internal Caching:
  - Use Case: Data that requires rapid, low-latency access.
  - Benefits: High speed; no network round trips.
- External Caching:
  - Use Case: Managing larger data volumes shared across service instances.
  - Benefits: Easier scalability; the cache survives individual service restarts.
> [!TIP]
> We will employ a mix of ‘Cache Aside’, ‘Cache Through’ (read-/write-through), and ‘Cache Ahead’ (refresh-ahead) strategies, depending on the specific requirements of each use case. A minimal Cache Aside sketch follows.
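To make the Cache Aside flow concrete, here is a minimal, illustrative Go sketch of an in-process cache with TTLs. The `Cache` type, the `GetOrLoad` helper, and the `load` callback are hypothetical names for this sketch, not existing components.

```go
package cache

import (
	"sync"
	"time"
)

// entry wraps a cached value with its expiry time.
type entry struct {
	value     []byte
	expiresAt time.Time
}

// Cache is a minimal in-process (internal) cache.
type Cache struct {
	mu    sync.RWMutex
	items map[string]entry
}

func New() *Cache {
	return &Cache{items: make(map[string]entry)}
}

// Get returns a value if it is present and has not expired.
func (c *Cache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.value, true
}

// Set stores a value with a time-to-live.
func (c *Cache) Set(key string, value []byte, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(ttl)}
}

// GetOrLoad is the Cache Aside flow: check the cache first; on a miss,
// load from the source of truth and populate the cache for next time.
func (c *Cache) GetOrLoad(key string, ttl time.Duration, load func(string) ([]byte, error)) ([]byte, error) {
	if v, ok := c.Get(key); ok {
		return v, nil // hit: no network or database round trip
	}
	v, err := load(key) // miss: fetch from the source
	if err != nil {
		return nil, err
	}
	c.Set(key, v, ttl)
	return v, nil
}
```

On a hit, `GetOrLoad` answers from memory; on a miss, it calls the supplied loader (e.g., a database query) and populates the cache. That division of responsibility, where the application rather than the cache talks to the source of truth, is what distinguishes Cache Aside from the read-/write-through variants.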
### Data Eviction Policies
| Policy | Description |
|---|---|
| FIFO | First-In-First-Out: the oldest data in the cache is evicted first. |
| LIFO | Last-In-First-Out: the most recently added data is evicted first. |
| Random | A randomly selected entry is evicted. |
| LRU | Least Recently Used: the data not accessed for the longest time is evicted. |
| MRU | Most Recently Used: the most recently accessed data is evicted, preserving older entries; useful when the newest items are the least likely to be needed again. |
| LFU | Least Frequently Used: the least often accessed data is evicted. |
| Belady’s OPT | Optimal Page Replacement: evicts the page that will not be used for the longest time in the future. Theoretically optimal, but it requires knowledge of future accesses, so it serves as a benchmark rather than a practical policy. |
| Second Chance | A modification of FIFO that gives a page a “second chance” if it has been accessed recently. |
| Clock | Similar to Second Chance; organizes pages in a circular queue and gives each a second chance before eviction. |
| 2Q | Maintains two queues, one for recently accessed pages and one for frequently accessed pages, combining the recency and frequency aspects of LRU and LFU. |
| SLRU | Segmented LRU: divides the cache into a probationary and a protected segment to differentiate between recently and frequently used data. |
| TLRU | Time-Aware LRU: an extension of LRU that considers the age of the data, not just the usage pattern. |
| LRU-k | An extension of LRU that tracks the times of the last k references to each page and uses them for eviction decisions. |
> [!NOTE]
> Selecting the right eviction policy is crucial, as it affects the efficiency of the caching mechanism. It should align with the specific data access patterns of the application.
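To ground the table in code, below is a minimal sketch of LRU, one of the most widely used of the policies above, built on Go’s `container/list`. The `LRU` type and its methods are illustrative, not an existing component.

```go
package cache

import "container/list"

// LRU evicts the entry that has not been accessed for the longest time.
// Capacity must be > 0.
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used, back = least
	items    map[string]*list.Element // key -> element in order
}

type lruEntry struct {
	key   string
	value any
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns a value and marks it as the most recently used.
func (c *LRU) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // record the access
	return el.Value.(*lruEntry).value, true
}

// Put inserts or updates a value, evicting the least recently used
// entry when the cache is full.
func (c *LRU) Put(key string, value any) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back() // least recently used entry
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, value: value})
}
```

Several of the other policies differ only in this bookkeeping: FIFO never calls `MoveToFront` on access, Second Chance adds a reference bit before evicting, and LFU replaces the recency list with access counters.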
### Error Caching
Caching errors (sometimes called negative caching) can be a strategic approach to further reduce load and avoid repeatedly fetching data that is known to fail; a brief sketch appears below.
- Implementation: Store error responses in the cache.
- Benefits: Subsequent requests retrieve the error from the cache, preventing unnecessary load on the data source.
- Mitigation of Cache Miss Attacks: Prevents repeated cache misses caused by requests for data that is known to be erroneous.
> [!WARNING]
> Care must be taken to set appropriate expiry times for cached errors to avoid persisting stale error states.
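As an illustration, here is a sketch of error caching with separate TTLs for successes and failures. All names and TTL values are illustrative assumptions, not a prescribed implementation.

```go
package cache

import (
	"sync"
	"time"
)

// result caches either a value or the error the data source returned.
type result struct {
	value     []byte
	err       error
	expiresAt time.Time
}

// ErrorCachingStore caches successes and failures with different TTLs,
// so repeated requests for known-bad keys do not hammer the data source.
type ErrorCachingStore struct {
	mu         sync.Mutex
	items      map[string]result
	successTTL time.Duration // e.g. 5 * time.Minute
	errorTTL   time.Duration // deliberately short (e.g. 30s) so stale errors expire quickly
	load       func(key string) ([]byte, error)
}

func NewErrorCachingStore(successTTL, errorTTL time.Duration, load func(string) ([]byte, error)) *ErrorCachingStore {
	return &ErrorCachingStore{
		items:      make(map[string]result),
		successTTL: successTTL,
		errorTTL:   errorTTL,
		load:       load,
	}
}

// Get serves cached successes *and* cached errors; only expired entries
// trigger a new call to the data source.
func (s *ErrorCachingStore) Get(key string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.items[key]; ok && time.Now().Before(r.expiresAt) {
		return r.value, r.err
	}
	value, err := s.load(key)
	ttl := s.successTTL
	if err != nil {
		ttl = s.errorTTL // per the warning above: errors get a shorter expiry
	}
	s.items[key] = result{value: value, err: err, expiresAt: time.Now().Add(ttl)}
	return value, err
}
```

For brevity the sketch holds a single lock across the load call; a real implementation would use per-key locking or `golang.org/x/sync/singleflight` to avoid serializing unrelated misses.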
### Effectiveness of Caching
To ensure that the caching system is effective, we will regularly measure its performance.
#### Average Response Time Formula
AverageTime = DBAccessTime * CacheMissRate + CacheAccessTime
- DBAccessTime: Time to retrieve data from the database (e.g., 100ms).
- CacheAccessTime: Time to retrieve data from the cache (e.g., 20ms).
- CacheMissRate: The fraction of requests that miss the cache (e.g., 0.1).
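As a quick worked example with the values above:

$$
\text{AverageTime} = 100\,\text{ms} \times 0.1 + 20\,\text{ms} = 30\,\text{ms}
$$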
The formula assumes the cache is consulted on every request, so each lookup pays CacheAccessTime and only misses additionally pay DBAccessTime. A high CacheMissRate indicates that caching may be counterproductive: with the example values, the break-even point is CacheMissRate = (DBAccessTime - CacheAccessTime) / DBAccessTime = (100 - 20) / 100 = 0.8, and above that the average response time is worse than querying the database directly.
#### Monitoring with Prometheus Metrics
We will use Prometheus metrics to monitor key indicators such as cache hit and miss rates and response times. This will help us identify performance issues and make the necessary adjustments.
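A sketch of how these indicators could be exposed with the Go Prometheus client (`github.com/prometheus/client_golang`); the metric names and the `instrumentedGet` wrapper are illustrative choices, not an existing API.

```go
package cache

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative metric names; adjust to the project's conventions.
var (
	cacheHits = promauto.NewCounter(prometheus.CounterOpts{
		Name: "cache_hits_total",
		Help: "Requests served from the cache.",
	})
	cacheMisses = promauto.NewCounter(prometheus.CounterOpts{
		Name: "cache_misses_total",
		Help: "Requests that fell through to the data source.",
	})
	lookupDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "cache_lookup_duration_seconds",
		Help:    "Time spent serving a lookup, including source fetches on misses.",
		Buckets: prometheus.DefBuckets,
	})
)

// instrumentedGet wraps any cache lookup and records hit/miss and latency.
func instrumentedGet(key string, lookup func(string) ([]byte, bool)) ([]byte, bool) {
	timer := prometheus.NewTimer(lookupDuration)
	defer timer.ObserveDuration()

	value, hit := lookup(key)
	if hit {
		cacheHits.Inc()
	} else {
		cacheMisses.Inc()
	}
	return value, hit
}
```

The CacheMissRate from the formula above can then be derived in PromQL, e.g. `rate(cache_misses_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))`.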
## Consequences
### Benefits
- Enhanced Service Response: Significantly faster responses.
- Reduced External Service Load: Lower strain on external systems.
### Challenges and Mitigation
> [!WARNING]
> This strategy introduces complexity in cache management, particularly around maintaining data freshness and warming the cache after downtime.
- Regular Monitoring: Essential for identifying and resolving cache pollution and stale data.
- Parameter Adjustment: Continuous tuning of cache settings to ensure effectiveness.
> [!NOTE]
> Ongoing assessment and adaptation to changing data patterns and usage scenarios are crucial for the success of our caching strategy.