
Initial caching decorator for HttpRequestExecutors. #27

Open
dmfs wants to merge 2 commits into main from feature/16-caching-decorator

Conversation

@dmfs (Owner) commented Sep 19, 2016

Implements #16

@dmfs dmfs added this to the 0.11 milestone Sep 19, 2016
@dmfs (Owner, Author) commented Sep 19, 2016

I've created an initial Cached decorator. It doesn't satisfy all requirements yet, but it should be sufficient for what we need in SmoothSetup.
There are still a few things to improve though.
@lemonboston please review and make suggestions on how to improve the (de-) serialization of request and response from/to the cache streams.

FTR, the basic idea is to write the relevant request data and the cached response data in a simple format to a stream and de-serialize it when necessary.
The file format is something like this:

<CACHE FILE VERSION>
<REQUEST METHOD>
<REQUEST URI>
<REQUEST HEADERS>

<RESPONSE URI>
<RESPONSE STATUS LINE>
<RESPONSE HEADERS>

<RESPONSE BODY>

Everything below <RESPONSE URI> looks basically like the original response, apart from any transfer encoding and the newline sequence (HTTP uses a CRLF newline sequence, while this format uses plain LF characters for simplicity).
Should we switch to CRLF to allow the use of a standard response parser (once we have one)?
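
For illustration only, a writer for this format could look roughly like the sketch below; the class and parameter names are hypothetical and it uses plain JDK I/O, so it is not the serialization code in this branch.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch of writing the cache file format described above.
final class CacheFileWriter
{
    private static final String VERSION = "1";

    void write(OutputStream out, String method, URI requestUri, Map<String, String> requestHeaders,
            URI responseUri, String statusLine, Map<String, String> responseHeaders, InputStream body)
            throws IOException
    {
        // request section: cache file version, method, URI and headers
        writeLine(out, VERSION);
        writeLine(out, method);
        writeLine(out, requestUri.toASCIIString());
        for (Map.Entry<String, String> header : requestHeaders.entrySet())
        {
            writeLine(out, header.getKey() + ": " + header.getValue());
        }
        writeLine(out, "");

        // response section: response URI, status line and headers
        writeLine(out, responseUri.toASCIIString());
        writeLine(out, statusLine);
        for (Map.Entry<String, String> header : responseHeaders.entrySet())
        {
            writeLine(out, header.getKey() + ": " + header.getValue());
        }
        writeLine(out, "");

        // the response body follows verbatim after the blank line
        byte[] buffer = new byte[8192];
        int read;
        while ((read = body.read(buffer)) != -1)
        {
            out.write(buffer, 0, read);
        }
    }

    private void writeLine(OutputStream out, String line) throws IOException
    {
        // LF line endings, as noted above (not CRLF)
        out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }
}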

To use the decorator you would do something like this:

HttpRequestExecutor executor = new Cached(
    uncachedExecutor,
    new AllowGetCachePolicy(),
    new SimpleCache(
        new Gzipped(
            new FilesystemStorage(
                new File("/path/to/cache/dir"),
                0))));

@dmfs (Owner, Author) commented Sep 19, 2016

Also, I'm not 100% sure if it's a good idea to make this a separate module. Would you say it's better if this goes to the decorators module?
The point is that the Cached implementation (with all the supplementary classes) will potentially be large, while the other decorators are rather small. On the other hand, I'm planning to outsource some of the cache logic (i.e. the cache and the storage packages) since they might be useful in other scenarios as well. This will reduce the complexity of the module.

@lemonboston commented
Wow, you just wrote a whole module :)

I looked into it; here are the things I noted. Any of them can easily be wrong of course, but I hope there is something useful among them:


I think it's good as a separate module because it's big, as you mentioned, so clients don't need to bundle it if they only use Following, for example. If it is stripped down by outsourcing some parts later, I guess it can still be moved to decorators at that point if you see fit.


Regarding the cache versioning, I think it could be a subdirectory instead of a file header (I think I saw this practice somewhere). The version number could then be handled internally by the Cache/Storage rather than by Cached, so there would be no need to check it at that high level; the Cache/Storage would simply not return old-version entries. (Removing an old version subfolder could be part of general eviction (btw, do you plan anything for eviction?) or part of (first-time) initialization.)


I think it would be good if the file format was internal to a component that handles both reads and writes, so the format definition belongs to it and other implementations could use different ones. (Currently the format is split between Cached and SimpleResponseFactory.)
As I looked at it, I felt that a layer might be missing: the Cached implementation probably should not need to be concerned with low-level stream reads and writes in specific formats. So I thought a bit about what a cache should provide at a high level with our API and came up with this:

public interface Cache2
{
    // returns null if no response is cached for this request
    HttpResponse get(URI uri, HttpRequest request);

    // returns a decorated response that also writes the response body to the cache
    HttpResponse put(URI uri, HttpRequest request, HttpResponse response);
}

Most of the existing code would go under it with a little refactoring. In particular, there may not be a need for CacheEntity, but of course there is a chance I'm missing something.
I saw the duplicate stream write; this could again be handled internally by the Cache, by decorating the result of the put call.
Defining the cache key would also be internal to the actual Cache this way.
The Cached decorator would simplify to this, and it would be general:

public final class Cached2 implements HttpRequestExecutor
{
    private final HttpRequestExecutor mDelegate;
    private final CachePolicy mCachePolicy;
    private final Cache2 mCache;


    public Cached2(HttpRequestExecutor delegate, CachePolicy cachePolicy, Cache2 cache)
    {
        mDelegate = delegate;
        mCachePolicy = cachePolicy;
        mCache = cache;
    }


    @Override
    public <T> T execute(URI uri, HttpRequest<T> request) throws IOException, ProtocolError, ProtocolException, RedirectionException, UnexpectedStatusException
    {
        if (!mCachePolicy.isCacheable(request))
        {
            return mDelegate.execute(uri, request);
        }

        HttpResponse response = mCache.get(uri, request);
        if (response == null)
        {
            return executeCached(uri, request);
        }

        return request.responseHandler(response).handleResponse(response);
    }


    private <T> T executeCached(final URI uri, final HttpRequest<T> request) throws IOException, ProtocolError, ProtocolException
    {
        HttpRequest<T> decoratedRequest = new ResponseDecorated<T>(request, new Decoration<HttpResponse>()
        {
            @Override
            public HttpResponse decorated(final HttpResponse original)
            {
                if (!mCachePolicy.isCacheable(original))
                {
                    return original;
                }
                return mCache.put(uri, request, original);
            }
        });
        return mDelegate.execute(uri, decoratedRequest);
    }

}

Btw, I didn't understand how these conditions could be true if the cache entry was for that request, so I didn't include them above:

if (!request.method().verb().equals(verb))
..
if (!uriString.equals(uri.toASCIIString()))

I don't know if there is a need for clearing an entry, but a method could be added to Cache: void clear(uri, request) for that.


Should we switch to CRLF to allow the use of a standard response parser (once we have one)?

I am not sure I understand the question. Do you mean that a response parser (used by, say, the smoothsync api client, but general enough to be used in other places too) could also be used to parse the stored cached response? If yes, I guess it would make sense.

@lemonboston commented
please review and make suggestions on how to improve the (de-) serialization of request and response from/to the cache streams.

I am not sure whether you meant the serialization implementation or the design; sorry that I only commented on the design. If that proposal makes sense I would start from it, and after the refactoring I would expect serialization and deserialization to end up in the same class, which would make it easier to see whether there is anything to improve, reuse or change there.

@lemonboston commented
One more small note:
The put method on the proposed Cache could alternatively just return an OutputStream, and the stream duplication decoration would be applied in Cached, since that part is general and would probably be the same for all Cache implementations.
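
To illustrate the stream duplication idea, here is a minimal sketch of a "tee" stream (a hypothetical helper using plain JDK types, not code from this branch) that copies the response body into the cache's OutputStream while the receiver consumes it.

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies every byte read from the response body into the cache's OutputStream,
// so the body is cached as a side effect of the receiver consuming it.
final class TeeInputStream extends FilterInputStream
{
    private final OutputStream mCacheOut;

    TeeInputStream(InputStream responseBody, OutputStream cacheOut)
    {
        super(responseBody);
        mCacheOut = cacheOut;
    }

    @Override
    public int read() throws IOException
    {
        int b = super.read();
        if (b >= 0)
        {
            mCacheOut.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException
    {
        int count = super.read(buffer, offset, length);
        if (count > 0)
        {
            mCacheOut.write(buffer, offset, count);
        }
        return count;
    }

    @Override
    public void close() throws IOException
    {
        super.close();
        // finish the cache entry once the response body has been consumed
        mCacheOut.close();
    }
}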

@dmfs (Owner, Author) commented Sep 20, 2016

I like the idea of adding an HttpCache that takes care of (de-) serialization and the actual file format. Though I would still keep the current Cache interface as some sort of low-level cache. A generic cache might be useful in other (non-HTTP) use cases as well.

Also, I prefer to keep the CacheEntity. Not only do I think it's useful as a wrapper for the cache entry as such, but it can also serve as some kind of "handle" to the entry. Once we support evicting elements from the cache, we need a "handle" that can be locked to make sure we don't evict any cache entries while a request for that entry is going on.

Keeping different versions of the cached files in different directories sounds nice, but is difficult to do, mostly because the cache implementation doesn't know anything about the actual storage.

The tests make sure that the cached request matches the sent request. The cache key is derived from a hash algorithm, and while it's very unlikely that two different requests have the same key, it's still possible. The cache must always work correctly and should never return the wrong response for a request. These additional checks are cheap, so there is no reason not to perform them. Later on we may have to perform additional checks.
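
For illustration, key derivation along those lines might look like the sketch below (a hypothetical helper assuming a SHA-256 hash over verb and URI; not the actual implementation in this branch). It also shows why the stored verb and URI are re-checked on a hit.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical key derivation: two different requests can in principle hash to
// the same key, which is why the cached entry also stores verb and URI and the
// cache re-checks them when it finds an entry.
final class HashedCacheKey
{
    static String of(String verb, URI uri)
    {
        try
        {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(verb.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) '\n');
            digest.update(uri.toASCIIString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest())
            {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException e)
        {
            // SHA-256 is required to be available on every JVM
            throw new RuntimeException(e);
        }
    }
}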

Regarding CRLF: this is what HTTP uses. The cache pretty much stores the original response. The idea is: if we store the response in an HTTP compliant way, we could use a standard HTTP parser to parse the cached response. In other words: if we decide to implement our own ("low level") HTTP client, we could use the same parser to parse the cache entries.

Here is another design consideration: in certain scenarios the Cached decorator needs the cached response to build a request. If the server returns an ETag, the client can use it to retrieve an entity only if the entity has changed on the server. The client sends the ETag in an If-None-Match header when doing a request. If the entity was changed, the server returns the entity as usual. Otherwise it returns a 304 status code, which is a redirect to the "cache". So if the server responds with 304, the Cached decorator should return the response from the cache. This means the client must not evict the cached response while the request is being executed, because it might still be needed in that case.
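
To make that flow concrete, here is a small sketch of a conditional GET using plain java.net.HttpURLConnection rather than this project's API; the names are illustrative only.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical illustration of the If-None-Match / 304 flow described above.
final class ConditionalGetExample
{
    static InputStream fetch(URL url, String cachedEtag, InputStream cachedBody) throws IOException
    {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        if (cachedEtag != null)
        {
            // ask the server to return the entity only if it has changed
            connection.setRequestProperty("If-None-Match", cachedEtag);
        }

        if (connection.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED)
        {
            // 304: the cached entity is still current, so the cached response is
            // served; it therefore must not have been evicted in the meantime
            return cachedBody;
        }
        return connection.getInputStream();
    }
}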

@lemonboston commented
Ok, thanks for the further explanations and background. I see your points on keeping current Cache as well.

About using the version for the directory: it is a bit forced, but how about this:

public final class SimpleCache implements Cache
{
    private static final String CACHE_VERSION = "3"; 

    private final Storage mStorage;

    public SimpleCache(StorageFactory storageFactory)
    {
        this.mStorage = storageFactory.create(/*param: identifier*/CACHE_VERSION);
    }

And the factory for FilesystemStorage would append the version to the base dir.
Although the version might actually go into HttpCache, a level higher... anyway, just an idea to maybe solve it with some factory.
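
As a rough illustration of that factory idea (hypothetical names, with placeholder interfaces standing in for the module's storage abstractions; not code from this branch):

import java.io.File;

// Placeholder types standing in for the module's storage abstractions; only
// the version-to-subdirectory mapping matters in this sketch.
interface Storage { }
interface StorageFactory { Storage create(String identifier); }

final class FilesystemStorage implements Storage
{
    FilesystemStorage(File directory, int flags) { /* details omitted */ }
}

// Hypothetical factory that gives every cache version its own subdirectory.
final class VersionedStorageFactory implements StorageFactory
{
    private final File mBaseDir;

    VersionedStorageFactory(File baseDir)
    {
        mBaseDir = baseDir;
    }

    @Override
    public Storage create(String identifier)
    {
        // e.g. /path/to/cache/dir/3 for cache version "3"
        File versionDir = new File(mBaseDir, identifier);
        versionDir.mkdirs();
        return new FilesystemStorage(versionDir, 0);
    }
}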

Let me know if you would like me to check anything else.

@dmfs (Owner, Author) commented Sep 20, 2016

I think something like this should work for now:

public interface HttpCache
{

    /**
     * Returns a {@link CacheKey} for the given {@link HttpRequest}.
     * <p>
     * Note that the existence of a {@link CacheKey} does not mean that the cache contains any response for the given request.
     */
    CacheKey key(HttpRequest<?> request);

    /**
     * Checks if the cache contains an {@link HttpResponse} for the given {@link CacheKey}
     */
    boolean hasResponse(CacheKey key);

    /**
     * Returns the cached {@link HttpResponse} for the given {@link CacheKey}.
     */
    HttpResponse cachedResponse(CacheKey key) throws NoSuchElementException;

    /**
     * Caches the given {@link HttpResponse} for the given {@link HttpRequest}. Since the response will be consumed by this method, it returns another {@link
     * HttpResponse} that contains the same data and can be forwarded to the receiver.
     */
    HttpResponse cachedResponse(HttpRequest<?> request, HttpResponse response);
}

Not sure if that works, but we could use the CacheKey as a kind of lock to protect entries from being evicted, i.e. we store the CacheKey in a WeakHashMap and don't allow the respective entry to be removed as long as the key is still in the map (i.e. as long as someone else still holds a reference to it). We don't need to decide that now; for the initial release it's okay if the cache never evicts any entries.
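
A rough sketch of that idea (hypothetical names, using java.util.WeakHashMap; just an illustration, not a finished design):

import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Keys of in-flight requests are held weakly; as long as a request still holds
// a strong reference to its CacheKey, the key stays in the set and the
// corresponding entry is not evicted. Once no one references the key anymore,
// it silently disappears from the set.
final class EvictionGuard<K>
{
    private final Set<K> mKeysInUse =
            Collections.newSetFromMap(Collections.synchronizedMap(new WeakHashMap<K, Boolean>()));

    void markInUse(K cacheKey)
    {
        mKeysInUse.add(cacheKey);
    }

    boolean mayEvict(K cacheKey)
    {
        // an entry may only be evicted if no running request still holds its key
        return !mKeysInUse.contains(cacheKey);
    }
}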

@lemonboston commented
I probably don't know enough about the topic, but I am thinking about this use case of protecting entries from eviction while a request for them is ongoing. I suppose it would happen very rarely in practice, right? But to make it correct, it affects the design (and general implementation) now, so let's say 0.001% of calls shape the design. So I am wondering whether it could be handled differently, for example with retrying: for ETags, when the cached response got evicted during the call, we would retry without the ETag header to get a fresh response. It would make the logic for ETag handling (an executor decorator?) a bit more complex, but it would be contained there; it wouldn't need to affect other components or the overall design.
Or maybe we could pre-fetch the cached response from the cache for requests with ETag headers.

I am also not sure how eviction is usually implemented; do you maybe have any plans for it? Something scheduled, run asynchronously after each call, or run at initialization?

I like the separate hasResponse method on the interface and the other one that throws an exception.
Isn't the URI param missing from the 1st and the last method?
If I may add, I think different names for the last two methods would be better to avoid confusion, since they are semantically different operations.

@lemonboston commented
About the hasResponse() + cachedResponse() combination: I think this would be a clear use case for Optional. I cannot think of a better example of using it as a return value than this. It would clearly express that the value may legitimately be absent, without any need to read documentation about returning null or throwing an exception, and without having to warn callers to call a hasXY method first. Plus, looking at this eviction scenario: in theory, if we didn't use any extra locking, the response could be evicted between the hasResponse and cachedResponse calls, resulting in an exception. I know the CacheKey locking you mentioned would solve this too, but I just wanted to mention the Optional option here as I really miss it sometimes :).
(Any chance to use a 3rd party or our own Optional class?)
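
Just to illustrate what that could look like (a hypothetical variant of the HttpCache interface above, using java.util.Optional, with the URI added to key() and a different name for the write-through method; not a concrete proposal for the final API):

import java.net.URI;
import java.util.Optional;

// Hypothetical variant of the HttpCache sketch above (CacheKey, HttpRequest and
// HttpResponse are the types from that sketch). Absence of a cached response is
// expressed in the return type, so there is no window between a "has" check and
// the actual read.
public interface OptionalHttpCache
{
    CacheKey key(URI uri, HttpRequest<?> request);

    Optional<HttpResponse> cachedResponse(CacheKey key);

    HttpResponse cachingResponse(HttpRequest<?> request, HttpResponse response);
}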

@dmfs (Owner, Author) commented Sep 21, 2016

I don't have any specific plans for an eviction mechanism. But it surely depends on the algorithm whether we need to worry about concurrent eviction.

You're right about the URI parameter in the first method. We probably don't need it for the last one, because the response should already have it.

@dmfs dmfs force-pushed the feature/16-caching-decorator branch from 385b7c9 to 16e709b Compare May 9, 2017 11:25
@lemonboston lemonboston assigned dmfs and unassigned lemonboston Jun 27, 2017
@lemonboston commented
@dmfs I've reassigned this to you as I am not sure about the status of this.

@dmfs dmfs removed this from the 0.13 milestone Nov 9, 2017
