cs262a-paper/related_work.tex at master · aksp/cs262a-paper · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
\section{Related Work}

PeerDB is related to existing research in database cracking, materialized views, customizing key value stores, and discussions in the MongoDB development community. Our automatic embedding algorithm is related to work in automatically tuning databases.

\subsection{Materialized views}

A {\em view} is a function from a set of base tables to a derived table.
A {\em materialized view} is where you store tuples of a view in the database itself.
This way, database accesses to the materialized view can be faster than recomputing the view, especially when computing the view is expensive~\cite{Gupta1995}.
Materialized views for expensive queries would not work very well if you needed to recompute the materialized view whenever records were added or deleted, so past work has addressed how to efficiently maintain materialized views with incremental updates~\cite{Larson1985,Blakeley1986,Gupta1995,Zhou2007,Zhou2007a}.

In general, materialized views relate to our work because they create and update copies of data to decrease response time for expensive queries.
While materialized views can be constructed for arbitrary SQL query, PeerDB limits embedding only to fields for natural joins between documents.
By this it simplifies logic and addresses the most common use case.

\subsection{Column store architecture and database cracking}

As we are restructuring our data to optimize read time, column store databases and database cracking are related to our project.
A {\em column store} relational database stores the values for each attribute contiguously, unlike traditional row store databases that store attributes of a record contiguously~\cite{Stonebraker}.
Such column store databases allow the DBMS to read only the values of the columns required for processing a given query so that they perform better in read-mostly applications.
Products such as Synbase IQ and KDB have demonstrated that this architecture improves performance~\cite{Stonebraker,French1995}.

Idreos et al.~provide an optimization for column stores called {\em database cracking}~\cite{Pirk2007}.
In database cracking, a column $A$ is copied as $A_{CRK}$ when $A$ is first queried.
Then, $A_{CRK}$ is physically organized so that the values that satisfy the query are stored in contiguous space.
Thus, database cracking speeds up subsequent queries for similar values.
Follow up work proposed algorithms for updating cracked databases under high-volume insertions/deletions~\cite{Idreos2007}, and algorithms for increasing efficiency of tuple reconstruction for multi-attribute queries~\cite{Idreos2009}.

We are inspired by these successful methods for copying and reorganizing data to speed up read-heavy applications, but those methods target optimizations at the database level itself in a way how data is stored on permanent storage, while we are addressing how data itself is structured and send to the client to lower number of queries and query time.

\subsection{Adding functionality to non-relational databases}

NoSQL-style databases sacrifice ``one-size-fits-all'' functionality for speed~\cite{Strauch}.
Thus, programmers build extra functionality on top of the simple database to satisfy application-specific needs.
Many research papers detail systems that supplement NoSQL databases to support complex queries, ACID properties, and SLAs~\cite{Decandia2007,Chang,Beaver2010,Baker}.

To our knowledge, no existing academic work addresses how to decrease the number of database round trips required to resolve recursive object relations in document store databases such as MongoDB.
The MongoDB manual~\cite{MongoDB2014} and a NoSQL survey paper~\cite{Strauch} notes application categories that inform when to (A) embed related objects and when to (B) reference related objects.
However, many applications do not comfortably fit either category.
So, application developers wrote guides for normalizing objects so that you can both embed and reference objects~\cite{Wanschik2010}.
Unfortunately, following these guides requires a lot of effort and introduces opportunity for error.
Further, no one has quantified benefits or downsides of denormalizing objects in MongoDB.

\subsection{Automatically tuning databases}
Our work provides an algorithm to automatically suggest which fields to embed for a given workload.
Therefore, work on automatically tuning databases is relevant.

Several past works suggest methods for automatically selecting database indexes and materialized views~\cite{Goldstein2001,Zilio2004,Gupta2005,Yang1997,Agrawal2000}. We were particularly inspired by the work of Agrawal et. al. which enumerated many configurations of materialized views, then compared the configurations based on their expected impact on the sum of the cost of queries in a workload~\cite{Agrawal2000}. Many follow up papers followed a similar methodology~\cite{Goldstein2001,Zilio2004,Gupta2005}.

As these works are intended for specific SQL configurations, we could not directly apply these methods. However, our algorithm is structured similarly to the algorithm proposed in~\cite{Agrawal2000}: we enumerate many configurations of embedded fields, then compare these configurations based on their expected impact on the sum of the cost of reads and writes of a given workload.
We propose and evaluate a novel cost model to evaluate proposed configurations of embedded vs. referenced fields based on a given workload.

%\bibliography{papers}
% indexes and automatic decision
% TODO: Add related work about automatic algorithm