For the following project we have chosen a basic but effective stack of technology.
For the database management we have chosen neo4j due to the following factors:
- Extensive plugin library (APOC was used)
- Updated documentation
- Scalability (operations with over 5 million entries in CSV and over 2 million nodes in graph)
- Flexibility (self-hosted, real-time updates)
- Cypher Language
To prepare the database command-line tools, we used python with neo4j library imported.
We have chosen python due to its simplicity and versatility. By implementing the neo4j library in the script, it's possible to interact with the database.
It is important to note that the whole project was made on a MacBook Pro with M1 Pro ARM architecture with macOS Sonoma 14.2.1. Results may vary for different systems, but the code was optimized for unix-like systems.
The main component of the project is the database. The command line tool in python interacts with the database through the neo4j library.
To ensure proper and fast setup and configuration the user should have the following ready:
Software modules: Python 3.11+, pip3, JDK17
Database: neo4j 20.0.0 +
Other Requirements: APOC plugin, chrome web browser, taxonomy.csv file, 2GB of free disk space
The above is the set of software the user is required to obtain prior to setup.
After installing all software from Prerequisites chapter perform the following steps.
To clone the repository to the desired folder use:
git clone https://github.com/vpofg/db-tool.git
From here on now, the directory for the python scripts is the one created with clone
First it is needed to allocate the installed APOC plugin in the neo4j directory. It is necessary that the .jar file is located in the $NEO4J_HOME\plugin directory
Next the .csv file containing all the data, needs to be put in the $NEO4J_HOME\import directory to access it later.
With python, pip and neo4j library installed navigate to the directory where thedbcli.py file is located and use the command:
python3 dbcli <task_number> <"node_name"> <"node_name_optional">To run the tool.
Important note: Please ensure that in the python script uri, username and password math the ones defined by the user. The default port for bolt connection is 7687, but it is possible that a user has a service already running on that port
In order to ensure proper run of the app, in the command line on your PC direct to the neo4j directory (defined during installation) and from there:
To run the console:
$NEO4J_HOME\bin\neo4j console
and to run as background service:
$NEO4J_HOME\bin\neo4j start
Where $NEO4J_HOME is the downloaded neo4j package.
We recommend running neo4j as a console in order not to overuse resources on the users PC.
After the console is started, in the web browser of your choice (preferably google chrome) go to: http://localhost:7474. You will be greeted with a dashboard for queries. The default username and password are both neo4j. Change the password to database. Now the project is setup and ready to be used.
As the project runs locally, the user has to populate the database with data themselves by employing two queries. First we have to upload the contents of the CSV file into the database. To do that the user has to place the correct path to the import directory of neo4j and run the following query:
CALL apoc.periodic.iterate(
'LOAD CSV FROM "file:///path_to_import_directory/taxonomy.csv" AS row RETURN row',
'WITH row, trim(row[0]) AS parentCategory, trim(row[1]) AS childCategory
WHERE parentCategory IS NOT NULL AND parentCategory <> "" AND childCategory IS NOT NULL AND childCategory <> ""
MERGE (parent:Category {name: parentCategory})
MERGE (child:Category {name: childCategory})
MERGE (child)-[:IS_SUBCATEGORY_OF]->(parent)',
{batchSize:1000, parallel:false}
);The process of filling the database should take around one minute and when its done you will see the message that around 5 million nodes where created.
If everything went correctly, it is possible to start using the dbcli tool. To use it, refer to "Preparing the CLI environment" in "Installation and Setup
An alternative way is to use the import.py script, but the user still has to correct the path in the file.
The database was designed on a simple premise based on the structure of the CSV file. The file is constructed in a way that each line represents a link between two nodes, where the first entry is the parent and the second is the child. We have decided to connect the nodes through IS_SUBCATEGORY_OF relationship. Each node in the dataset has a property name.The relationship looks as follows:
(child)-[:IS_SUBCATEGORY_OF]->(parent)
The import query is created using APOC plugin of neo4j to make use of the periodic iterate function which breaks the file import into smaller chunks which improves efficiency.
Then each entry is assigned a parent or child category following the pattern of the CSV "parent_node", "child_node" With the import done, we could go to creating queries for the 12 tasks.
In order to navigate through the data we had to create queries that solve 12 of the problems.
MATCH (parent:Category {name: $nodeName})<-[:IS_SUBCATEGORY_OF]-(child:Category)
RETURN child.name AS ChildName;
MATCH (parent:Category {name: $nodeName})<-[:IS_SUBCATEGORY_OF]-(child:Category)
RETURN count(child) AS numberOfChildren;
MATCH (grandparent:Category {name: $nodeName})<-[:IS_SUBCATEGORY_OF*2]-(grandchild:Category)
RETURN grandchild.name;
MATCH (child:Category {name: $nodeName})-[:IS_SUBCATEGORY_OF]->(parent:Category)
RETURN parent.name;
MATCH (child:Category {name: $nodeName})-[:IS_SUBCATEGORY_OF]->(parent:Category)
RETURN count(parent) AS numberOfParents;
MATCH (grandchild:Category {name: $nodeName})-[:IS_SUBCATEGORY_OF*2]->(grandparent:Category)
RETURN grandparent.name;
MATCH (c:Category)
RETURN count(distinct c.name) AS uniqueNodesCount;
MATCH (root:Category)
WHERE NOT (root)-[:IS_SUBCATEGORY_OF]->(:Category)
RETURN root.name;
MATCH (parent:Category)<-[:IS_SUBCATEGORY_OF]-(child:Category)
RETURN parent.name, count(child) AS numberOfChildren
ORDER BY numberOfChildren DESC
LIMIT 10;
Finds nodes with the least children (number of children is greater than zero), there could be more the one
MATCH (parent:Category)<-[:IS_SUBCATEGORY_OF]-(child:Category)
WITH parent, count(child) AS numberOfChildren
WHERE numberOfChildren > 0
RETURN parent.name, numberOfChildren
ORDER BY numberOfChildren ASC
LIMIT 10;
MATCH (n:Category {name: $oldName})
SET n.name = $newName;
MATCH path = (start:Category {name: $startNode})-[:IS_SUBCATEGORY_OF*]-(end:Category {name: $endNode})
RETURN path;
The command-line tool was created in python using the neo4j library in python in order to ensure connection to the DB and run queries.
Important Note: To use the dbcli, neo4j client has to be running as in User Manual
Each task has its own function defined which takes in appropriate number of arguments and runs the query in the DB and later displays the result to the user.
Each query is surrounded by timestamps using ```per_counter()``` function
def find_children(driver, node_name):
with driver.session() as session:
t1_start = perf_counter()
result = session.run(
"MATCH (parent:Category {name: $nodeName})<-[:IS_SUBCATEGORY_OF]-(child:Category) "
"RETURN child.name",
nodeName=node_name
)
t1_end = perf_counter()
for record in result:
print(record["child.name"])
print(t1_end - t1_start)We can observe that when the session is connected, the query is executed and two timestamps are created before and after execution and then subtracted to print the outcome. The perf_counter() function measures the performance of the program not relying on wall-clock time but on the high-resolution timers such as CPU timestamp counter.
For improving the efficiency, a simple indexing was implemented
CREATE INDEX FOR (c:Category) ON (c.name);
The above, improves search by name, which is the most common in all the designed queries.
- Setting up the database
- Designing the query for data upload
- The query was created and uploaded, there were many versions of the query as working with a big data model like this one was demanding for the system.
- First 6 queries
- Documentation
- Designing last 6 queries
- Designing the python script
- The script had to be implemented in a way that the command in terminal takes in predefined values and uses appropriate function. The program also takes note and calculates time of execution.
- Help with documentation
- Final presentation
By following the instructions closely, any user should obtain similar results, altough it is worthy to note that the computer hardware might influence the times of execution.
If all of the previous steps from Installation and Setup and User Manual where reproduced corectly the user should be able to use the dbcli tool on the database. To recap:
- Connect to the DB
- Direct to
$NEO4J_HOME\bin\neo4j consoleor$NEO4J_HOME\bin\neo4j startto initialize the DB service - Go to
http://localhost:7474 - Enter the credentials
bolt://localhost7687user:neo4jpassword:neo4j(has to be changed to database)
- Direct to
- Apply the import query defined in User Manual
- Initialize dbcli tool
- In the terminal/cmd direct to the directory where
dbcli.pyfile is located - to use the tool write
python3 dbcli.py ex_number "argument_1" "argument_2"
- In the terminal/cmd direct to the directory where
Below are the result of running each query and checking the times with the perf_counter() function.
| Query Number | T1 (in ms) | T2 | T3 |
|---|---|---|---|
| 1 | 19.02 | 32.24 | 13.28 |
| 2 | 19.12 | 12.38 | 14.64 |
| 3 | 80.44 | 28.81 | 13.61 |
| 4 | 68.82 | 13.52 | 12.56 |
| 5 | 54.04 | 10.93 | 11.28 |
| 6 | 58.18 | 13.86 | 12.49 |
| 7 | 60.45 | 18.35 | 11.20 |
| 8 | 87.42 | 11.69 | 16.10 |
| 9 | 13.11 | 17.10 | 15.38 |
| 10 | 11.91 | 11.65 | 10.28 |
| 11 | 164.22 | 21.60 | 10.71 |
| 12 | 63.53 | 9.12 | 7.26 |
For each task the node "1880s_films" was used, for
The tech stakc chosen was relativelly basic, but has offered a stable way to manage the database. The biggest issue we had was with the csv loading query. We have started with just a simple one line query using LOAD CSV FROM and we gradualy made improvements to accomodate for such a grand dataset. We have ended up with time of around 192 ms for starting streaming 1 record and around 62204 ms for completing the whole operation.
The choice of python was relatively easy, as it is an easy to develop language for scripts and also works on most machines. Being non-compiler based also helped with deployment of the tool.
The queries designed where easy for the most part. We have had some trobles with query number 9, as we did not know how to accomodate for possily multiple nodes with the same amount of children. It was a problem we did not manage to fix and right now the query does display 10 of the biggest queries but we cannot be sure that it doesn't loose some nodes.
The upload speed is rather in the middle ground. 6 minutes is rather a very average time for workig with a dataset of this size. When it comes to query execution, the times vary for each run and depending on the node we are working with in each task.
The project is surely lacking in some spaces and we belive that it could be imporved by introducing some changes and extensions.
One idea to expand would be hosting the database in the cloud and minimizing the need for user interference also some scripts could be more automated.
Also some of the queires could be improved to accomodate for some outliers in the data, or extrodinary events such as query number 9.# db-tool