This program creates, populates and queries a SQL relational database with two tables containing information about movies and their genres acquired from IMDb.
- Introduction
- Features
- Requirements
- Installation
- Usage
- Code Explanation
- Observations
- Summary and Suggestions
## Introduction

This program's purpose is to utilize an existing Go package for SQLite to create a SQL relational database, populate the database with two tables, and finally query the database and report the results. The database is created using this package. The database is populated with data acquired from the IMDb archive that Northwestern University maintains. This archive can be found here.
## Features

- Creates a temporary directory for the database.
- Creates the schema for the database.
- Populates the movies and movies_genres tables.
- Provides two example query functions.
- Logs the results of the queries.
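A minimal sketch of how these steps might fit together in `main` (a simplified outline based on the function names shown in the Code Explanation section, not the exact implementation; the temp-directory handling via `os.MkdirTemp` is my assumption):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Create a temporary directory to hold the database file
	tempDir, err := os.MkdirTemp("", "movieDB")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(tempDir)
	movieDbFile := filepath.Join(tempDir, "movies.db")

	// Create the database and the schema
	movies, moviesGenres, err := newSchema(movieDbFile)
	if err != nil {
		log.Fatal(err)
	}

	// Populate both tables from the CSV files
	if err := movies.populateMovies(); err != nil {
		log.Fatal(err)
	}
	if err := moviesGenres.populateMoviesGenres(); err != nil {
		log.Fatal(err)
	}

	// Run an example query and log the results
	if err := queryDbHighestRatedGenres(movies.db); err != nil {
		log.Fatal(err)
	}
}
```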
## Requirements

Ensure that Go is installed. You can install Go here.
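You can verify the installation by running:

```bash
go version
```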
This program requires the installation of the sqlite package. The repository for this package can be found here.
## Installation

To install the package, open a terminal and run:
```bash
go get modernc.org/sqlite
```

This will install the package, allowing you to import it into the program without issues.
Note that the program has dependencies that need to be included in the go.mod file. However, running `go mod tidy` will automatically add the required lines to the file if you have included the package in the import section of main.go.
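For reference, after running `go mod tidy` the go.mod file should contain a require line along these lines (the Go version and package version here are illustrative, and `go mod tidy` will also add several indirect dependencies):

```
module github.com/hamodikk/movieDB

go 1.21

require modernc.org/sqlite v1.29.0
```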
## Usage

To use the program, first clone the repository into your directory:
```bash
git clone https://github.com/hamodikk/movieDB.git
```

Change directory to the repository:
```bash
cd <path/to/repository/directory>
```

You can run the code either from the executable or by running the following in your terminal:
```bash
go run .\main.go
```

## Code Explanation

This program consists of one helper function that validates the headers of the CSV files, one function that creates the database and the schema, two functions that populate the movies and movies_genres tables, and lastly, two functions that query the database.
- Helper function to validate the CSV file headers:

```go
func validateHeaders(tableName string, headers []string) bool {
var expectedHeaders []string
// switch statement to handle different csv files
switch tableName {
case "movies":
expectedHeaders = []string{"id", "name", "year", "rank"}
case "movies_genres":
expectedHeaders = []string{"movie_id", "genre"}
default:
return false
}
// Match the headers of csv with the expected headers
return strings.Join(headers, ",") == strings.Join(expectedHeaders, ",")
}
```

- Function to create the database and the schema:

```go
func newSchema(movieDbFile string) (*Movies, *Movies_genres, error) {
// Schema creation statement
schema := `
CREATE TABLE movies (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
year INTEGER NOT NULL,
rank REAL
);
CREATE TABLE movies_genres (
movie_id INTEGER NOT NULL,
genre TEXT NOT NULL,
FOREIGN KEY (movie_id) REFERENCES movies(id)
);
`
// Error handling for file open and statement execution
db, err := sql.Open("sqlite", movieDbFile)
if err != nil {
fmt.Println("Error opening database:", err)
log.Printf("Error opening database: %v\n", err)
return nil, nil, err
}
if _, err := db.Exec(schema); err != nil {
db.Close()
fmt.Println("Error creating schema:", err)
log.Printf("Error creating schema: %v\n", err)
return nil, nil, err
}
return &Movies{
db: db,
}, &Movies_genres{
db: db,
}, nil
}
```

- Function to populate the tables. The functions that populate the movies and movies_genres tables are very similar, so I will only show the one for the movies table.

```go
func (m *Movies) populateMovies() error {
// Open the CSV file
moviesCSV, err := os.Open("001-IMDb/IMDB-movies.csv")
if err != nil {
fmt.Println("Error opening CSV file", err)
log.Printf("Error opening CSV file: %v\n", err)
return err
}
defer moviesCSV.Close()
```

- Initialize the reader. Setting `LazyQuotes` allows unescaped double quotes, preventing errors during population.

```go
// Init csv reader
moviesReader := csv.NewReader(moviesCSV)
// I was losing about 400 rows due to unescaped
// double quotes, so I set LazyQuotes to accept these rows.
moviesReader.LazyQuotes = true
```

- Read the headers and validate them.

```go
// Read the headers
moviesHeader, err := moviesReader.Read()
if err != nil {
fmt.Println("Error reading CSV header", err)
log.Printf("Error reading CSV header: %v\n", err)
return err
}
// Validate the headers
if !validateHeaders("movies", moviesHeader) {
fmt.Println("Unexpected CSV headers")
log.Println("Unexpected CSV headers")
return fmt.Errorf("unexpected CSV headers")
}
```

- Start a transaction. This allows all the insert commands to be committed at the same time instead of individually, improving performance.

```go
// Start a transaction
tx, err := m.db.Begin()
if err != nil {
fmt.Println("Error starting transaction:", err)
log.Printf("Error starting transaction: %v\n", err)
return err
}
// Roll back if the function returns early; calling Rollback
// after a successful Commit is a harmless no-op (sql.ErrTxDone).
defer tx.Rollback()
```

- Create batches and append rows to the batch to execute multiple rows as a single `INSERT` statement.

```go
// Create an interface to hold the batch values
const batchSize = 100
values := make([]interface{}, 0, batchSize*4)
insertStmt := "INSERT INTO movies (id, name, year, rank) VALUES"
validRowCount := 0
totalMovies := 0
// Read the rest of the rows, skip the problematic rows and insert the rest into the database
rowNumber := 1
for {
moviesRecord, err := moviesReader.Read()
if err != nil {
if err == io.EOF {
break
}
fmt.Printf("Skipping problematic row %d: %v\n", rowNumber, err)
log.Printf("Skipping problematic row %d: %v\n", rowNumber, err)
rowNumber++
continue
}
// Append the values to the interface
values = append(values, moviesRecord[0], moviesRecord[1], moviesRecord[2], moviesRecord[3])
insertStmt += "(?, ?, ?, ?),"
validRowCount++
// When the rows reach the batch size, put the insert statement into the transaction
if validRowCount == batchSize {
_, err := tx.Exec(insertStmt[:len(insertStmt)-1], values...)
if err != nil {
fmt.Printf("Error inserting batch at row %d: %v", rowNumber, err)
log.Printf("Error inserting batch at row %d: %v", rowNumber, err)
return err
}
totalMovies += batchSize
// Reset the interface for next batch
values = values[:0]
insertStmt = "INSERT INTO movies (id, name, year, rank) VALUES"
validRowCount = 0
}
rowNumber++
}
```

- `INSERT` the remaining rows that don't add up to the batch size (fewer than 100 rows).

```go
// Insert the remaining values < batchSize
if len(values) > 0 {
_, err := tx.Exec(insertStmt[:len(insertStmt)-1], values...)
if err != nil {
fmt.Printf("Error inserting the remaining batch: %v", err)
log.Printf("Error inserting the remaining batch: %v", err)
return err
}
totalMovies += len(values) / 4
}
```

- Commit the transaction.

```go
// Commit the tx
if err := tx.Commit(); err != nil {
fmt.Println("Error committing transaction:", err)
log.Printf("Error committing transaction: %v", err)
return err
}
fmt.Println("Movies table populated successfully")
fmt.Printf("Total movies inserted: %d\n", totalMovies)
log.Println("Movies table populated successfully")
log.Printf("Total movies inserted: %d\n", totalMovies)
return nil
}
```

- Function to query the database. The two query functions are, again, very similar, so I will only show the one that reports the highest-rated genres.
- Create the query statement.

```go
func queryDbHighestRatedGenres(db *sql.DB) error {
query := `
SELECT
mg.genre,
AVG(m.rank) AS avg_rank,
COUNT(m.id) AS movie_count
FROM
movies_genres mg
JOIN
movies m
ON
mg.movie_id = m.id
WHERE
m.rank IS NOT NULL AND m.rank != 'NULL'
GROUP BY
mg.genre
ORDER BY
avg_rank DESC
LIMIT 20;
`
```

- Execute the statement and handle any errors.

```go
// Debug
fmt.Println("Executing query...")
rows, err := db.Query(query)
if err != nil {
log.Printf("error querying database: %v\n", err)
return fmt.Errorf("error querying database: %v", err)
}
defer rows.Close()
```

- Scan and print the results row by row, with additional error handling while iterating the rows, as well as logging of the results.

```go
fmt.Printf("Top 20 highest rated genres:\n")
fmt.Printf("%-20s %-10s %-10s\n", "Genre", "Avg Rating", "Movie Count")
fmt.Println(strings.Repeat("-", 40))
log.Printf("Top 20 highest rated genres:\n")
log.Printf("%-20s %-10s %-10s\n", "Genre", "Avg Rating", "Movie Count")
log.Println(strings.Repeat("-", 40))
rowCount := 0
for rows.Next() {
var genre string
var avgRating *float64
var movieCount int
if err := rows.Scan(&genre, &avgRating, &movieCount); err != nil {
fmt.Printf("error scanning row: %v", err)
log.Printf("error scanning row: %v", err)
continue
}
fmt.Printf("%-20s %-10.2f %-10d\n", genre, *avgRating, movieCount)
log.Printf("%-20s %-10.2f %-10d\n", genre, *avgRating, movieCount)
rowCount++
}
// Check for errors during iteration
if err := rows.Err(); err != nil {
log.Printf("error iterating rows: %v\n", err)
return fmt.Errorf("error iterating rows: %v", err)
}
// log number of rows processed
if rowCount == 0 {
fmt.Println("No rows found")
log.Println("No rows found")
} else {
fmt.Printf("Total rows processed: %d\n", rowCount)
log.Printf("Total rows processed: %d\n", rowCount)
}
return nil
}
```

## Observations

I have implemented this program in parts, committing each update to the repository. Initially, I implemented only the generation of the schema, which was rather straightforward, as I followed the example given in the sqlite repository.
Implementing the `populateMovies` and `populateMoviesGenres` functions was challenging. I took a similar approach to my csv-to-json converter in terms of reading the CSV files and validating the headers. My first approach was to execute each `INSERT` on a row-by-row basis, which was extremely slow: running just `populateMovies` would take about 30-40 minutes. I then implemented transactions in my functions. Transactions allowed me to commit insertions together, reducing the overhead of accessing the database and committing individual lines. This significantly improved performance, bringing the time to populate the movies table down to about 1-2 minutes. I wanted to improve performance further so that the database could be populated and queried within seconds. Implementing batching allowed me to execute the `INSERT` statements in batches. For example, instead of the SQL commands looking like this without batching:

```sql
INSERT INTO movies (id, name, year, rank) VALUES (1, 'Movie 1', 2020, 6.3);
INSERT INTO movies (id, name, year, rank) VALUES (2, 'Movie 2', 2022, 8.0);
.
.
.
INSERT INTO movies (id, name, year, rank) VALUES (100, 'Movie 1', 2019, 4.7);
```

it would instead look like:

```sql
INSERT INTO movies (id, name, year, rank) VALUES
(1, 'Movie 1', 2020, 6.3),
(2, 'Movie 2', 2022, 8.0),
.
.
.
(100, 'Movie 1', 2019, 4.7);
```

This reduces the number of executions and communications with the database, improving performance significantly.
It is important to note that performance could be improved further by increasing the batch size. A batch size of 100 means 3952 `INSERT` commands, whereas a batch size of 1000 would mean 658. Of course, this could also be fine-tuned with hybrid batching: use batches of 10000, then 1000, and finally 100 as the larger batch sizes can no longer be filled.
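A sketch of what that hybrid draining logic could look like, assuming a hypothetical `flushBatch` helper that builds and executes the multi-row `INSERT` for a slice of buffered rows (this is an illustration, not code from the repository):

```go
// Hypothetical helper: drain buffered rows with progressively
// smaller batch sizes (10000, then 1000, then 100), so most rows
// are inserted in large batches and the tail is still batched.
func flushHybrid(tx *sql.Tx, pending [][]interface{}) error {
	for _, size := range []int{10000, 1000, 100} {
		for len(pending) >= size {
			if err := flushBatch(tx, pending[:size]); err != nil {
				return err
			}
			pending = pending[size:]
		}
	}
	// Insert whatever is left (< 100 rows) as one final batch.
	if len(pending) > 0 {
		return flushBatch(tx, pending)
	}
	return nil
}
```

One caveat: SQLite caps the number of bound parameters per statement (commonly 32766 in current builds), so a 10000-row batch with 4 columns would exceed that limit and the largest batch size would need to be chosen accordingly.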
Another important point is that batching is made possible by handling the potential errors, as well as by setting `moviesReader.LazyQuotes = true`. I found that practically all of the errors I encountered while populating were caused by unescaped double quotes, either in the movie titles or in the year column. Allowing lazy quotes prevents errors while populating the tables with rows that contain unescaped double quotes.
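To illustrate the effect, here is a self-contained example with made-up data, showing that a row with an unescaped double quote fails with the default reader but parses once `LazyQuotes` is set:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// A made-up row with an unescaped double quote in the title
	data := `id,name,year,rank
1,The "Unquoted" Movie,1999,7.1`

	// Default reader: rejects the bare quotes with a parse error
	strict := csv.NewReader(strings.NewReader(data))
	_, err := strict.ReadAll()
	fmt.Println("strict:", err)

	// LazyQuotes: accepts the same row
	lazy := csv.NewReader(strings.NewReader(data))
	lazy.LazyQuotes = true
	rows, err := lazy.ReadAll()
	fmt.Println("lazy:", rows, err)
}
```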
It was also challenging to implement the query functions. It is difficult not having the advantage of SQL software like Postgres, where I can see my query results immediately. Other than fixing some errors in my query statement, I didn't need any major updates to my query functions: each is basically an execution of the query, followed by scanning the results and printing them out for reporting. One important thing to note is that the movie counts for the genres are much lower for the "highest rated genres" query than for the "movie count per genre" query. This is because the majority of the movies have "NULL" as a rating, and since I take the average of the rating values, I have to filter these rows out when calculating the highest rated genres.
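For comparison, the companion "movie count per genre" query would look something like this (a sketch consistent with the schema above, not the exact statement from the repository):

```sql
SELECT
    mg.genre,
    COUNT(m.id) AS movie_count
FROM
    movies_genres mg
JOIN
    movies m ON mg.movie_id = m.id
GROUP BY
    mg.genre
ORDER BY
    movie_count DESC;
```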
Moving forward, I could implement a third table with my own collection of movies. This would be as simple as creating my own movie collection, with movie names, locations, and personal ratings, in an Excel file and saving it as a CSV file. I could integrate this personal collection with the existing tables by using a foreign key in the CSV file that corresponds to the movie's id in the movies table. I could then update the existing schema to include the personal movies table and populate it with the CSV file generated from my personal collection.
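A sketch of what that third table could look like (the table and column names here are my own illustration):

```sql
-- SQLite does not enforce foreign keys unless this is
-- enabled on the connection.
PRAGMA foreign_keys = ON;

CREATE TABLE personal_collection (
    movie_id INTEGER NOT NULL,
    location TEXT,            -- e.g. shelf, streaming service, drive path
    personal_rating REAL,
    FOREIGN KEY (movie_id) REFERENCES movies(id)
);
```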
A useful personal movie database can serve not only as an organizational tool to keep track of movies a person has watched, but also as a tool to explore movies they haven't watched before and might find interesting. For example, I recently saw a video of a person who has watched thousands of movies and keeps track of them in an Excel sheet, printing an updated version of it each year. A database could enhance that experience by adding the ability to query the movies a person has watched. Additionally, this database could be used to explore movies that a person hasn't watched yet. Such a database could have the personal collection table as well as an updated table of current movies on IMDb, and the user could search for an unwatched movie that shares similarities with a specific movie they have watched. The database could then query for movies that have similar genres or actors, or that were made around the same time, and provide these movies to the user as suggestions.
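As one concrete example of such a suggestion query, the sketch below (using the two existing tables plus the hypothetical personal_collection table from above) would surface unwatched movies that share a genre with a given watched movie and were released within a few years of it:

```sql
SELECT DISTINCT m2.id, m2.name, m2.year, m2.rank
FROM movies m1
JOIN movies_genres g1 ON g1.movie_id = m1.id
JOIN movies_genres g2 ON g2.genre = g1.genre
JOIN movies m2 ON m2.id = g2.movie_id
WHERE m1.id = :watched_movie_id
  AND m2.id != m1.id
  AND ABS(m2.year - m1.year) <= 3
  AND m2.id NOT IN (SELECT movie_id FROM personal_collection)
ORDER BY m2.rank DESC
LIMIT 10;
```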
## Summary and Suggestions

This program creates, populates, and queries a traditional SQL relational database containing two tables with movies and movie genres.

The program could be enhanced by adding more tables, including actors, directors, and roles, as well as a personal collection table. We could also improve efficiency further by increasing the batch size to better accommodate even larger datasets.