A Sinhala/English text search engine for Sinhala songs and lyrics
B L O D Hemachandra - 160197M
You must have the following prerequisites on your machine/server
- Clone or download the repository
- Open a terminal on the project root directory
- Run sh start.sh
- Wait a few minutes for the servers to initialize (you can check the status of the containers by running docker ps -a)
- Visit http://localhost:5000 in your browser
Once the application is running, you can use the web interface to search and browse Sinhala songs, artists, lyrics, and more.
To stop the servers and remove all container artifacts, run sh stop.sh
- Search by title - ආදරේ මන්දිරේ
- Search by artist - එඩ්වඩ් ජයකොඩි ගායනා කල ගී
- Search by lyrics - ඔබ සඳක් සේම පායා මට එලිය දුන් වගයි
- Search by genre - චිත්රපට ගීත
- Search by songwriter - රත්න ශ්රී ලියූ ගීත
- Faceted Search - Filter the search results based on artist, music, lyrics, genre
- Range Queries - ජනප්රියම ගීත 20, රත්න ශ්රී ලියූ හොඳම සිංදු 10, වික්ටර් රත්නායක ගායනා කළ හොඳම ගීත 10
- Search with synonyms - එඩ්වඩ් ජයකොඩි ගායනා කල ගී, එඩ්වඩ් ජයකොඩි ගැයූ සිංදු
- English support for searching (limited level) - Amaradewa top 10, Songs written by Rathna Sri
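Range queries such as "Amaradewa top 10" carry the number of results the user wants. The sketch below shows one way that number could be separated from the rest of the phrase. It is an illustration only: the function name `split_result_limit` and the default of 10 results are assumptions, not taken from the project.

```python
import re
from typing import Tuple

DEFAULT_ROWS = 10  # assumed default result count, not taken from the project


def split_result_limit(query: str, default: int = DEFAULT_ROWS) -> Tuple[str, int]:
    """Return (query without its first number, requested number of results)."""
    match = re.search(r"\d+", query)
    if match is None:
        # No number in the phrase: keep the text and fall back to the default.
        return " ".join(query.split()), default
    remaining = query[:match.start()] + " " + query[match.end():]
    # Collapse the whitespace left behind by removing the number.
    return " ".join(remaining.split()), int(match.group())


if __name__ == "__main__":
    print(split_result_limit("Amaradewa top 10"))
```

The returned number can then be used as the result limit of the search request, while the remaining text is matched against the song fields.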
| Folder | Description |
|---|---|
| UI | Flask server and web UI files |
| Data | Scraped Sinhala songs and DB mount folder |
| Scraper | Python web scraper |
| Solr-Engine | Apache Solr |
Song lyrics and their metadata were scraped from the Sinhala Song Book website. The website provides the following data fields for each song.
- Title (Sinhala & English)
- Artist (English)
- Genre (English)
- Lyrics (Sinhala)
- Songwriter (English)
- Music (English)
- Key
- Beat
- Views (Count)
The Artist, Genre, Songwriter and Music fields were translated to Sinhala so that searches work in both Sinhala and English.
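How the translation was carried out is not described here. As a minimal sketch, assuming a simple lookup table, a translated value could be stored next to the English one in each song record. The function name, the `_si` field suffix and the record layout are hypothetical; the single table entry pairs two spellings that both appear in the example queries above.

```python
from typing import Dict

# Hypothetical lookup table with one entry taken from the example queries.
SONGWRITER_SI: Dict[str, str] = {"Rathna Sri": "රත්න ශ්රී"}


def add_sinhala_field(song: dict, field: str, lookup: Dict[str, str]) -> dict:
    """Return a copy of `song` with '<field>_si' added when a translation is known."""
    enriched = dict(song)
    value = song.get(field)
    if value in lookup:
        enriched[field + "_si"] = lookup[value]
    return enriched


if __name__ == "__main__":
    song = {"title": "Example", "songwriter": "Rathna Sri"}
    print(add_sinhala_field(song, "songwriter", SONGWRITER_SI))
```

Keeping both values in the record lets the same song match a query written in either language.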
- Tokenization
- Rule-based classification
Rule-based classification is used to sort search queries into different categories of search. The search phrase is first split into tokens, and rules are then applied based on the keywords found among those tokens.
- Boosting
Boosting is used as a query optimization technique. Each song field in a search is boosted by a predefined value chosen according to the keywords in the search phrase.
- Inverted index
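The tokenization, rule-based classification and boosting steps can be sketched roughly as follows. This is not the project's actual rule set: the keywords are borrowed from the example queries in this README, while the field names, the boost values and all function names are assumptions made for illustration. The output uses the `field^boost` form accepted by Solr's `qf` parameter.

```python
from typing import Dict, List

# Illustrative rules only: keywords come from the example queries in this
# README; the field names and boost values are assumptions.
KEYWORD_FIELDS: Dict[str, str] = {
    "ගායනා": "artist",
    "ගැයූ": "artist",
    "ලියූ": "songwriter",
    "written": "songwriter",
}
DEFAULT_BOOSTS: Dict[str, float] = {
    "title": 1.0,
    "artist": 1.0,
    "songwriter": 1.0,
    "lyrics": 1.0,
}
KEYWORD_BOOST = 5.0


def tokenize(query: str) -> List[str]:
    """Split the search phrase on whitespace (lowercasing only affects English)."""
    return query.lower().split()


def classify(tokens: List[str]) -> List[str]:
    """Return the fields hinted at by keywords, in order of first appearance."""
    fields: List[str] = []
    for token in tokens:
        field = KEYWORD_FIELDS.get(token)
        if field is not None and field not in fields:
            fields.append(field)
    return fields


def boosts_for(query: str) -> Dict[str, float]:
    """Start from the default boosts and raise every field a keyword points to."""
    boosts = dict(DEFAULT_BOOSTS)
    for field in classify(tokenize(query)):
        boosts[field] = KEYWORD_BOOST
    return boosts


def to_qf(boosts: Dict[str, float]) -> str:
    """Format boosts as space-separated 'field^boost' pairs."""
    return " ".join(f"{field}^{boost:g}" for field, boost in boosts.items())


if __name__ == "__main__":
    print(to_qf(boosts_for("Songs written by Rathna Sri")))
```

With these assumed rules, a query containing "written" raises the weight of the songwriter field, so songs whose songwriter matches the remaining words rank above songs that only mention them in the lyrics.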
The following methods are used as querying techniques.
- Multi-match query
- Boolean query
- Aggregations - to create facets
- Sorting
- Range
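As a rough illustration of how these techniques could be combined in one request, the sketch below builds a Solr select URL, assuming the Solr engine listed in the folder table. Matching across several fields is approximated with the `edismax` parser and its `qf` parameter, Boolean restrictions with `fq` filters, facets with `facet.field`, sorting with `sort`, and the result count with `rows`. The host, port, core name `songs`, the field names and the function name are assumptions, and the code only builds the URL without sending it.

```python
from typing import Dict, Optional, Sequence
from urllib.parse import urlencode

# Assumed host, port and core name; adjust to the actual Solr setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/songs/select"


def build_select_url(
    text: str,
    qf: str,
    filters: Optional[Dict[str, str]] = None,
    facet_fields: Sequence[str] = (),
    sort_by_views: bool = False,
    rows: int = 10,
) -> str:
    """Build a select URL; values containing double quotes are not escaped here."""
    params = [
        ("q", text),
        ("defType", "edismax"),  # search the text across the fields in qf
        ("qf", qf),              # e.g. "title^1 artist^5"
        ("rows", str(rows)),     # number of results to return
    ]
    # Each filter restricts the result set to an exact field value.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Facet counts for the filter panel of the UI.
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    # "views" is an assumed field name for the view count.
    if sort_by_views:
        params.append(("sort", "views desc"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url("Amaradewa", "title^1 artist^5",
                           facet_fields=["artist", "genre"],
                           sort_by_views=True, rows=10))
```

Sorting by view count together with a small `rows` value is one plausible way to serve "top 10" style queries.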