59 commits
8d44544
README updated. Running help added
Jan 2, 2013
422018c
added license file for imdbparser. #1
xaph Jan 2, 2013
6fbf7c0
Changes regex and hardcoded csv to seperator
aykutakin Jan 3, 2013
c2d056e
Adds genre parser
aykutakin Jan 4, 2013
2bf5032
Adds rating parser
aykutakin Jan 5, 2013
5a23522
#1 added full text of GPLv3
xaph Jan 6, 2013
3114bed
closes #1 project is GPLv3 now.
xaph Jan 6, 2013
fd43af0
started to convert filehandler to object
xaph Jan 6, 2013
4ef0b08
Adds directors parser and improves #TITLE regex
aykutakin Jan 8, 2013
fce0432
Fixes skipped rows numbers and flow
aykutakin Jan 8, 2013
e7d3cad
refactor Parsing classes and CLI arguments
destan Jan 10, 2013
d3d5067
fix help text: executing section
destan Jan 10, 2013
78a8d3f
updated example settings file to use all active lists
xaph Jan 12, 2013
ed0f417
removed real paths from example file
xaph Jan 12, 2013
9148a42
added database config to example settings file #11
xaph Jan 12, 2013
07684ef
parse_all method splitted as parse_one
xaph Jan 14, 2013
46921cb
file handler objectified
xaph Jan 14, 2013
6cb2e7c
parsers uses objectified file handler
xaph Jan 14, 2013
7360ce3
Fixes encoding problem & improves director regex
aykutakin Jan 15, 2013
abdc237
adds end of data separator and #fixes 13
aykutakin Jan 20, 2013
f8fdeb6
fixes #5 and fixes #4
aykutakin Jan 20, 2013
6a4c34d
fixes #16 and fixes #15
aykutakin Jan 31, 2013
39e2074
fixes #14
aykutakin Jan 31, 2013
ecb0352
Adds time to folder and initial commit for trivia
aykutakin Jan 31, 2013
9eca8f8
Adds trivia parser
aykutakin Feb 9, 2013
d46dfb4
fix 'No such file or directory' error
destan Feb 26, 2013
2c5b96f
add comment to debugging snippet, format comments
destan Feb 27, 2013
d791cb9
add comments
destan Feb 27, 2013
438c98b
partial fix #11: db class implemented
xaph Mar 3, 2013
ca56e05
add python-google-api-python-client dependency to README
yasakbulut Mar 5, 2013
0f34982
fix wrong package name. doh!
yasakbulut Mar 5, 2013
48554a1
add initial version of freebaseagent.
yasakbulut Mar 5, 2013
b0f067b
remove unneccessary dependency
yasakbulut Mar 7, 2013
9457f95
add comments, missing parameter in getImdbId()
yasakbulut Mar 7, 2013
9cd9b4c
close #12 use decorators to measure parsing durations
destan Mar 9, 2013
de75c86
db interactions removed
xaph Mar 9, 2013
ef4c6c0
#26 #30 alpha stage of db output
xaph Mar 10, 2013
b191b32
#26 movie names escaped for sql inserts
xaph Mar 10, 2013
d9e1100
updated output directory path. removed space character from directory…
xaph Mar 10, 2013
75cee99
closes #31 file creation moved to base parser. base parser checks whi…
xaph Mar 10, 2013
91d9d36
fix args logs
destan Mar 10, 2013
1d65cb1
edit 'finish log' to be more informative
destan Mar 10, 2013
bba3d31
add log to print outputfolder at the end
destan Mar 10, 2013
989c6c3
fix #35 plot parser produces 0 byte in tsv mode
destan Mar 10, 2013
d9bb083
add -u option and smth about dump files to README
destan Mar 10, 2013
d4149d1
fix typo in -h help text
destan Mar 10, 2013
e1f1d96
closes #26 sql dumps for movies added.
xaph Mar 10, 2013
7afe94b
#25 genres SQL added.
xaph Mar 10, 2013
81d5c98
adds more parametric db operations
Mar 16, 2013
6807407
adds more parametric db operations
Mar 16, 2013
19a2859
tries to implement naming convention
Mar 18, 2013
1598830
Adds sql scripts
aykutakin Mar 22, 2013
c882428
Adds commit to sql files
aykutakin Mar 25, 2013
7aa48a7
Fix sql file bug
aykutakin Apr 14, 2013
67c094c
Some files cannot be opened with utf-8
aykutakin May 12, 2013
f9e1bcd
'FileHandler' has no attribute 'get_full_path' error fixed.
Jun 29, 2013
179b941
Fixed path string building
mairbek Jul 26, 2013
a0640e5
Merge pull request #36 from mairbek/fixpath
xaph Jul 26, 2013
20671ab
closes #7 writing processed lines to console increases time too much
aykutakin Aug 3, 2013
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

35 changes: 31 additions & 4 deletions README.md
@@ -4,9 +4,12 @@ imdb-data-parser
Parses the IMDB dumps into CSV and Relational Database insert queries
Uses IMDB dumps from: http://www.imdb.com/interfaces

imdb-data-parser is a free software licensed by GPLv3.


Requirements
================
Python 3.x
* Python 3.x

Configuring
================
@@ -18,9 +21,33 @@ You need to copy this file as `settings.py` and edit this file before running th
    cp settings.py.example settings.py
    your_favourite_editor settings.py

Execute
-------
You also need to have dump files at `INPUT_DIR` and you can download dump files from one of the FTP addresses on http://www.imdb.com/interfaces.

Alternatively, you can make `imdb-data-parser` download the dumps for you by passing the `-u` argument:

    ~/imdb-data-parser$ ./imdbparser.py -u

~/imdb-data-parser$ python3 imdbparser.py
Executing
---------

    ~/imdb-data-parser$ ./imdbparser.py

Use the `-h` parameter to see the list of optional arguments:

    ~/imdb-data-parser$ ./imdbparser.py -h

SQL Dumps
---------
You can use mode parameter to create SQL dumps

~/imdb-data-parser$ ./imdbparser.py -h

The default MySQL configuration doesn't allow inserting more than 16MB of data. You need to increase MySQL's `max_allowed_packet` size to import the SQL dumps.

    max_allowed_packet = 256M

For now, the movies data also includes series, videos, and TV shows. You can exclude them with this command:

    grep -v '("\\"' movies.list.sql | grep -v '\\(VG\\)' | grep -v "\\(TV\\)" | grep -v "{" | grep -v "????" | grep -v "(V\\\)" > movies.sql

Note: SQL dumps have been tested only with MySQL.
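
The exclusion rule behind the grep pipeline above can also be written as a Python predicate. This is a minimal sketch and not part of the project: the name `is_plain_movie` is hypothetical, and it assumes it receives raw IMDb titles rather than lines of the generated SQL file, where quotes are escaped. The reading of each marker (a leading quote for series, `(VG)` for video games, `(TV)` and `(V)` for TV and video releases, `{` for episode info, `????` for an unknown year) is an assumption based on the README's description.

```python
import re

# Markers mirrored from the grep pipeline; their meaning is an assumption.
_EXCLUDED = re.compile(r'^"|\(VG\)|\(TV\)|\(V\)|\{|\?\?\?\?')


def is_plain_movie(title):
    """Return True when the title carries none of the excluded markers."""
    return _EXCLUDED.search(title) is None


# Invented sample titles, only for illustration.
titles = ['Some Movie (2001)', '"Some Show" (2005)', 'Some Game (2003) (VG)']
movies = [t for t in titles if is_plain_movie(t)]
```

Filtering before the SQL file is written would avoid depending on how titles are escaped in the dump, which is what the grep patterns have to work around.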
109 changes: 109 additions & 0 deletions idp/parser/actorsparser.py
@@ -0,0 +1,109 @@
"""
This file is part of imdb-data-parser.

imdb-data-parser is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

imdb-data-parser is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with imdb-data-parser. If not, see <http://www.gnu.org/licenses/>.
"""

from .baseparser import *


class ActorsParser(BaseParser):
    r"""
    RegExp: /(.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$/gm
    pattern: (.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$
    flags: gm
    12 capturing groups:
        group 1: (.*?)                          surname, name
        group 2: #TITLE (UNIQUE KEY)
        group 3: (.*? \(\S{4,}\))               movie name + year
        group 4: (\(\S+\))                      type ex:(TV)
        group 5: (\{(.*?) ?(\(\S+?\))?\})       series info ex: {Ally Abroad (#3.1)}
        group 6: (.*?)                          episode name ex: Ally Abroad
        group 7: (\(\S+?\))                     episode number ex: (#3.1)
        group 8: (\{\{SUSPENDED\}\})            is suspended?
        group 9: (\(.*?\))                      info 1
        group 10: (\(.*\))                      info 2
        group 11: (\[.*\])                      role
        group 12: (<.*>)
    """

    # properties
    # Raw string, so the regex escapes reach the re module unchanged.
    base_matcher_pattern = r'(.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$'
    input_file_name = "actors.list"
    number_of_lines_to_be_skipped = 239
    db_table_info = {
        'tablename' : 'actors',
        'columns' : [
            {'colname' : 'name', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'surname', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'title', 'colinfo' : DbScriptHelper.keywords['string'] + '(255) NOT NULL'},
            {'colname' : 'info_1', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'info_2', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'role', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'}
        ],
        'constraints' : 'PRIMARY KEY(title)'
    }
    end_of_dump_delimiter = ""

    name = ""
    surname = ""

    def __init__(self, preferences_map):
        super(ActorsParser, self).__init__(preferences_map)
        self.first_one = True

    def parse_into_tsv(self, matcher):
        is_match = matcher.match(self.base_matcher_pattern)

        if(is_match):
            if(len(matcher.group(1).strip()) > 0):
                namelist = matcher.group(1).split(', ')
                if(len(namelist) == 2):
                    self.name = namelist[1]
                    self.surname = namelist[0]
                else:
                    self.name = namelist[0]
                    self.surname = ""

            self.tsv_file.write(self.name + self.seperator + self.surname + self.seperator + self.concat_regex_groups([2,9,10,11], None, matcher) + "\n")
        elif(len(matcher.get_last_string()) == 1):
            pass
        else:
            logging.critical("This line is fucked up: " + matcher.get_last_string())
            self.fucked_up_count += 1

    def parse_into_db(self, matcher):
        is_match = matcher.match(self.base_matcher_pattern)

        if(is_match):
            if(len(matcher.group(1).strip()) > 0):
                namelist = matcher.group(1).split(', ')
                if(len(namelist) == 2):
                    self.name = namelist[1]
                    self.surname = namelist[0]
                else:
                    self.name = namelist[0]
                    self.surname = ""

            if(self.first_one):
                self.sql_file.write("(\"" + self.name + "\", \"" + self.surname + "\", " + self.concat_regex_groups([2,9,10,11], [2,3,4,5], matcher) + ")")
                self.first_one = False
            else:
                self.sql_file.write(",\n(\"" + self.name + "\", \"" + self.surname + "\", " + self.concat_regex_groups([2,9,10,11], [2,3,4,5], matcher) + ")")

        elif(len(matcher.get_last_string()) == 1):
            pass
        else:
            logging.critical("This line is fucked up: " + matcher.get_last_string())
            self.fucked_up_count += 1
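
To make the behaviour of the pattern in `actorsparser.py` easier to follow, here is a minimal standalone sketch that applies the same regular expression with Python's `re` module. It is not the project's code: `parse_lines` and the sample lines are invented for illustration, and the project's own matcher object and `concat_regex_groups` helper are deliberately left out. It shows two things: which groups capture the title and the role, and how a continuation line with an empty name column reuses the name from the previous line.

```python
import re

# Same pattern as base_matcher_pattern, written as raw string pieces.
PATTERN = re.compile(
    r'(.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})'
    r'(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)'
    r'\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$'
)

# Invented sample lines; the second is a continuation line with no name.
SAMPLE_LINES = [
    "Doe, John\tSome Movie (2001)  (as Johnny)  [Himself]  <3>",
    "\t\t\tAnother Movie (1999)  [Waiter]",
]


def parse_lines(lines):
    """Return (name, surname, title, role) tuples, carrying the name forward."""
    name, surname = "", ""
    rows = []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # the real parser logs and counts these lines instead
        if match.group(1).strip():
            parts = match.group(1).split(", ")
            if len(parts) == 2:
                surname, name = parts
            else:
                name, surname = parts[0], ""
        # Group 2 may keep trailing spaces, so it is stripped here.
        rows.append((name, surname, match.group(2).strip(), match.group(11)))
    return rows


rows = parse_lines(SAMPLE_LINES)
```

Both rows come back with the name `John` and surname `Doe`, because the second line leaves group 1 empty and the previous values are kept.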
109 changes: 109 additions & 0 deletions idp/parser/actressesparser.py
@@ -0,0 +1,109 @@
"""
This file is part of imdb-data-parser.

imdb-data-parser is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

imdb-data-parser is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with imdb-data-parser. If not, see <http://www.gnu.org/licenses/>.
"""

from .baseparser import *


class ActressesParser(BaseParser):
    r"""
    RegExp: /(.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$/gm
    pattern: (.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$
    flags: gm
    12 capturing groups:
        group 1: (.*?)                          surname, name
        group 2: #TITLE (UNIQUE KEY)
        group 3: (.*? \(\S{4,}\))               movie name + year
        group 4: (\(\S+\))                      type ex:(TV)
        group 5: (\{(.*?) ?(\(\S+?\))?\})       series info ex: {Ally Abroad (#3.1)}
        group 6: (.*?)                          episode name ex: Ally Abroad
        group 7: (\(\S+?\))                     episode number ex: (#3.1)
        group 8: (\{\{SUSPENDED\}\})            is suspended?
        group 9: (\(.*?\))                      info 1
        group 10: (\(.*\))                      info 2
        group 11: (\[.*\])                      role
        group 12: (<.*>)
    """

    # properties
    # Raw string, so the regex escapes reach the re module unchanged.
    base_matcher_pattern = r'(.*?)\t+((.*? \(\S{4,}\)) ?(\(\S+\))? ?(?!\{\{SUSPENDED\}\})(\{(.*?) ?(\(\S+?\))?\})? ?(\{\{SUSPENDED\}\})?)\s*(\(.*?\))?\s*(\(.*\))?\s*(\[.*\])?\s*(<.*>)?$'
    input_file_name = "actresses.list"
    number_of_lines_to_be_skipped = 241
    db_table_info = {
        'tablename' : 'actresses',
        'columns' : [
            {'colname' : 'name', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'surname', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'title', 'colinfo' : DbScriptHelper.keywords['string'] + '(255) NOT NULL'},
            {'colname' : 'info_1', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'info_2', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'},
            {'colname' : 'role', 'colinfo' : DbScriptHelper.keywords['string'] + '(127)'}
        ],
        'constraints' : 'PRIMARY KEY(title)'
    }
    end_of_dump_delimiter = ""

    name = ""
    surname = ""

    def __init__(self, preferences_map):
        super(ActressesParser, self).__init__(preferences_map)
        self.first_one = True

    def parse_into_tsv(self, matcher):
        is_match = matcher.match(self.base_matcher_pattern)

        if(is_match):
            if(len(matcher.group(1).strip()) > 0):
                namelist = matcher.group(1).split(', ')
                if(len(namelist) == 2):
                    self.name = namelist[1]
                    self.surname = namelist[0]
                else:
                    self.name = namelist[0]
                    self.surname = ""

            self.tsv_file.write(self.name + self.seperator + self.surname + self.seperator + self.concat_regex_groups([2,9,10,11], None, matcher) + "\n")
        elif(len(matcher.get_last_string()) == 1):
            pass
        else:
            logging.critical("This line is fucked up: " + matcher.get_last_string())
            self.fucked_up_count += 1

    def parse_into_db(self, matcher):
        is_match = matcher.match(self.base_matcher_pattern)

        if(is_match):
            if(len(matcher.group(1).strip()) > 0):
                namelist = matcher.group(1).split(', ')
                if(len(namelist) == 2):
                    self.name = namelist[1]
                    self.surname = namelist[0]
                else:
                    self.name = namelist[0]
                    self.surname = ""

            if(self.first_one):
                self.sql_file.write("(\"" + self.name + "\", \"" + self.surname + "\", " + self.concat_regex_groups([2,9,10,11], [2,3,4,5], matcher) + ")")
                self.first_one = False
            else:
                self.sql_file.write(",\n(\"" + self.name + "\", \"" + self.surname + "\", " + self.concat_regex_groups([2,9,10,11], [2,3,4,5], matcher) + ")")

        elif(len(matcher.get_last_string()) == 1):
            pass
        else:
            logging.critical("This line is fucked up: " + matcher.get_last_string())
            self.fucked_up_count += 1
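
Both `parse_into_db` methods emit one parenthesised value tuple per matched line and use the `first_one` flag to decide whether a `,\n` separator goes in front. A possible alternative, sketched below, keeps the separator logic and the quoting in one place while still writing row by row. The helper names `sql_quote` and `write_rows` are hypothetical, and the escaping rules (backslash-escaping quotes, `NULL` for missing values) are assumptions in the style of MySQL string literals, not the behaviour of the project's `concat_regex_groups`.

```python
import io


def sql_quote(value):
    """Quote a value for a SQL VALUES list (assumed MySQL-style escaping)."""
    if value is None:
        return "NULL"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def write_rows(out, rows):
    """Write rows as comma-separated tuples without keeping a first_one flag."""
    for index, row in enumerate(rows):
        if index:
            out.write(",\n")  # separator before every row except the first
        out.write("(" + ", ".join(sql_quote(value) for value in row) + ")")


# Invented rows, written to an in-memory buffer for illustration.
buffer = io.StringIO()
write_rows(buffer, [("John", "Doe"), ("Jane", None)])
dump = buffer.getvalue()
```

Because `write_rows` accepts any iterable and writes as it goes, it should work with a generator over a large dump file without holding all rows in memory.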