To save parsed output #3
Open
beniza wants to merge 11 commits into baijum:master from beniza:develop

Commits (11)
5438a2d  add output and errorlog (beniza)
5f80e7e  cleanup input; generate output (beniza)
d5c06f2  removed gitignore from tracking (beniza)
82c2316  add feature to generate sfm (beniza)
1954197  add section describing the lexical entry (beniza)
846e0cf  Add description (beniza)
d90616b  fix a typo (beniza)
9c627d3  Merge branch 'master' of https://github.com/beniza/bailey into develop (beniza)
4e17176  add output data (beniza)
02088de  add mdf ref, reorder entry fields (beniza)
760a7de  move documentation files to the doc folder (beniza)
@@ -1 +1,25 @@
# Bailey

Bailey is a fully customizable, PEG-based data parsing tool, primarily designed to parse dictionary data from plain text files.

## How to use Bailey

At the moment there are no executables available for Bailey. We might release a `cli` later. In the meantime, you can use the utility by cloning this repo and running it on your local system.

### Step 1: Install dependencies
> $ python setup.py install

### Step 2: Run Bailey
> $ python bailey/baily.py path/to/dictionary_plain_text_file

Running the above generates two outputs: 1) a dictionary representation of all the valid entries (entries that match the expression in the grammar) in the dictionary_plain_text_file, and 2) an `error.log` file listing every invalid entry in the file with its line number.

## Output
Bailey currently outputs data in Multi-Dictionary Formatter (MDF) version 4.0 format. For more info, please see the [description](docs/lexical_entry.md) of entries in MDF.

## Testing
To test the script, simply run

> $ python baily/baily.py

This should output a sample dictionary format.
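The two outputs described in the README can be sketched as a single pass over the input lines. This is a minimal illustration, not Bailey's actual code: `split_entries` and the toy `parse_line` stand-in are hypothetical names, and a real run would call the PEG grammar instead of the stand-in.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_entries(
    lines: Iterable[str], parse: Callable[[str], Dict]
) -> Tuple[List[Dict], List[str]]:
    """Parse each line; collect parsed entries and a log of failing lines."""
    entries, errors = [], []
    # enumerate(..., start=1) yields the 1-based line numbers used in the log
    for line_number, line in enumerate(lines, start=1):
        try:
            entries.append(parse(line))
        except ValueError:
            errors.append("{}\t{}".format(line_number, line.rstrip()))
    return entries, errors


def parse_line(line: str) -> Dict:
    """Toy stand-in for the grammar (assumption): expects 'headword, rest'."""
    headword, sep, rest = line.partition(",")
    if not sep or not headword.strip():
        raise ValueError("not a valid entry")
    return {"lx": headword.strip(), "rest": rest.strip()}


entries, errors = split_entries(
    ["abandon, v. x.\n", "no separator here\n"], parse_line
)
```

Here `entries` holds the one line that parsed and `errors` holds the rejected line prefixed with its line number, which mirrors the dictionary output and `error.log` pair described above.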
@@ -94,17 +94,20 @@
 grammar = Grammar(
     r"""
     expr = (entry / emptyline )*
-    entry = hash headword comma pos ws senses subentry emptyline
+    entry = hash headword comma pos ws senses subentry period emptyline
+    # entry = hash headphrase comma pos ws senses subentry period emptyline
     hash = (~"#")*
+    # headphrase = headword (ws headword)*
     headword = ~"[A-Z 0-9 -]*"i
-    pos = (ws ~"[a-z]+\.")+
+    pos = (ws ~"[a-z]+[\., ]")+
     subentry = (semicolon ws senses)*
     senses = (sense comma)* sense
     sense = (ml ws ml)* ml
     ml = ~"[\u0d00-\u0d7f]*"
-    semicolon = ~";"
+    semicolon = ~"[;:]"

Author comment (on `semicolon = ~"[;:]"`): In many places the keyboardists made typos, putting a : in place of a ;. Since we are not preserving the data, I thought of bypassing them.

     comma = ~","
     ws = ~"\s*"
+    period = ~"."
     emptyline = ws+
     """
 )

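Each `~"..."` term in the grammar is a regular expression, so the effect of the changed patterns can be checked with Python's `re` module on its own. The sketch below uses made-up sample strings; it only illustrates the character classes and is not part of the PR.

```python
import re

# New separator pattern: accepts the intended ';' and the mistyped ':'
separator = re.compile(r"[;:]")
assert separator.fullmatch(";")
assert separator.fullmatch(":")
assert not separator.fullmatch(",")

# New part-of-speech pattern: letters followed by '.', ',' or a space
pos = re.compile(r"[a-z]+[\., ]")
assert pos.fullmatch("n.")
assert pos.fullmatch("adj,")
assert not pos.fullmatch("n")

# An unescaped '.' matches any character except a newline,
# while '\.' matches only a literal full stop.
any_char = re.compile(r".")
literal_period = re.compile(r"\.")
assert any_char.fullmatch("x")
assert not literal_period.fullmatch("x")
assert literal_period.fullmatch(".")
```

If `period` is meant to match only a full stop, escaping it as `~"\."` would be one way to express that. Whether the looser match is intentional is not stated in the PR.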
@@ -126,11 +129,11 @@ def visit_entry(self, node, visited_children):
         if visited_children[0].lstrip().startswith("#"):
             return
         output["lx"] = visited_children[1]
-        output["tx"] = datetime.date.today().isoformat()
         output["ps"] = visited_children[3]
-        output["senses"] = visited_children[5]
+        output["sn"] = visited_children[5]
         if visited_children[6]:
             output["se"] = visited_children[6]
+        output["dt"] = datetime.date.today().isoformat()
         return output

     def visit_hash(self, node, visited_children):

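As a rough illustration of the record that `visit_entry` builds after this change, the sketch below assembles the same keys by hand. `build_entry` is a hypothetical helper and the sample values are placeholders; the real method takes its values from the parse tree.

```python
import datetime


def build_entry(headword, pos, senses, subentries=None):
    """Assemble an entry dict using the keys from the hunk above."""
    entry = {
        "lx": headword,  # headword
        "ps": pos,       # part of speech
        "sn": senses,    # senses
    }
    if subentries:       # 'se' is only added when subentries exist
        entry["se"] = subentries
    # Date stamp goes last, as an ISO date (YYYY-MM-DD)
    entry["dt"] = datetime.date.today().isoformat()
    return entry


entry = build_entry("abandon", "v.", ["sense one", "sense two"])
with_sub = build_entry("abandon", "v.", ["sense one"], ["sub sense"])
```

Because Python dicts preserve insertion order, moving the date assignment to the end of the method also moves the date field to the end of each written record.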
@@ -172,17 +175,60 @@ def generic_visit(self, node, visited_children):
         """ The generic visit method. """
         return visited_children or node

+def parseData(data):
+    tree = grammar.parse(data)
+    dv = DictVisitor()
+    output = dv.visit(tree)
+    return (output)
+
+def printOutput(item):
+    if type(item) == list:
+        for elem in item:
+            printOutput(elem)
+    elif type(item) == dict:
+        for k, v in item.items():
+            if k.strip() == "lx":
+                o.write("\n")
+            o.write("\n\\{}\t".format(k))
+            printOutput(v)
+    elif type(item) == str:
+        o.write(item.strip())
+    else:
+        # pass
+        print("\\warn\tError! {}".format(item))
+
 def main():
     global data
     if len(sys.argv) > 1:
         filelocation = sys.argv[1]
         f = open(filelocation, mode="r", encoding="utf-8")
-        data = f.read()
-
-    tree = grammar.parse(data)
-    dv = DictVisitor()
-    output = dv.visit(tree)
-    print(output)
+        dataset = f.readlines()
+        i = 1
+        log = ""
+        output = []
+        for data in dataset:
+            try:
+                output.append(parseData(data))
+                i += 1
+            except Exception as e:
+                log += ("{}\t{}".format(i, data))
+                # print("Error on line {line_number}\t{missing_data}\t{err}".format(missing_data=data, err=e, line_number=i))
+                i += 1
+                pass
+    parseData(data)
+
+    global o
+    o = open("./data/output/dict.txt", mode="w", encoding="utf-8")
+    o.write("\\_sh v3.0 231 MDF 4.0\n")
+    # # o.write(str(output))
+    # pickle.dump(output, o)
+    printOutput(output)
+    o.close()
+
+    if(len(log)):
+        errorLog = open("./data/output/error.log", mode="w", encoding="utf-8")
+        errorLog.write(log)
+        errorLog.close()

 if __name__ == "__main__":
     main()
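The recursive writer in this hunk can also be expressed without the global file handle. The following is a sketch under assumptions, not a proposed replacement: `write_sfm` is a hypothetical name, unsupported values raise instead of printing a warning, the output is collected in an `io.StringIO` so it can be inspected, and the header line is copied from the diff.

```python
import io


def write_sfm(item, out):
    """Recursively write parsed entries as backslash-coded lines to `out`."""
    if isinstance(item, list):
        for elem in item:
            write_sfm(elem, out)
    elif isinstance(item, dict):
        for key, value in item.items():
            if key.strip() == "lx":
                out.write("\n")  # blank line before each new entry
            out.write("\n\\{}\t".format(key))
            write_sfm(value, out)
    elif isinstance(item, str):
        out.write(item.strip())
    else:
        raise TypeError("unsupported item: {!r}".format(item))


buffer = io.StringIO()
buffer.write("\\_sh v3.0 231 MDF 4.0\n")
write_sfm([{"lx": "abandon", "ps": "v."}], buffer)
text = buffer.getvalue()
```

With a real file, opening it as `with open(path, mode="w", encoding="utf-8") as out:` would close the handle automatically, including when an exception is raised part-way through.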
I've added this to capture headwords that consist of multiple words. Most of the exceptions are due to this.
However, I couldn't get it to work.
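One possible way to accept multi-word headwords is to fold the repetition into a single regular expression rather than a separate `headphrase` rule. This is an untested idea with respect to Bailey's grammar; the pattern and the sample strings below are assumptions for illustration, checked here only with Python's `re` module.

```python
import re

# One or more words of letters, digits or hyphens, separated by single spaces.
# Unlike "[A-Z 0-9 -]*", this pattern cannot match an empty string.
headphrase = re.compile(r"[A-Z0-9-]+(?: [A-Z0-9-]+)*", re.IGNORECASE)


def leading_headphrase(line):
    """Return the headword or phrase at the start of `line`, or None."""
    match = headphrase.match(line)
    return match.group(0) if match else None


single = leading_headphrase("abandon, v.")
multi = leading_headphrase("ice cream, n.")
missing = leading_headphrase(", n.")
```

In the grammar this might be written as `headword = ~"[A-Z0-9-]+( [A-Z0-9-]+)*"i`, reusing the `~"..."i` form already present in the file. Whether it resolves the exceptions would need to be verified against the real data.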