Testing and Development
A large portion of the repository is dedicated to corpus extraction and model creation. Scripts that facilitate these tasks are located in the tests and scripts directories.
Testing
Testing files are in the tests directory.
The auto subdirectory contains a set of unit tests that exercise and verify system operation. The script RunAllUnitTests.py executes all tests in the auto directory and prints a final summary. Note that it is set up to run from the tests directory. If you run it from elsewhere, it likely won't find the individual test files without some modification.
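
Because the runner depends on the working directory, one way to call it from anywhere is to set the working directory explicitly. The following is a minimal sketch, not part of the repository: the function name run_all_unit_tests, the tests_dirname default, and the script_relpath default (the runner's location relative to the tests directory) are all assumptions to adjust to your checkout.

```python
import subprocess
import sys
from pathlib import Path


def run_all_unit_tests(repo_root, tests_dirname="tests",
                       script_relpath="RunAllUnitTests.py"):
    """Run the unit-test runner with the tests directory as the working directory.

    tests_dirname and script_relpath are assumptions about the repository
    layout; change them if the runner lives elsewhere.
    """
    tests_dir = Path(repo_root) / tests_dirname
    script = tests_dir / script_relpath
    if not script.is_file():
        raise FileNotFoundError(f"Expected test runner at {script}")
    # Resolve the script path first so it stays valid after cwd changes.
    completed = subprocess.run(
        [sys.executable, str(script.resolve())],
        cwd=tests_dir,
    )
    return completed.returncode
```

The function returns the runner's exit code, so a caller can pass it straight to sys.exit.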
The manual directory contains various scripts used to exercise system functionality and facilitate debugging.
The accuracy directory contains scripts and modules used to create the accuracy data.
Development
Files in the scripts directory are predominantly used to build the resources needed to drive the run-time system. Directories are numbered to indicate the order in which they need to be run. Likewise, scripts in the directories have a numerical prefix to indicate order. Running them requires additional libraries, including nltk, keras, and a Keras backend such as tensorflow.
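
The numbering convention and the library requirements above can be checked before starting a build. The sketch below is illustrative only and is not part of the repository: the helper names missing_modules and ordered_scripts are hypothetical, and it assumes that the numbered scripts are .py files sitting directly inside the numbered directories.

```python
import importlib.util
import re
from pathlib import Path

_PREFIX = re.compile(r"^(\d+)")


def missing_modules(names=("nltk", "keras", "tensorflow")):
    """Return the module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def _order_key(path):
    """Sort key: integer prefix first, then the name to break ties."""
    match = _PREFIX.match(path.name)
    return (int(match.group(1)), path.name)


def _is_numbered(path):
    return _PREFIX.match(path.name) is not None


def ordered_scripts(scripts_dir):
    """List numbered scripts in run order: by directory prefix, then script prefix.

    Assumes scripts are .py files directly inside numbered directories;
    entries without a numerical prefix are skipped.
    """
    root = Path(scripts_dir)
    stages = sorted(
        (d for d in root.iterdir() if d.is_dir() and _is_numbered(d)),
        key=_order_key,
    )
    ordered = []
    for stage in stages:
        ordered.extend(sorted(
            (f for f in stage.glob("*.py") if _is_numbered(f)),
            key=_order_key,
        ))
    return ordered
```

Prefixes are compared as integers rather than as text, so a script numbered 2 sorts before one numbered 10 even when the prefixes are not zero-padded.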
There is a README.txt file in the 01_BuildLexicon directory with information on how to get started. Generally, you will need to:
- Create a data_repo directory or link in the main directory
- Download the various corpora used for building the resource (see the README.txt)
- Review the lemmatizer\config.py file. This contains the locations of data read and written. You will likely need to change the source locations for things like the Gutenberg and Billion Word corpora.
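
The first step in the list above can be scripted. This is a minimal sketch under stated assumptions: the helper name ensure_data_repo and its link_target parameter are hypothetical, and the sketch only creates the directory or link. It does not download corpora or edit the config file.

```python
from pathlib import Path


def ensure_data_repo(main_dir, link_target=None):
    """Create main_dir/data_repo, or symlink it to link_target if one is given.

    An existing directory or link is left untouched.
    """
    data_repo = Path(main_dir) / "data_repo"
    if data_repo.exists() or data_repo.is_symlink():
        return data_repo
    if link_target is None:
        data_repo.mkdir(parents=True)
    else:
        # Link to storage kept elsewhere, e.g. a larger disk holding the corpora.
        data_repo.symlink_to(Path(link_target).resolve(),
                             target_is_directory=True)
    return data_repo
```

Linking is useful when the corpora are too large to keep inside the checkout. Creating symbolic links may require elevated privileges on some platforms.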
There is no formal documentation for these scripts, but there are many comments inside the code. If you wish to dig into these files, plan to spend some time learning how the code operates, as the scripts are not intended for use by a casual end user.