Testing and Development
A large portion of the repository is dedicated to corpus extraction and model creation. Scripts that facilitate these tasks are located in the
Testing files are in the
The auto subdirectory contains a set of unit tests that exercise and verify system operation. The script RunAllUnitTests.py executes all tests in the auto directory and prints a final summary. Note that it is set up to run from the test directory. If you run it from elsewhere, it likely won't find the individual test files without some modification.
The manual directory contains various scripts used to exercise system functionality and facilitate debugging.
The accuracy directory contains scripts and modules used to create the accuracy data.
Files in the scripts directory are predominantly used to build the resources needed to drive the run-time system. Directories are numbered to indicate the order in which they need to be run. Likewise, the scripts within each directory have a numerical prefix to indicate order. Additional libraries are required to run these including,
keras and a Keras backend such as
There is a README.txt file in the 01_BuildLexicon directory with information on how to get started. Generally, you will need to:
- Create a data_repo directory or link in the main directory
- Download the various corpora used for building the resource (see the README.txt)
- Review the lemmatizer\config.py file. This contains the locations of the data that is read and written. You will likely need to change the source locations for things like the Gutenberg and Billion Word Corpora.
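The first step in the list can be sketched as follows. This is only an illustration: the helper name ensure_data_repo and its parameters are hypothetical, and only the data_repo name comes from the text above. It creates the directory, or links it to an existing data location if one is given. The contents of the config file are not shown here because they depend on where your corpora live.

```python
import os


def ensure_data_repo(main_dir, link_target=None, name="data_repo"):
    """Create <main_dir>/data_repo, or symlink it to link_target if given.

    Hypothetical helper; only the data_repo name comes from the documentation.
    """
    path = os.path.join(main_dir, name)
    if os.path.lexists(path):
        # Already present (directory or link): leave it alone.
        return path
    if link_target is None:
        os.makedirs(path)
    else:
        # Symlink creation may need extra privileges on Windows.
        os.symlink(os.path.abspath(link_target), path,
                   target_is_directory=True)
    return path
```

Linking is useful when the corpora are large and already stored on another drive, since nothing has to be copied into the repository.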
There is no formal documentation for these scripts, but there are many comments inside the code. If you wish to dig into these files, plan to spend some time learning how the code operates, as the scripts are not intended for use by a casual end-user.
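As a starting point for exploring the scripts, the numbered layout can be listed in run order with a short helper. This is a sketch under assumptions: the function names are hypothetical, it assumes every prefix is a leading run of digits (as in 01_BuildLexicon), and it only considers .py files. It does not execute anything.

```python
import os
import re

_PREFIX = re.compile(r"^(\d+)")


def numeric_prefix(name):
    """Return the leading number of a name, or None if it has none."""
    match = _PREFIX.match(name)
    return int(match.group(1)) if match else None


def list_in_run_order(scripts_dir):
    """Return (directory, script) pairs sorted by their numeric prefixes.

    Directories and files without a numeric prefix are skipped.
    """
    ordered = []
    dirs = [d for d in os.listdir(scripts_dir)
            if os.path.isdir(os.path.join(scripts_dir, d))
            and numeric_prefix(d) is not None]
    for d in sorted(dirs, key=lambda n: (numeric_prefix(n), n)):
        files = [f for f in os.listdir(os.path.join(scripts_dir, d))
                 if f.endswith(".py") and numeric_prefix(f) is not None]
        # Sort numerically so that a prefix of 2 comes before 10.
        for f in sorted(files, key=lambda n: (numeric_prefix(n), n)):
            ordered.append((d, f))
    return ordered
```

Printing the returned pairs gives a quick overview of the build pipeline before reading the comments in each script.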