My AI project uses software from a variety of sources, on a variety of platforms, each doing a different job. Briefly:
- software to calculate measures of entropy: joint entropy (JE), mutual information (MI), and variation of information (VI). Written in Java by a group mate.
- software to split our datasets into 5 folds. Written in C# by another group mate.
- software to generate association rules for our dataset. A binary provided by Christian Borgelt.
- software to measure the performance (accuracy, precision, recall) of the class association rules. Written in C# by a group mate.
- Excel to look at the data.
Most of those have GUIs that my group originally intended to use. Because I hate wasting time and reaching for the mouse, I've since built a BASH script framework to automate it all.
Our Framework
First, I went to each piece of software we owned and modified it to run headless: instead of giving input through the GUI, we pass it through command-line arguments, and instead of displaying output in the GUI, it writes to a file or to stdout (to be piped into a file). Borgelt's apriori software already ships as a command-line binary, so we just worked out which parameters reproduce the behaviour of the GUI we had planned to use.
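As a concrete example, a headless run of the entropy tool might look something like the following. The jar name, flags, and file names here are hypothetical placeholders, not the tool's actual interface:

    # Hypothetical invocation; entropy.jar and its flags are stand-ins
    # for the real tool's command-line interface.
    java -jar entropy.jar --measure MI --train folds/iris_fold1_train.csv \
        > entropy/iris_MI_fold1.txt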
Then I wrote multiple small scripts, each handling one task, and chained them together, so that starting from a single directory of CSV datasets we can:
- generate the entropy measures,
- generate association rules,
- select only the rules we want,
- test their classification ability,
- do feature selection on the original datasets across sixteen proposed entropy thresholds,
- generate rules on those feature-selected sets (sixteen for each dataset-measure combination), and
- test classification performance for those reduced feature sets.
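The driver that chains them might look roughly like this. Every script name, path, and the threshold list below are assumptions standing in for our real ones:

    #!/bin/bash
    # Sketch of the chaining; all names here are illustrative, not our actual files.
    set -e
    measures="JE MI VI"
    thresholds=$(seq 0.05 0.05 0.80)   # sixteen thresholds; values assumed

    for dataset in datasets/*.csv; do
        name=$(basename "$dataset" .csv)
        ./split_folds.sh "$dataset"                      # the C# 5-fold splitter
        for measure in $measures; do
            ./compute_entropy.sh "$name" "$measure"      # the Java entropy tool
            for t in $thresholds; do
                ./select_features.sh "$name" "$measure" "$t"
                ./generate_rules.sh  "$name" "$measure" "$t"   # wraps Borgelt's apriori
                ./test_rules.sh      "$name" "$measure" "$t"   # the C# performance tester
            done
        done
    done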
Then there's a script that averages and batches together all the performance results across the dataset-measure-threshold-fold combinations (5 datasets x 3 measures x 16 thresholds x 5 folds = 1,200 records) and produces a single 240-row spreadsheet with the fields dataset, measure, threshold, average accuracy, average precision, average recall, and time to generate the rules.
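A compact way to do that averaging is with awk, assuming each per-fold result is a CSV line of the form dataset,measure,threshold,fold,accuracy,precision,recall (a layout I'm assuming here purely for illustration):

    #!/bin/bash
    # Sketch of the fold-averaging step; the results layout is assumed,
    # and the rule-generation time column is omitted for brevity.
    echo "dataset,measure,threshold,avg_accuracy,avg_precision,avg_recall" > summary.csv
    awk -F, -v OFS=, '
        {
            key = $1 OFS $2 OFS $3     # group by dataset,measure,threshold
            acc[key]  += $5
            prec[key] += $6
            rec[key]  += $7
            n[key]++                   # should reach 5, one per fold
        }
        END {
            for (k in n)
                print k, acc[k]/n[k], prec[k]/n[k], rec[k]/n[k]
        }
    ' results/*.csv >> summary.csv

The 1,200 input rows collapse to 240 output rows, one per dataset-measure-threshold triple.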
It completes the first four datasets in around half an hour. The fifth is larger and takes a few hours, so I usually start working with the results from the first four in the meantime. :D