API Reference

AlgorithmMeta

class mnist_classifier.algorithm_meta.AlgorithmMeta(report_directory: str = None, test_suite_iter: int = None)

The parent class for all algorithms, containing the basic algorithm methods. Most of the algorithm logic lives here: apart from configuring the algorithm to match the given specs (such as the number of trees or hidden layers), the train/test mechanics are the same for every algorithm.

static calc_standard_error(error: float, sample_count: int)

Calculates the Wilson score interval with 95% confidence, based on the paper by Edwin B. Wilson [1].

[1] Edwin B. Wilson (1927), Probable Inference, the Law of Succession, and Statistical Inference, Journal of the American Statistical Association.

The result is interpreted as the +/- margin on the algorithm’s error rate. For example, an error rate of 0.02 with 50 samples and a confidence of 95% yields 0.0388, so the error rate can be read as 0.02 +/- 0.0388.

Parameters
  • error (float) – the error rate of the test results

  • sample_count (int) – the number of test samples used

Returns

the standard error

Return type

float
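
As a quick check of the figure quoted above, the sketch below computes a 95% half-width in two ways. The helper names are hypothetical and nothing here is taken from the library’s source. It only shows that the quoted 0.0388 agrees with the normal-approximation half-width z * sqrt(e * (1 - e) / n), and that the half-width of Wilson’s interval for the same inputs comes out somewhat wider.

```python
import math

Z_95 = 1.96  # two-sided z-value for 95% confidence


def normal_approx_half_width(error: float, sample_count: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval (hypothetical helper)."""
    return z * math.sqrt(error * (1.0 - error) / sample_count)


def wilson_half_width(error: float, sample_count: int, z: float = Z_95) -> float:
    """Half-width of the Wilson score interval (hypothetical helper)."""
    n = sample_count
    spread = math.sqrt(error * (1.0 - error) / n + z * z / (4.0 * n * n))
    return z * spread / (1.0 + z * z / n)


# The documented example: error rate 0.02 measured on 50 samples.
print(round(normal_approx_half_width(0.02, 50), 4))  # 0.0388
print(round(wilson_half_width(0.02, 50), 4))
```

Which of the two formulas calc_standard_error actually applies should be confirmed against the source code.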

display_results(cache)

Displays various graphs that are pertinent to the algorithm’s score (such as a confusion matrix)

Parameters

cache (dict) – the arguments to display (varies from algorithm to algorithm)

eval_train_test_cache(train_data, train_labels, test_data, test_labels)

Generates a cache object containing the train and test data, labels, and accuracies, and a copy of the model.

Parameters
  • train_data (numpy.array) – the raw training data

  • train_labels (numpy.array) – the ground truth of the training data

  • test_data (numpy.array) – the data to test against

  • test_labels (numpy.array) – the ground truth of the test data

Returns

the actual data, predicted data, accuracy, and model in a dict format

Return type

dict

fit(data, targets)

Fits the internal model on the given data, and returns it

Parameters
  • data (numpy.array) – the data on which you want to fit

  • targets (numpy.array) – the target classes of the training data you want to fit

Returns

the trained model

Return type

sklearn.base.BaseEstimator

load_model(filepath)

Loads the model from disk into the object’s model attribute

Parameters

filepath (str) – the path of the model on disk

predict(data_to_predict)

Returns the predicted class y for each input sample

Parameters

data_to_predict (numpy.array) – Sample data set on which to generate predictions

Returns

array with the predicted class labels

Return type

numpy.array

print_results(cache)

Prints the results of the classification, and returns them as a pandas DataFrame

Parameters

cache (dict) – the cache of a run_classification() function call.

Returns

the classification results as a single-line data frame

Return type

pandas.DataFrame

run_classification(train_data, train_labels, test_data, test_labels, model_to_save=None, model_to_load=None)

Trains and tests the classification

Parameters
  • train_data (numpy.array) – the data to train on

  • train_labels (numpy.array) – the labels of the train data

  • test_data (numpy.array) – the data to use to run predictions

  • test_labels (numpy.array) – the ground truth of the test data

  • model_to_load (str) – filepath of a saved model to load instead of train

  • model_to_save (str) – filepath on which to save the trained model

Returns

a collection with the predictions and accuracy

Return type

dict

save_model(filepath)

Saves the trained model attribute to disk

Parameters

filepath (str) – the destination filepath on disk
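
The save/load round trip can be sketched as follows. This assumes pickle serialization, which the reference does not state, and uses a plain dict as a hypothetical stand-in for a trained model; the helper names are illustrative and are not the library’s methods.

```python
import os
import pickle
import tempfile


def save_object(obj, filepath):
    """Serialize obj to filepath (assumption: pickle format)."""
    with open(filepath, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(filepath):
    """Read back an object written by save_object."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a trained model.
stand_in_model = {"kind": "stand-in", "weights": [0.1, 0.2, 0.3]}

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_object(stand_in_model, model_path)
restored = load_object(model_path)
print(restored == stand_in_model)  # True
```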

save_results(results: pandas.core.frame.DataFrame)

Saves the results to disk as a CSV file if report_directory is not None. If the output report file already exists, the new lines are appended to it.

Parameters

results (pandas.DataFrame) – the results table to save to disk.
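
The append-if-exists behaviour described above can be sketched with the standard csv module. This is an illustration only: the library takes a pandas DataFrame, whereas this sketch uses plain dicts, and the file name, column names, and values are hypothetical.

```python
import csv
import os
import tempfile


def append_rows_to_csv(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new_file = not os.path.exists(path)
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)


# Hypothetical columns and values, for illustration only.
report_path = os.path.join(tempfile.mkdtemp(), "results.csv")
fields = ["algorithm", "test_accuracy"]
append_rows_to_csv(report_path, fields, [{"algorithm": "rf", "test_accuracy": 0.95}])
append_rows_to_csv(report_path, fields, [{"algorithm": "mlp", "test_accuracy": 0.97}])

with open(report_path, newline="") as handle:
    saved_rows = list(csv.DictReader(handle))
print(len(saved_rows))  # 2
```

Writing the header only on the first call keeps the file readable as a single table after repeated runs.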

RandomForest

class mnist_classifier.random_forest.RandomForest(n_estimators, max_depth, criterion, random_seed: int = None, report_directory: str = None, test_suite_iter: int = None)

Random Forest which inherits from the AlgorithmMeta class

display_results(cache)

Displays various graphs that are pertinent to the algorithm’s score (such as a confusion matrix)

Parameters

cache (dict) – the arguments to display (varies from algorithm to algorithm)

load_model(filepath)

Loads the model from disk into the object’s model attribute

Parameters

filepath (str) – the path of the model on disk

print_results(cache)

Prints the results of the classification, and returns them as a pandas DataFrame

Parameters

cache (dict) – the cache of a run_classification() function call.

Returns

the classification results as a single-line data frame

Return type

pandas.DataFrame

MLP

class mnist_classifier.mlp.MLP(hidden_layer_sizes: tuple = (10, 10, 10), alpha: float = 0.0001, batch_size='auto', max_iter: int = 200, verbose: bool = False, random_seed: int = None, report_directory: str = None, test_suite_iter: int = None)

A basic MLP classifier

display_results(cache)

Displays various graphs that are pertinent to the algorithm’s score (such as a confusion matrix)

Parameters

cache (dict) – the arguments to display (varies from algorithm to algorithm)

load_model(filepath)

Loads the model from disk into the object’s model attribute

Parameters

filepath (str) – the path of the model on disk

print_results(cache)

Prints the results of the classification, and returns them as a pandas DataFrame

Parameters

cache (dict) – the cache of a run_classification() function call.

Returns

the classification results as a single-line data frame

Return type

pandas.DataFrame

Dataset

Downloads and prepares the dataset for use with other algorithms

mnist_classifier.dataset.load_test_data()

Loads the test data

Returns

  • data (numpy.array) – 2D numpy array with the image data (one image per row)

  • labels (numpy.array) – 1D numpy array with the label for each corresponding image

mnist_classifier.dataset.load_train_data()

Loads the training data

Returns

  • data (numpy.array) – 2D numpy array with the image data (one image per row)

  • labels (numpy.array) – 1D numpy array with the label for each corresponding image

Visualizer

mnist_classifier.visualizer.display_loss_curve(losses, save_location: str = None)

Plots and displays the loss curve (usually for Neural Network models)

Parameters
  • save_location (str) – the location to save the figure on disk. If None, the plot is displayed on runtime and not saved.

  • losses (numpy.array) – the losses array of the MLP classifier’s training.

Returns

the figure

Return type

matplotlib.pyplot.figure

mnist_classifier.visualizer.display_mlp_coefficients(coefficients, rows=4, cols=4, save_location: str = None)

Shows the coefficients (weights) of the MLP’s first layer

The coefficients of the first rows*cols neurons are displayed. If rows*cols is greater than the number of neurons, all the neurons are displayed. If there are more neurons than rows*cols, only the first rows*cols are displayed.

Parameters
  • coefficients (numpy.array) – 2D numpy array containing the input coefficients (or weights) of the MLP’s hidden layers. Only the first layer’s coefficients are displayed

  • rows (int) – the number of rows to display in the figure

  • cols (int) – the number of columns to display in the figure

  • save_location (str) – the location to save the figure on disk. If None, the plot is displayed on runtime and not saved.

Returns

the figure

Return type

matplotlib.pyplot.figure
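
The rows*cols selection rule described above can be sketched in plain Python. The helper name is hypothetical, and the layout of the coefficient matrix (one row per input pixel, one column per neuron) is an assumption; the reference only says the array is 2D.

```python
def first_neuron_images(coefficients, rows=4, cols=4, side=28):
    """Return up to rows*cols square weight images, one per first-layer neuron.

    Assumption: coefficients has side*side rows (pixels) and one column per neuron.
    """
    neuron_count = len(coefficients[0])
    shown = min(neuron_count, rows * cols)  # never more than the grid can hold
    images = []
    for neuron in range(shown):
        column = [row[neuron] for row in coefficients]
        # Reshape the flat pixel column into side x side rows.
        images.append([column[r * side:(r + 1) * side] for r in range(side)])
    return images


# Toy input: 4 pixels (a 2x2 image) feeding 3 neurons, shown on a 1x2 grid.
toy = [[pixel * 10 + neuron for neuron in range(3)] for pixel in range(4)]
images = first_neuron_images(toy, rows=1, cols=2, side=2)
print(len(images))  # 2
print(images[0])    # [[0, 10], [20, 30]]
```

With 3 neurons and a 1x2 grid only the first two neurons are kept; with a 4x4 grid all three would be shown.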

mnist_classifier.visualizer.display_rf_feature_importance(cache, save_location: str = None)

Displays which pixels have the most influence on the model’s decision. This is based on the feature_importances_ array of sklearn.ensemble.RandomForestClassifier.

Parameters
  • save_location (str) – the location to save the figure on disk. If None, the plot is displayed on runtime and not saved.

  • cache (dict) – the cache dict returned by the classifier. Must at least include [‘actual’, ‘prediction’] objects, each with [‘train’, ‘test’] arrays

Returns

the figure

Return type

matplotlib.pyplot.figure

mnist_classifier.visualizer.display_train_test_matrices(cache, save_location: str = None)

Displays the train and test confusion matrices

Parameters
  • save_location (str) – the location to save the figure on disk. If None, the plot is displayed on runtime and not saved.

  • cache (dict) – the cache dict returned by the classifier. Must at least include [‘actual’, ‘prediction’] objects, each with [‘train’, ‘test’] arrays

Returns

the figure

Return type

matplotlib.pyplot.figure
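
To illustrate the cache layout described above, the sketch below builds train and test confusion matrices from a dict with 'actual' and 'prediction' entries, each holding 'train' and 'test' sequences. The helper and the toy labels are hypothetical, and plain lists stand in for numpy arrays; plotting is left out.

```python
def confusion_matrix(actual, predicted, class_count):
    """Count outcomes: rows are true classes, columns are predicted classes."""
    matrix = [[0] * class_count for _ in range(class_count)]
    for truth, guess in zip(actual, predicted):
        matrix[truth][guess] += 1
    return matrix


# Toy cache with three classes (0, 1, 2).
cache = {
    "actual": {"train": [0, 1, 2, 2], "test": [0, 1, 2]},
    "prediction": {"train": [0, 1, 2, 1], "test": [0, 2, 2]},
}

train_matrix = confusion_matrix(cache["actual"]["train"], cache["prediction"]["train"], 3)
test_matrix = confusion_matrix(cache["actual"]["test"], cache["prediction"]["test"], 3)
print(train_matrix)  # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(test_matrix)   # [[1, 0, 0], [0, 0, 1], [0, 0, 1]]
```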

Report Manager

Handles everything concerning reporting.

mnist_classifier.report_manager.load_test_suite_conf(filepath: str)

Loads a test suite JSON configuration file and returns an array of parameters to pass to the argument parser. The JSON file should be formatted like the test_suite_example.json file in this repository. The names of the dict keys for each test should be

Parameters

filepath (str) – the filepath to the test suite JSON.

Returns

a list of parameter lists corresponding to the configurations of each of the tests to be run

Return type

list

mnist_classifier.report_manager.prepare_report_dest(report_filepath: str)

Prepares the destination output folder. Checks whether the location already exists; if it does, a unique version is created with an auto-incremented suffix. For example, if the input is “my_report” and a folder “my_report” already exists, a folder called “my_report_1” is created.

Parameters

report_filepath (str) – the target folder where the report should be created

Returns

the actual filepath that was created

Return type

str
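
The auto-increment behaviour can be sketched as below. The function name is hypothetical, and the numbering beyond “my_report_1” (continuing with “_2”, “_3”, and so on) is an assumption extrapolated from the single example in the description.

```python
import os
import tempfile


def unique_directory(target: str) -> str:
    """Create target, or target_1, target_2, ... when earlier names are taken."""
    candidate = target
    suffix = 0
    while os.path.exists(candidate):
        suffix += 1
        candidate = f"{target}_{suffix}"
    os.makedirs(candidate)
    return candidate


base = os.path.join(tempfile.mkdtemp(), "my_report")
first = unique_directory(base)
second = unique_directory(base)
print(os.path.basename(first))   # my_report
print(os.path.basename(second))  # my_report_1
```

Returning the path that was actually created lets the caller write reports without checking for name collisions itself.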