I’m sure that every single person in AM221, perhaps every single person at Harvard, has heard of and used Wikipedia at some point in their life. The online encyclopedia serves as a vast resource on topics both extremely well documented and obscure. Its most interesting aspect is that the entries are user created and user administered. That is to say, I can go to Wikipedia right now and create a brand new page on a topic personal to myself (assuming, of course, that the matter is important enough for other people to approve and that the information I present is factual). But with this vast array of information come the challenges associated with big data, such as how to properly classify and link together the millions of articles written and updated each day.
Wikipedia uses hierarchies as its classification method for organizing text documents. Imagine an article on the topic of Noah’s Ark; it would be classified under overarching categories such as Bible, Christianity, and Religion. Of course, not every article is so easily classified, and not every classification falls into such a neatly nested structure. Therefore, along with the widespread use of hierarchies comes the need for automated classification of new documents into the categories of the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data (imagine an obscure topic being created that in turn requires a new category altogether): as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence among the classes poses both challenges and opportunities for new learning methods.
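To make the hierarchy concrete, the nesting described above can be sketched as a child-to-parent mapping; the category names below are illustrative only and not taken from the actual competition data.

```python
# Hypothetical sketch: a category hierarchy stored as a child -> parent map.
# The category names are illustrative, not the real Wikipedia labels.
parent = {
    "Noah's Ark": "Bible",
    "Bible": "Christianity",
    "Christianity": "Religion",
}

def ancestors(category):
    """Walk up the hierarchy, returning the chain of enclosing categories."""
    path = []
    while category in parent:
        category = parent[category]
        path.append(category)
    return path

print(ancestors("Noah's Ark"))  # ['Bible', 'Christianity', 'Religion']
```

A classifier that assigns an article to a leaf category implicitly assigns it to every ancestor as well, which is part of what makes the hierarchical setting different from flat classification.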
For our project, we are given a list of Wikipedia hierarchies as well as a set of training and test data. The task, naturally, is to use the training set to create a classifier that can accurately categorize the test set. Some of the issues associated with classification at this magnitude include cyclic hierarchies and memory capacity, but with careful cleaning of the data and the use of sparse matrices, the problem can be represented as a standard classification problem. Most previous attempts have used KNN (k-nearest neighbors) or a multinomial naive Bayes classifier, but we think this problem also lends itself to linear programming and the use of SVMs to build a classifier. We will attempt this non-probabilistic approach and see how it stacks up against the probabilistic classifiers often used in statistics.
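As a minimal sketch of the sparse-matrix-plus-SVM pipeline described above (not our actual code or data), scikit-learn's `TfidfVectorizer` produces a `scipy.sparse` document-term matrix that `LinearSVC` consumes directly, so memory stays manageable even with a large vocabulary. The tiny corpus and labels here are made up for illustration.

```python
# Illustrative sketch of training a linear SVM on sparse text features.
# The documents and labels are fabricated; the real task has far more
# classes, arranged in a hierarchy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "noah ark flood bible story",
    "genome sequencing dna biology",
    "gospel church christianity religion",
    "protein folding cell biology",
]
train_labels = ["Religion", "Biology", "Religion", "Biology"]

# fit_transform returns a scipy.sparse matrix, never a dense array.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# LinearSVC accepts the sparse matrix as-is.
clf = LinearSVC()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["bible ark flood"])
print(clf.predict(X_test))
```

The same shape of pipeline scales to the full dataset because both the vectorizer and the linear SVM operate on sparse representations end to end.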