Final Project: Hierarchical Classification on Wikipedia Articles


For our final project, James Yu and I (Irene Chen) examined hierarchical text classification and its application on Wikipedia article data. Inspired by a related Kaggle challenge, we were interested in expanding beyond our problem set classifying Twitter data into binary categories: instead, each article can be classified as one or many categories. As James explained earlier, Wikipedia uses hierarchies as the classification methods for the organization of text documents. That is, an article can be labeled as “France” or “Germany” and knowledge about which of those country labels it has makes it more or less likely that “Paris” or “Berlin” is labeled. It is important to note that not every category nests neatly, but hierarchical classification algorithms use pre-existing knowledge about classifications to make category classifications. In our project, we wanted to improve upon multiclass SVM performance by incorporating principles outlined by Wang and Casasent (2008).

Hierarchical Model

Although hierarchical models have the potential to become much more complex, we implemented an exploratory approach by training a classifier on higher-level categories (of which there are fewer) so that we could detect earlier on if an article does not belong to a higher-level category and therefore discard all lower-level categories. As we can see the image below, the Wikipedia dataset to which we applied our model consists of a few higher-level categories and many lower-level leaves.


Visualization of Wikipedia article categories and hierarchical links as graphed by Gephi

We take a bottoms-up approach to resolving the hierarchy structure of the classes. Starting from each leaf node c_i \in C, we determine $latex p_{ij}^1 \in P : p_{ij}^1 \to c_i$, this is to say we find the set of corresponding one-level-up parents to each leaf node $latex c_i$. The process can be repeated until each path reaches a terminal parent node so that

$latex p_{ij}^k = \emptyset : p_{ij}^k \to p_{ij}^{k-1}$

After reaching all terminal parent nodes, then by transforming the result classification vector from $latex Y = C \to P^k$ where $latex k$ is the highest degree of the terminal parent, we can train our SVM using the transformed vector to get a classifier that will map feature vectors of the test set onto posterior probability space. The overall idea here is that for each node, we can convert the outputs from the classifier into posterior probabilities. Based on some threshold requirement, we can determine which path along the hierarchy tree to follow, and thus ultimately arrive at the final leaf classes. In the case of an SVM, we can map the SVM outputs for the validation set data to a probability $latex p \in [0,1]$. To reduce the effect of outliers, we can use a sigmoid mapping

$latex P(y=1|t) = \dfrac{1}{1 + \exp(at+b)}$

where $latex t$ is the SVM output, and $latex y$ is the known class label. To estimate $latex a,b$, we can use the maximum likelihood estimate since the validation data is drawn independently.

$latex \arg \max_{a,b} \prod_i P(y=1 | t^i,a,b)$

Pragmatically, we can maximize the log likelihood and use gradient ascent to find $latex a,b$. Further, the leverage of the hierarchy structure allows us to avoid calculations of the feature vectors we already ruled out of the previous level parent, and thus each subsequent descent of the hierarchy structure requires less calculations. Nevertheless, this approach is still afflicted with the curse of dimensionality due to the sheer number of features as well as the depth of the hierarchy and size of the class set. Since our goal is ultimately to correctly classify the feature vectors based on leaf classes only, once we arrive at $latex P^1$, the one-level-up parent from the leaf class, then again based on threshold requirements we narrow down the samples of feature vectors. We perform a multilabel classification scheme upon the narrow set of feature vectors to receive the final classification.

One thing to note that this approach, while more computational intensive, gives the flexibility of classifying according to all nodes in the hierarchy, not just the leaf nodes, which could prove useful in another context. As we consider how to apply this to our dataset give our memory constraints (see below), we chose to only use the hierarchical approach for $latex k=2$ to avoid computational intensity but at the same time compare the usefulness and accuracy of the method.

Due to the limitation in sample size, as well as the vastness of the hierarchy tree, it appeared that almost all leaf classes had a set of parents, where the number of parents any leaf node had in common with other leaf nodes in the sample was minimal. This of course does not aid us in reducing computational complexity. To simply the problem further, we determined the $latex p$ most frequent parent and discarded the remainder, transforming our result vector to reflect only $latex p$ parents. This transformation allows us to finally leverage the hierarchy structure and thus narrow our feature vector set with the process described above.

Evaluation Metrics

We use the Macro F1-score to measure the performance of all the methods. Macro F1-score is a conventional metric used to evaluation classfication decisions. Let $latex tp_{c_i}, fp_{c_i}, fn_{c_i}$ be the true positives, false positives, and false negatives respectively for class $latex c_i$. The macro-averaged F1 $latex MaF$ is

MaF = \dfrac{2 \cdot MaP \cdot MaR}{MaP + MaR}
$latex MaP = \dfrac{\sum_{i=1}^{|C|} \dfrac{tp_{c_i}}{tp_{c_i} + fp_{c_i}}}{|C|} $
$latex MaR = \dfrac{\sum_{i=1}^{|C|} \dfrac{tp_{c_i}}{tp_{c_i} + fp_{p_i}}}{|C|}$

Note that $latex MaP$ and $latex MaR$ correspond to the precision and recall of the classification method, respectively. Precision ($latex MaP$) refers to the number of correct results divided by the number of all returned results while recall ($latex MaR$) is the number of correct results divided by the number of results that should have been returned.

Our chosen evaluation metric MaF can then be interpreted as the weighted average of the precision and recall where the Macro F1-score is at best 1 and at worst 0.


On our reduced subset of the Wikipedia dataset, we found that the unadjusted sample performed with a MaP = 0.066, the Multiclass SVM model performed with MaP = 0.644, the simplistic Hierarchical model performed with MaP = 0.224. Although our hierarchical model did not perform as well as expected, we believe with a more significant selection of data or more advanced exploitation of the hierarchical structure (e.g. k = 3), it is possible we could achieve better results; however, because of the constraint on computational power, we were not able to explore that realm.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s