by Douglas Sherman
To get the Math Theorems, we scraped the Wikipedia page List of Theorems. From there we recognized that there was a standard mw-content-ltr class in which the body of each page was located. The following scrapes the List of Theorems page, removes the header and footer, and then builds a data frame by scraping the pages at the links of each of the Theorems.
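A minimal sketch of that scrape is below, assuming requests and BeautifulSoup; the HEADER/FOOTER trim counts and the omitted field parsing are illustrative placeholders, not the exact values used.

import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = 'https://en.wikipedia.org'
HEADER, FOOTER = 20, 40   # assumed number of non-theorem links to trim from each end

list_soup = BeautifulSoup(requests.get(BASE + '/wiki/List_of_theorems').text, 'html.parser')
body = list_soup.find('div', class_='mw-content-ltr')    # standard Wikipedia body container
links = body.find_all('a')[HEADER:-FOOTER]               # drop header and footer links

rows = []
for a in links:
    if not a.get('href', '').startswith('/wiki/'):
        continue
    thm_soup = BeautifulSoup(requests.get(BASE + a['href']).text, 'html.parser')
    thm_body = thm_soup.find('div', class_='mw-content-ltr')
    if thm_body is not None:
        rows.append({'Title': a.get_text(), 'html': str(thm_body)})

theorems = pd.DataFrame(rows)   # parsing of the Field for each entry is omitted here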
theorems[0:3]
Once we had the html of each of the theorems, we needed to clean this text substantially in order to analyze it. To do this we first replaced any unicode titles with their string equivalents, then we trimmed any theorems that were part of obscure fields of Mathematics (for example Lie Algebra, Metric Geometry, or Elliptic Differential Equations). Still, we couldn't train on the html data as is. First, this data has all the html formatting tags such as <p> and <div>. Beautiful Soup removes these via the .get_text() function. This is demonstrated in the Trimmed column of the theorems data frame.
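For example, a one-liner along these lines populates the Trimmed column (a sketch, reusing the theorems frame built above):

from bs4 import BeautifulSoup

# Strip <p>, <div>, and the rest of the markup, keeping only the visible text
theorems['Trimmed'] = theorems['html'].apply(
    lambda h: BeautifulSoup(h, 'html.parser').get_text())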
Second, this is Mathematics and as such each page contains a substantial amount of Math. Not only is this math not easily translated into words, it is written in the standard TeX formatting, which looks like \forall x \in \mathbb{R} and produces ∀x ∈ ℝ. To fix this we first removed the html formatting with get_text(), then used regular expressions to remove any brackets or other formatting that is present in a standard TeX script. However, this still left the above expression as forall x in mathbb R. Then, to remove the extraneous words that were part of the TeX commands, we scraped a LaTeX glossary of commands and added these words to our collection of stopwords (used later for classification and word clouds). This glossary can be found here.
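A sketch of the TeX cleanup and the stop-word extension; the regular expression and the glossary parsing (tex_words) are simplified stand-ins for the actual patterns used.

import re
import nltk

def strip_tex(text):
    # Drop TeX punctuation, so '\forall x \in \mathbb{R}' becomes 'forall x in mathbb R'
    return re.sub(r'[\\{}\[\]\$\^_&~]', ' ', text)

theorems['Trimmed'] = theorems['Trimmed'].apply(strip_tex)

# tex_words: LaTeX command names scraped from the online glossary (scrape not shown)
stop_words = set(nltk.corpus.stopwords.words('english')) | set(tex_words) | {'math'}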
Next we used the nltk package to build word clouds and bar graphs of the most common words across all theorems in a given field of Mathematics. The stop words contained words like which, the, and math, and, most importantly, all the TeX words scraped from the online glossary.
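The per-field counts behind these plots could be computed along these lines (a sketch; plotWordClouds itself is not reproduced, and the Field column and stop_words set are assumed from above):

from nltk import FreqDist
from nltk.tokenize import word_tokenize

def top_words(field, n=20):
    # Pool every theorem in the field, drop stop words, and count what remains
    text = " ".join(theorems.loc[theorems['Field'] == field, 'Trimmed'])
    tokens = [w.lower() for w in word_tokenize(text)
              if w.isalpha() and w.lower() not in stop_words]
    return FreqDist(tokens).most_common(n)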
plotWordClouds()
Some of the fields' most common words are not very surprising, such as distribution for Statistics or formula for Mathematical Logic. However, the Physics field produced quantum as the most common word, which indicates how important quantum mechanics is to Physics proofs. Also, in Measure Theory we see Lebesgue, as in the Lebesgue measure; this measure is so important to the field that it trumps words like probability and function.
Next we measured the similarity of each field of Mathematics based on each field's most common words. First we built a dictionary that looked at each word among all theorems and found which fields shared that particular word among their top words. This dictionary is then turned into a data frame showing how many fields have each Key as part of their top words. Below is a list of the ten most shared words among the fields and how many fields share that particular word, or Key.
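A sketch of that dictionary and the resulting data frame, reusing top_words from above (fields is the list of field names):

import pandas as pd
from collections import defaultdict

shared = defaultdict(list)
for field in fields:
    for word, count in top_words(field):
        shared[word].append(field)

# How many fields have each Key among their top words, most shared first
shared_words = (pd.DataFrame({'Key': list(shared.keys()),
                              'Fields': [len(v) for v in shared.values()]})
                  .sort_values('Fields', ascending=False)
                  .reset_index(drop=True))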
shared_words[0:10]
Once we had these shared words, we could compute a similarity metric between the fields of Mathematics. The metric we selected is as follows. For $N$ total fields and $M$ total words, the similarity matrix for the fields of Mathematics, $S$, is given by

$$S_{ik} = \alpha \sum_{j=1}^{M} \mathbb{1}\left[w_j \in W_i\right]\,\mathbb{1}\left[w_j \in W_k\right],$$

where $W_i$ is the set of top words for Field $i$, $w_j$ is the $j$th word, $\mathbb{1}[\cdot]$ is the indicator function, and $\alpha = 1/\min(|W_i|, |W_k|)$ is a scaling factor to force $S_{ik} \in [0,1]$. With this metric we get values in $[0,1]$, and the similarity is scaled by how many words are in each field. Below computes that similarity matrix and displays it as a graph of intensity.
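A sketch of that computation, following the metric as reconstructed above (taking α as 1/min(|W_i|, |W_k|) is one plausible reading of the scaling):

import numpy as np

top = {f: set(w for w, _ in top_words(f)) for f in fields}   # top-word set per field
N = len(fields)
sim = np.zeros((N, N))
for i, fi in enumerate(fields):
    for k, fk in enumerate(fields):
        # Count shared top words, scaled so a field is perfectly similar to itself
        sim[i, k] = len(top[fi] & top[fk]) / float(min(len(top[fi]), len(top[fk])))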
import pandas as pd
import seaborn as sns

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Label the similarity matrix by field and draw it as a heatmap
sim = pd.DataFrame(sim, index=fields, columns=fields)
sns.heatmap(sim, cmap=cmap, vmax=1, cbar=True,
            square=True, xticklabels=2, yticklabels=2,
            linewidths=.5)
max_sim[0:10][['Max','Field_1','Field_2']]
The similarity matrix shows that there were a few pairs of fields that were very similar, such as Graph Theory and Model Theory, but the majority of pairs are blue, with a similarity of less than 0.5. Some not so intuitive results are that Physics and Number Theory were very similar, as well as Geometry and PDEs. Group Theory is essentially Algebra, so for it to be paired with Algebraic Geometry was not very surprising.
Next we built a classifier to predict the field of Mathematics that a theorem belongs to. We tested using either a bag of words or just the top words of each field as features, with both Multinomial Naive Bayes and SVM models.
Using a bag of words as the feature set, we checked both the Multinomial Naive Bayes (MNB) and SVM models. Since the SVM is a binary classifier, it fits multiple classes in a one-vs-all fashion: it checks each theorem against each field and returns the most likely field.
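Roughly, the two models could be set up with scikit-learn pipelines like the following; the vectorizer choices and parameter values are illustrative, and test_clf is the helper that handles the train/test split and scoring.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Multinomial Naive Bayes on bag-of-words counts
NB = Pipeline([('vect', CountVectorizer(stop_words=list(stop_words))),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB())])

# Linear SVM trained with SGD; scikit-learn handles the one-vs-all splits internally
SVM = Pipeline([('vect', CountVectorizer(stop_words=list(stop_words))),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                      n_iter=5, random_state=42))])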
temp = test_clf(new_duplicate, NB, None)
print 'MNB:', {x: round(y, 4) for x, y in temp.items()}
svm_cm = test_clf(new_duplicate, SVM, None)
print 'SVM:', {x: round(y, 4) for x, y in svm_cm.items()}
The accuracy was significantly lower than desired, but the SVM model was clearly better than the MNB model. We then tuned the SVM model's hyper-parameters using a grid search in order to produce the best model.
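A sketch of that grid search over the hyper-parameters discussed below; the grid values are illustrative, and train is assumed to be the same training split used by test_clf.

from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions

param_grid = {'clf__alpha': [1e-2, 1e-3, 1e-4],
              'clf__n_iter': [5, 10, 50],
              'clf__random_state': [0, 21, 42]}
gs = GridSearchCV(SVM, param_grid, cv=5, scoring='accuracy')
gs.fit(train.data, train.target)
print 'Best parameters:', gs.best_params_, 'accuracy:', gs.best_score_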
plotGridSearch()
Next we tried running the models on just the top words of each field rather than the whole bag of words. Since this introduces a new hyper-parameter, top_x, which dictates how many of the top words are used in the model, we swept over this hyper-parameter to trace out ROC curves. We built both ROC and Precision-Recall curves and plotted the accuracy for the various values of top_x.
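One way the top_x sweep could be wired up for the MNB model (a sketch; the SVM version is analogous, and train/test are the same splits as before):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

accuracies = []
for top_x in range(1, 31):
    # Restrict the vocabulary to the top_x most common words of each field
    vocab = set()
    for f in fields:
        vocab |= set(w for w, _ in top_words(f, n=top_x))
    model = Pipeline([('vect', CountVectorizer(vocabulary=sorted(vocab))),
                      ('clf', MultinomialNB())])
    model.fit(train.data, train.target)
    accuracies.append((top_x, model.score(test.data, test.target)))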
plotMNB()
We definitely see improvement over the bag of words model, but the area under the ROC curve is abysmal and the accuracy could be better.
After determining the best values for rand_state, n_iter, and alpha, we ran the tuned SVM model on just the top words of each field, sweeping the top_x hyper-parameter to build the ROC curve.
plotSVM()
Here we see that the accuracy and ROC curve are far superior with the top-words SVM model over the MNB model. This could be partly due to the fact that MNB relies on the features (words) being independent of each other, which is not necessarily a good assumption, because a theorem's choice of wording can affect all the other words in the theorem. For example, a linear algebra theorem could discuss a function as a map or an operator depending on whether the theorem concerns the linearity of functions, a map between two vector spaces, or a Hermitian operator. In all these cases the theorem should clearly be classified as Linear Algebra, but the wording is vastly different. The SVM model makes no such assumption, and it quickly weights non-informative features to zero, leaving only the most informative words playing a role in the classification. This is seen as the accuracy quickly rises to nearly 100% at only about 7 top words, while the MNB model needs 27 to get close to the same accuracy. This is also reflected in the area under the ROC curve (AUC). For MNB this is only 0.091, significantly smaller than the ideal 1.0, while the SVM model gets much closer at 0.711. The reason for MNB's low AUC is that recall just never makes it past 20%: the recall, or true positive rate, is very low for MNB. Since this value is computed by treating each class in turn as positive, with the remaining classes set to negative, this demonstrates the lack of granularity in the MNB model. For each of these recall calculations we have only a handful of positive samples and mostly negative ones, so the MNB model simply lumps everything in with the negative majority instead of tweaking its weights to handle such a small number of positives. Both models had very low fallouts (false positive rates), so a misclassified positive was very unlikely. This again is most likely due to the one-vs-all method of multi-class classification; it was so unlikely to label anything as positive that it was even more unlikely to label the wrong sample as positive.
Following an article on the most "Beautiful" theorems of Mathematics from www.quora.com (https://www.quora.com/Which-are-the-most-beautiful-mathematical-theorems-and-why), the following illustrates some traits of these theorems as a collection. Quora asked people to consider the following when selecting theorems for the list:
Category | Description |
---|---|
Generality | it is applicable to a wide variety of problems. |
Succinctness | it is expressible simply, in only a few words or equations. |
Originality | it expresses a surprising mathematical insight, or a connection between different areas of mathematics, that had not previously been widely suspected. |
Significance | it represents an important advance in mathematical knowledge, or resolves an important mathematical problem. |
Potency | it stimulates many new areas of mathematical research. |
Centrality | it is used in the proofs of many subsequent theorems. |
Independence | its proof depends on only a small number of previously established theorems, and preferably none. |
The theorems that were selected were then extracted from the Quora page as follows.
For this web page, the theorems and the categories they satisfy from the above list are bolded. Thus by scraping the page for the <b> tag we can get the relevant information after trimming the header (the first 7 bolded lines). Moreover, to make the theorem titles match the pre-existing theorems data frame, we had to do some serious trimming. This is done by first correcting any spelling errors, i.e. theoem instead of theorem, and stripping trailing white spaces as well as 'The' from each title.
Even after trimming the titles, there were still some that were named differently than in our data frame. For example, Residue theorem was called Cauchy's Residue theorem. So we had to correct these titles before searching our data frame for their text. After fixing the titles we could join this list of Beautiful theorems, with their associated "reasons" for being beautiful, with our pre-existing data frame.
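A sketch of the title cleanup; quora_soup is assumed to be the parsed Quora page, and the rename map shows only the Residue theorem example mentioned above.

# Titles and reasons are bolded on the page; skip the first 7 <b> tags (the header)
bold = [b.get_text() for b in quora_soup.find_all('b')][7:]

def clean_title(t):
    t = t.replace('theoem', 'theorem').strip()   # fix spelling slips and trailing spaces
    if t.startswith('The '):
        t = t[4:]                                # drop a leading 'The'
    return t

# Map names that differ from the Wikipedia titles back onto our theorems frame
renames = {"Cauchy's Residue theorem": 'Residue theorem'}
titles = [renames.get(clean_title(t), clean_title(t)) for t in bold]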
BeautifulThms[0:5]
Then we plotted the most common words in all of these Beautiful Theorems.
plotBeautifulWordClouds()
There isn't much that can be gained from these plots, but one interesting feature is that the word "proof" was by far the most common word, nearly doubling second place. This is unsurprising when you consider that a theorem must be proven, but more importantly this word didn't show up in the other word clouds. Thus this likely demonstrates that there is minimal similarity between the Beautiful theorems, as it is a stretch to find a word that is shared between them.
To test how similar these theorems were, we computed the similarity between them just as before. Since this is from theorem to theorem, there are far fewer words to compute on, but we were still able to identify very similar theorems and very dissimilar theorems.
BeautySim[0:5]
Many of the theorems in this class are very similar, such as those from complex analysis, number theory, and mathematical logic. These fields are considered part of the backbone of formal mathematics, which could explain why these theorems were selected as "Most Beautiful". Moreover, being the backbone, and since independence was a category for selection, these theorems don't rely on pre-existing Mathematics; this leaves a limited vocabulary for building these theories and makes their phrasing similar.
Next we looked at the most common "reasons" for beauty among the theorems. We plotted two sets of plots: one for all the theorems, and one for just the most similar theorems.
SimilarReasons()
There weren't many words to begin with, but there is a clear divide between the two sets as far as distributions go. For the second plot, significance started to play a much larger role than in the first. Moreover, originality fell from importance in the most similar categories. This makes sense, since if they were too original, they wouldn't be labeled as similar.
We then used our trained SVM model to learn more about the beautiful theorems. First we checked the accuracy of the model in correctly predicting their fields, and then we checked what field of Mathematics would be predicted by combining all the beautiful theorems into one.
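A sketch of that check, reusing the test_clf helper and the tuned SVM from earlier; beauty_df is assumed to hold the joined Beautiful theorems in the same format as the other splits.

# Score the tuned SVM on just the Beautiful theorems
beauty_svm_cm = test_clf(beauty_df, SVM, None)

# Per-theorem field predictions, for inspection
beauty_pred = clf.predict(beauty_df.data)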
beauty_svm_cm
Here we see that the accuracy, precision, sensitivity, and specificity are all much lower than the mean accuracy for the SVM model. This could be because there is a distinct difference between these "Beautiful" theorems and the standard, formulaic theorems as a whole. Succinctness and originality were both categories in this list, and both play against our model by not providing enough words, and by not providing words similar to other theorems in their respective fields. Moreover, theorems like the Fundamental Theorem of Algebra are about Algebra, but because they are the building blocks from which all the algebraic tools are built, their proofs instead rely on arithmetic and number theory.
## Classify all the Beauty Theorems into one
import sklearn.datasets

beauty_all = sklearn.datasets.base.Bunch(
    description="All Beautiful",
    filenames="",
    target_names="All",
    data=[" ".join(beauty_df.data)],
    target=-1)
pred_all = clf.predict(beauty_all.data)
{x: y for x, y in enumerate(fields) if x == pred_all[0]}
Finally, the overarching field that best represents all the Beautiful theorems as a whole is Mathematical Logic. This is fitting, as all of Mathematics pulls from this field, so Mathematical Logic would likely be a strong contender in a more subjective classification of the most beautiful theorems. Moreover, the independence criterion for selecting a Beautiful theorem leaves Mathematical Logic as the most likely field to pull from, as it is the foundation of all other mathematics.
In the pre-processing step, we removed any theorems that were a part of some obscure field of Mathematics (per Wikipedia). The following will try to re-classify those Theorems into more appropriate fields so that they can be included with their fellow theorems. For example the field Quantum Theory really should be lumped in with Physics, and 'Several Complex Variables' is just Complex Analysis.
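A sketch of that re-classification with the tuned classifier, assuming the dropped theorems kept their cleaned text in a Trimmed column:

# Predict a better-fitting field for each theorem from the obscure categories
pred = clf.predict(removed_thms['Trimmed'])
removed_thms['Predicted'] = [fields[p] for p in pred]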
removed_thms.iloc[[1, 2, 5, 266, 279, 286]+range(10,15)][['Field','Title','Predicted']].sort_index()
From the table above we see that the predicted fields sometimes do not agree with what we would guess subjectively. It is not surprising that subjects like Lie Algebra, Queuing Theory, and Mathematical Series get predicted as Abstract Algebra, Stochastic Processes, and Analysis, respectively. However, the classification of fields like Axiom of Choice, Neural Networks, and Quadratic Forms as Model Theory, Partial Differential Equations, and Number Theory didn't agree with our first guesses of their parent fields. This demonstrates that this isn't a classification of the field itself, but of the particular theorem that Wikipedia filed under that original field. For example, the Neural Network theorem was the "Universal approximation theorem", which is described as follows:
"In the mathematical theory of artificial neural networks, the universal approximation theorem states[1] that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of , under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.
One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions.[2]
Kurt Hornik showed in 1991[3] that it is not the specific choice of the activation function, but rather the multilayer feedforward architecture itself which gives neural networks the potential of being universal approximators. The output units are always assumed to be linear. For notational convenience, only the single output case will be shown. The general case can easily be deduced from the single output case."
~Universal Approximation Theorem
Now the wording of this quote starts to provide insight into why this was classified as a PDE theorem. This suggests some valuable uses for the prediction algorithm. First of all, the publisher of these theorems may only have thought of the single intended use of the theorem, and using this prediction model we can find other fields of Mathematics that these theorems support. Moreover, if someone studies a very specific Mathematical idea, predicting which field it falls under opens up a source of other similar ideas within that field. Along the same lines, this allows publishers of Mathematics papers to consider who else might be interested. For example, the Neural Networks researcher would realize that PDE researchers might be interested and publish in an appropriate journal to reach them.
Please visit the Full Page to see the actual calculations of the above summary.