In the above example, by looking at the features extracted from different robot pictures, we can say that the shape of the head is a distinguishing dimension. However, we cannot say the same for the eyes, since they look very similar.
An embedding is a mapping from discrete values (e.g. words, observations) to vectors. You can use different embedding techniques to visualize a lower-dimensional representation of your data set. Embeddings can have hundreds of dimensions, so the common way to understand them is to project them into two or three dimensions. They are useful for many things:
You can use them to explore the local neighborhoods. You may want to explore the closest points to a given point to make sure that those points are related to each other. You can select those points and do further analysis on them. You can use them to understand the behavior of your model.
You can use them to analyze the global structure. You may want to find groups of points. This can help you to find clusters and outliers in your data set.
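To make the idea of exploring a local neighborhood concrete, here is a minimal sketch that finds the closest points to a given point in an embedding. It assumes the embeddings are rows of a NumPy array and uses cosine similarity as the closeness measure; the function name nearest_neighbors and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, k=5):
    """Return the indices of the k points most similar to embeddings[query_index]."""
    # Normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit[query_index]
    similarity[query_index] = -np.inf  # exclude the query point itself
    return np.argsort(-similarity)[:k]

# Hypothetical toy data: two well-separated groups of 50-dimensional vectors.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.0, scale=0.1, size=(10, 50))   # rows 0..9
group_b = rng.normal(loc=-1.0, scale=0.1, size=(10, 50))  # rows 10..19
embeddings = np.vstack([group_a, group_b])

neighbors = nearest_neighbors(embeddings, query_index=0, k=5)
print(neighbors)  # every index belongs to the first group (rows 1..9)
```

If the neighbors of a point turn out to be unrelated to it, that is a hint that either the embedding or the model behind it deserves a closer look.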
There are many methods for obtaining an embedding:
Principal component analysis (PCA): This is an effective algorithm to reduce the dimensionality of your data, especially when there are strong linear relationships among variables. It is a linear, deterministic algorithm that captures the variation in the data in as few dimensions as possible.
It can be used to highlight the variations and eliminate dimensions. If you want to interpret the data, you can retain the first few principal components, which account for a significant amount of the variation. The remaining principal components account for only trivial amounts of variance, so they do not need to be retained for interpretability and analysis purposes.
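As an illustration of how the first few principal components can capture most of the variation, the sketch below computes PCA with a singular value decomposition in NumPy. The helper name pca and the synthetic data (five columns driven by a single hidden factor plus a little noise) are assumptions made for this example, not part of the original analysis.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first principal components and report variance ratios."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    projected = X_centered @ Vt[:n_components].T
    return projected, ratio

# Hypothetical data: 5 columns that are noisy linear functions of one factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
weights = np.array([[1.0, 2.0, -1.0, 0.5, 3.0]])
X = latent @ weights + rng.normal(scale=0.05, size=(200, 5))

projected, ratio = pca(X, n_components=2)
print(ratio.round(4))  # the first component carries almost all of the variance
```

Because the columns here are strongly linearly related, a single component is enough; on real data you would inspect the variance ratios before deciding how many components to keep.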
T-distributed stochastic neighbor embedding (t-SNE): t-SNE is a dimension reduction algorithm that tries to preserve local neighborhoods in the data. It is nonlinear and nondeterministic. You can choose to compute 2D or 3D projections. t-SNE can find structures that other methods may miss.
It is very useful for visualizing and interpreting data sets if you know how to use it, but there are many things you need to be careful about. While preserving the local structure, it may distort the global structure. If you want more information about the pitfalls to avoid with t-SNE, there is a great article at distill.pub titled "How to Use t-SNE Effectively". It is well worth reading.
Using t-SNE embeddings can help you to reduce the dimension of the data and find structures. However, if you have a very large data set, understanding the projections can still be hard. You may want to check the geometry of the data to get a better understanding of the data set.
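To make the t-SNE discussion concrete, here is a minimal sketch of a 2D projection. It assumes scikit-learn is installed and uses its TSNE class; the synthetic three-cluster data and the perplexity value are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: three clusters of 30 points each in 20 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([center + rng.normal(size=(30, 20)) for center in centers])

# perplexity must be smaller than the number of samples; fixing random_state
# makes repeated runs in the same environment comparable.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (90, 2)
```

Since the algorithm is nondeterministic and sensitive to perplexity, it is usually worth running it with several settings and seeds before reading anything into cluster sizes or the distances between clusters.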
Topological data analysis (TDA)
Topology is the field that studies the geometric features that are preserved when we deform an object without tearing it. Topological data analysis provides tools to study the geometric features of data using topology. This includes detecting and visualizing those features and the statistical measures related to them. Here, geometric features can be distinct clusters, loops and tendrils in the data. For example, if the topological network built from your data contains a loop, it suggests a pattern that recurs periodically in the data set.
Mapper algorithms in TDA are very useful for data visualization and clustering. You can create topological networks of your data set in which the nodes are groups of similar observations and an edge connects two nodes if they have an observation in common.
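The construction of such a network can be sketched in a few lines. The code below is a simplified, hypothetical version of a Mapper-style algorithm, not a reference implementation: it assumes a one-dimensional filter function, covers its range with overlapping intervals, clusters the points in each interval with a basic distance threshold (eps), and connects clusters that share observations. All function names and parameter values are illustrative.

```python
import numpy as np

def single_linkage_clusters(points, eps):
    """Label points by connected components of the 'closer than eps' graph."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            # Visit every still-unlabeled point within eps of point i.
            for j in np.flatnonzero((dists[i] < eps) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def mapper_graph(X, filter_values, n_intervals=4, overlap=0.25, eps=0.2):
    """Return (nodes, edges): nodes are sets of row indices, edges link nodes sharing rows."""
    lo, hi = filter_values.min(), filter_values.max()
    width = (hi - lo) / n_intervals
    pad = overlap * width  # how far each interval extends into its neighbors
    nodes = []
    for i in range(n_intervals):
        start, stop = lo + i * width - pad, lo + (i + 1) * width + pad
        members = np.flatnonzero((filter_values >= start) & (filter_values <= stop))
        if len(members) == 0:
            continue
        labels = single_linkage_clusters(X[members], eps)
        for label in np.unique(labels):
            nodes.append(frozenset(members[labels == label].tolist()))
    edges = {(a, b)
             for a in range(len(nodes))
             for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges

# Toy example: 100 points on a circle, filtered by their x coordinate.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
nodes, edges = mapper_graph(circle, filter_values=circle[:, 0])
print(len(nodes), len(edges))  # 6 6
```

For this circle, every node ends up connected to exactly two others, so the network forms a single loop, which mirrors the periodic structure of the underlying data. The choice of filter function, interval count, overlap and clustering threshold all influence the resulting network, so they should be treated as parameters to explore.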
Understanding and interpreting data is a crucial step in machine learning. In this blog post, we tried to provide an overview of techniques that can help you get to know your data better.
Depending on the size, dimension and type of your data, you can choose the appropriate algorithm. For instance, when you have a large amount of raw data, you can use representative examples instead of random samples. If you have a wide data set, you can also find the important dimensions to understand the representative samples.
Different techniques can give you different insights about your data. It is your job to use the tools to solve the mystery like a detective.
This is the second post in our interpretability series. In future posts we’ll cover interpretability techniques for understanding black box models, and we’ll look at recent advances in interpretability.
This article has been republished from SAS.