Data points are x_{i} (i=1,...,m). In the applet, they are situated on the 607×400 workdesk.
The first
principal component. 
Principal component analysis (PCA) is the most popular method for data approximation by straight lines and planes, and for dimensionality reduction. PCA was invented by Karl Pearson. In 1901 he wrote: "In many physical, statistical, and biological investigations it is desirable to represent a system of points in plane, three, or higher dimensioned space by the "bestfitting" straight line or plane." The first principal component is the straight line that minimizes the sum of the squared distances from the data points. It is the least squares approximation of the data set by a line. The closest approximation is equivalent to the widest scattering of projections because of the Pythagorean theorem. Therefore, the first principal component maximizes the variance of projections of the data points on the straight line.To find the second principal component, it is sufficient to subtract from data vectors their projections on the first principal component and then find again the best straight line approximation, and so on. In this applet, we use the simple iterative splitting algorithm for calculation of the "bestfitting" straight line.
Two initial nodes, y_{1}, y_{2}, are defined by the user before learning. The data points x_{i} (i=1,...,m) are given.
Further reading
The SelfOrganizing Map (SOM), also known as the Kohonen network, is a computational method for the visualization, lowdimensional approximation and analysis of highdimensional data. It works with two spaces: a lowdimensional space with a regular grid of nodes and the higherdimensional space of data. In our applet, the data space is twodimensional (a plane) and the nodes form a onedimensional array.
Radius 3 linear
Bspline 
Each node n_{i} is mapped into the dataspace. Its image, y_{i} is called the coding vector, the weight vector or the model vector. Rather often, the images of nodes in the data space are also called "nodes", with some abuse of language, and the same notation is often used for the nodes and the corresponding coding vectors. The exact sense is usually clear from the context.
The data point belongs to (or is associated with) the nearest coding vector. For each i we have to select data points owned by y_{i}. Let the set of indices of these data points be C_{i}: C_{i}={l:x_{l}y_{i}^{2}≤x_{l}y_{j}^{2} ∀i≠j} and C_{i} be the number of points in C_{i}.
The learning of SOM is a special procedure that aims to improve the approximation of data by the coding vectors and, at the same time, approximately preserves the neighbourhood relations between nodes. If nodes are neighbours then the corresponding coding vectors should also be relatively close to each other. To improve the approximation of data, each coding vector y_{i} moves during learning towards the mean point of the data points owned by y_{i}. It involves in this movement the neighbours to preserve neighbourhood relations between nodes. This Batch SOM learning rule was proposed by Kohonen. The learning algorithm has one parameter, the "Neighbourhood radius" h_{max}. It is necessary to evaluate the neighbourhood function and this should be assigned before learning. For this applet, the neighbourhood function, h_{ij}, has the simple Bspline form: 1h_{ij}=ij/(h_{max}+1) if ij=h_{max} and h_{ij}=0 if ij>h_{max} (the example is given on the right).
1D SOM as a
broken line. 
The initial location of coding vectors should be assigned before the learning starts. There are three options for SOM initializations:
· The user can the select coding vectors randomly from datapoints;
· The coding vectors may be selected as a regular grid on the first principal component with the same variance;
· The coding vectors may be assigned manually, by mouse clicks.
All learning steps are saved into "Learning history". If learning restarts then the last record of Learning history is used instead of the user defined initial coding vectors.
Before learning, all C_{i} are set to the empty set (C_{i}=∅), and the steps counter is set to zero.
If we connect the coding vectors for the nearest nodes by intervals then SOM is represented as a broken line in the data space. To evaluate the quality of approximation, we have to calculate the distance from the datapoints to the broken line and evaluate the Fraction of variance unexplained (FVU).
Approximation of a spiral by GSOM; 65 nodes,
Approximation of a spiral by GSOM; 152 nodes.
Approximation
of a spiral by GSOM.
All attempts to
approximate a spiral by 
Further reading
The growing selforganizing map (GSOM) was developed to identify a suitable map size in the SOM. It starts with a minimal number of nodes and grows new nodes on the boundary based on a heuristic. At the same time, as a byproduct, we can sometimes get better approximation properties (see an example on the right). There are many heuristics for GSOM growing. Our version is optimized for 1D GSOMs, the model of principal curves. GSOM method is specified by three parameters. Their values should be set before the learning starts:
· "Neighbourhood radius"
this parameter, h_{max}, is used to evaluate the neighbourhood function, h_{ij} (the same as for SOM).
· "Maximum number of nodes"
this parameter restricts the size of the map.
· "Stop when fraction of variance unexplained percent is less than..."
The threshold for FVU value.
The GSOM algorithm includes learning and growing phases. The learning phase is exactly the SOM leaning algorithm. The only difference is in the number of learning steps. For SOM we use 100 batch learning steps after each learning start or restart, whereas for GSOM we select 20 batch learning steps in a learning loop.
The initial location of coding vectors should be assigned before the learning starts. There are three options for GSOM initializations:
· The user can select coding vectors randomly from datapoints;
· The coding vectors may be selected as a regular grid on the first principal component with the same variance;
· The coding vectors may be assigned manually, by mouse clicks. We use n for the number of nodes in the map.
All learning steps (vectors y_{i}) are saved into "Learning history". If learning restarts then y_{i} from the last record of the Learning history are used instead of user defined initial coding vectors.
Beforethe start, set the step counter to zero and the sets of data points associated with nodes to the empty sets, C_{i}=∅.
Further reading
The applet contains several menu panels (top left), the University of Leicester label (top right), work desk (center) and the label of the Attribution 3.0 Creative Commons publication license (bottom).
To wisit the website of the University of Leicester click on the University of Leicester logo.
To read the Creative Commons publication license click on the bottom license panel.

The first menu panel allows you to create and edit the data set. Every data point is displayed as a small circle on the work desk.
The first part of the Data set panel contains one big coloured rectangle and a palette built from six small differently coloured rectangles. To select one of the six colors click on a small rectangle. The big coloured rectangle displays the currently selected color. In this applet, this is just the colour of the cursor. The colours are used for data points in the applets Kmeans and Kmedoids and KNN and Potential Energy to distinguish points from different classes and clusters.
The third part of the Data set panel contains six buttons. The first three buttons change the type of cursor brush.
· The button One point switches the brush to add single points. Every mouse click on the work desk will add a point to the data set.
· The button Scatter allows you to add several points by one mouse click. You can choose the number of added points in the Number of points spinner at the middle part of the menu panel. Points are scattered randomly in a circle which radius is determined by the slider Caliber in the middle part of menu panel.
· The button Erase switches the mouse cursor to the eraser. When you click the eraser mouse cursor on the work desk, all points, whose centres are covered by the cursor, are removed from the work desk.
· The button Select is not used in this applet.
· When you press the Random button several points are added on the work desk. The number of points is defined by the Number of points spinner. The locations of new points are generated randomly on whole work desk.
· The button Clear all completely clears the work desk. It removes all kinds of object: data points, centroids, test results, maps and so on.
The All methods menu panel containes two parts.

The right part, Learning history, contains five buttons, one field for data input and two information fields.
· The leftmost button shows the first step. The second button moves to the previous step.
· The third button starts the slide show with delay between steps selected in editable field Delay (ms). It goes always from the first step to the last one.
· The fourth button moves to the next step. The rightmost button shows the last step. The information field Step # shows the number of the current step.
· The information field from indicates the total number of learning steps in the learning history.
The left part of all methods menu panels contains the name of the method, two fields to show the number of nodes, and six standard buttons.

· The Random init button serves to add a coding vector at randomly selected data point to the current set of coding vectors.
· The PCA init button serves to add a new coding vector on the first principal conponent line. All coding vectors are distributed on the first principal component line equidistantly. The variance of this equidistribution on a segment is selected equal to the total variance of the data points.
· The Remove initialization button removes all nodes.
· The Options button opens options dialog for customizing method parameters. This dialog is specific for every method.
· The Learn button starts the learning process, specific for every method.
· The Show error/Hide error button shows/hides the tableau, which contains the names of the methods and the corresponding errors (Fraction of variance unexplained,). The tableau is located at the right bottom corner of the work desk.
Use the menu panel PCA to work with principal component analysis.

For PCA, all points and lines are in a blue color. The points for PCA have the form of fourray stars (see left). To start PCA learning two points are needed. You can select initial points by mouse clicks at choosen locations, or by the button Random initialization. When you try add the third point, the first point is removed.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results. You will see one fourray star, the centroid of the data set, and one line, the current approximation of the first principal component.
Use the menu panel SOM to work with self organizing maps.

For SOM, all points and lines are in a red color. The points for SOM have the form of nineray stars (see left). To start SOM learning, the initial positions of all coding vectors are needed. To add a point to an initial approximation, click the mouse button at the choosen location, or use the button Random init or the button PCA init. When you add a new point, it is attached to the last added point.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results.
Use the menu panel GSOM to work with growing self organizing map.

For GSOM, all points and lines are in a green color. The points for GSOM have the form of nineray stars (see left). To start learning, GSOM is needed to initiate one point. To place initial point click the mouse button in the chosen location, or push the button Random init or the button PCA init. When you add a new point, the new point is joined with the last added point.
After learning is finished, use the Learning history buttons at the right part of menu panel to look at the learning history and results.
The menu panel Std. data is needed to work with some standard data set.

· The first button Clear all comletely clears the work desk. It removes all kinds of objects: data points, coding vectors and so on.
· The button Scatter serves to scatter existing data points. The scattering procedure adds a specified number of new data points in a specifyed circle around every data point, which exists. Options of scattering are defined in dialog (see left).
· The Number of points to add parameter defines the number of points to add to every data point.
· The Scattering radius parameter defines the radius of the circle around each data point where these new data points should be equidistributed.
· The last button Save data saves all data point locations to a text file named Data for mirkes.txt. In order to add this data set to the Std. data panel file with a short description you needed to send this file by email to Evgeny Mirkes, emmirkes@gmail.com. Warning: To use the button Save data you should change settings of you browser security system, or start this applet offline without brouser.
Any other button sends a stored ("standard") image to the work desk.
We approximate data by a straight line (PCA) or by a broken line (SOM and GSOM). The dimensionless least square evaluation of the error is the Fraction of variance unexplained (FVU). It is defined as the fraction:
[The sum of squared distances from data to the approximating line line/The sum of squared distances from data to the mean point].
The distance from a point to a straight line is the length of a perpendicular dropped from the point to the line. This definition allows us to evaluate FVU for PCA. For SOM and GSOM we need to solve the following problem. For the given array of coding vectors {y_{i}} (i=1,2, ... m) we have to calculate the distance from each data point x to the broken line specified by a sequence of points {y_{1}, y_{2}, ... y_{m}}. For the data point x, its projection onto the broken line is defined, that is, the closest point. The square of distance between the coding vector y_{i} and the point x is d_{i}(x)=xy_{i}^{2} (i=1,2, ... m).
Projection of a data point onto a segment 
Let us calculate the squared distance from the data point x to the segment [y_{i}, y_{i}_{+1}] (i=1,2, ... m1). For each i, we calculate l_{i}(x)=(xy_{i} , y_{i}_{+1}y_{i} ) /y_{i}_{+1}y_{i}^{2} (see left).
If 0<l_{i}(x)<1 then the point, nearest to >x on the segment [y_{i}, y_{i}_{+1}], is the internal point of the segment. Otherwise, this nearest point is one of the segment's ends.
Let 0<l_{i}(x)<1 and >c be a projection of x onto segment [y_{i}, y_{i}_{+1}]. Then c  y_{i}^{2}=(l_{i}(x)y_{i}y_{i}_{+1})^{2} and, due to Pythagorean theorem, the squared distance from x to the segment [y_{i}, y_{i}_{+1}] is r_{i}(x)=xy_{i}^{2}(l_{i}(x) y_{i}y_{i}_{+1})^{2}.
Let d(x)=min{d_{i}(x)  i=1,2, ... m} and r(x)=min{r_{i}(x)  0< l_{i}(x) <1, 0>< i< m }. Then the squared distance from x to the broken line specified by the sequence of points {y_{1}, y_{2}, ... y_{m}} is D(x)=min{d(x), r(x)}.
For the given set of the data points x_{j} and the given approximation by a broken line, the sum of the squared distances from the data points to the broken line is S=∑_{j}D(x_{j}), and the fraction of variance unexplained is S/V, where V=∑_{j}(x_{j}  X )^{2} and X is the empirical mean: X=(1/n)∑x_{i}.