Upload
The IMGT/V-Quest server provides you with a lot of useful data and annotation. However, you might want to obtain additional information, which is what the "Calculation" branch implements. On the left-hand side, you can upload your data arranged in a spreadsheet by clicking "Browse". You may have obtained it from the "IMGT/V-Quest" branch of this server or from any other source - as long as you specify the format settings properly (e.g. whether the separator is a comma or a tab), you should be able to load any well-defined spreadsheet. For this tutorial, you should now load the example input we have prepared for you. You can either download it first and upload it manually afterwards, or load it directly by clicking on the respective link. For most tasks, you will see a progress bar that monitors the current status of the calculation. On the right-hand side, you see the contents of the uploaded file. Note that the (formerly) grey buttons at the top have now been enabled. You may now proceed to the "Extract" tab.
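If you would like to check or reproduce the parsing outside the server, the same settings map directly onto R's reading functions. The following is only a minimal sketch with a hypothetical file name; the point is that the separator and quote settings must match your file, just as in the upload form.

    # comma-separated file with double-quoted strings (hypothetical file name)
    data <- read.csv("my_repertoire.csv", sep = ",", quote = "\"", stringsAsFactors = FALSE)
    # tab-separated alternative
    data <- read.delim("my_repertoire.tsv", sep = "\t", quote = "\"", stringsAsFactors = FALSE)
    head(data)   # inspect the first rows, as shown in the right-hand table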
Extract from columns
Sometimes the format of the data in a column is not very convenient. Consider the "V.GENE.and.allele" column, for example. One important piece of information it holds is the V-gene family, which would be "IGHV1" in the observations shown here. However, IMGT/V-Quest also states the species, which is Homo sapiens in this example, as well as the gene and the actual allele used. We might want to compare properties of the V-genes grouped only by their family, so we might want to extract the family from the column and store it in a new one. For that purpose, "BRepertoire" provides this tab. On the left-hand side you see a drop-down menu where you can select the original column you want to process. Please select "V.GENE.and.allele" now. You can use either the "String indices" method or the "Split string" method. Let us first have a closer look at the "String indices" method. With the slider, you can set the start and end positions you would like to cut out. Note that these indices are inclusive and that entries shorter than your selected range will produce an "NA" (not-assigned) value. Please use the slider to set the range to 8 to 12. Another remark: the string "Homsap" that we skip has only 6 letters, but the following blank character or space is counted as well, which is why we do not start at 7. Finally, set a name for the new column, which will be appended at the very end of the table. In this tutorial, we will set it to "Vfamily". Now, please click on "Extract" to start the process. Thus, we have extracted the proper section from the strings in column "V.GENE.and.allele" and deposited it in our new column. However, what if we want to extract parts of different lengths? Imagine a dataset in which you also have an "IGHV11" entry - by the method applied above, this would be cut to "IGHV1". For that reason, another, more sophisticated method is provided. Select the column again and set the method of extraction to "Split string". This interface allows you to set one or multiple separators by which the strings are split and to select a fragment by its number. In our case, we would like to split by a blank character, because that is what separates "Homsap" from the V-gene, and by a "-" sign, because that is what separates the V-family from the gene and allele information. Note that every occurrence of the specified separators, which can be strings as well, is used for splitting. Now we need to select the proper fragment; in our case it will be the second one. We need to change the name of the new column, because we have already generated a column with the name "Vfamily". Let us call it "Vfamily2" for now. Again, if you click "Extract", the calculation starts and the result is appended at the end. With this you have completed the extraction tutorial and you may now proceed to the property calculations.
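Before moving on, here is a minimal R sketch of the two extraction methods just described, assuming the table has been read into a data frame called data (this is an illustration, not the server's own code):

    # "String indices" method: inclusive positions 8 to 12
    data$Vfamily <- substr(data$V.GENE.and.allele, 8, 12)

    # "Split string" method: split at every blank and every "-", keep the second fragment
    parts <- strsplit(data$V.GENE.and.allele, split = "[ -]")
    data$Vfamily2 <- sapply(parts, function(x) x[2])

Note that, unlike the server, substr() silently returns a shortened string (rather than NA) for entries shorter than the selected range.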
Calculation of physico-chemical properties
For every framework region or CDR in your dataset, you can calculate features such as hydrophobicity, length in amino acids, the fractions of charged, polar, aromatic and other amino acid classes, or the Kidera factors, to describe your sequences on a more basic, chemical level. These properties will be added as additional columns at the end of the table. For this tutorial and the associated test input, only one column holding amino acid sequences is available, namely "CDR3.IMGT", the CDR3 of the heavy chain. When you click on "Calculate", the processing starts and the status is shown by a progress bar. Depending on the size of your dataset, this might take a while, so please hang on and do not close the browser in the meantime. After completion, you can download the table by clicking on "Download data" and use this file as input for the data analysis branch of BRepertoire if you like. You can also import CSV files into programs like Microsoft Excel and LibreOffice Calc.
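If you prefer to compute such descriptors offline, the CRAN package "Peptides" offers comparable functions. The sketch below is an approximation and not necessarily the implementation used by the server; it continues the data frame from the sketches above, and the output column names are our own choice:

    library(Peptides)
    seqs <- data$CDR3.IMGT                        # amino acid sequences of the heavy-chain CDR3
    data$Pepstats_length <- lengthpep(seqs)       # number of amino acids
    data$Hydrophobicity  <- hydrophobicity(seqs)  # Kyte-Doolittle scale by default
    kf <- do.call(rbind, kideraFactors(seqs))     # ten Kidera factors per sequence
    data <- cbind(data, kf)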
Clonotype clustering
During the activation of immune cells, some clones are subject to expansion (for example in response to a vaccination), while others are not and therefore remain comparatively rare. An analysis that simply takes all sequences into account will therefore, in many cases, be skewed towards these dominant clones. "Clonotype clustering" can be applied to identify these clones by grouping similar sequences together based on their similarity. Afterwards, a representative observation per clone can be used to treat every clone with the same weight, irrespective of its actual number of members. This allows you to assess the diversity in a set of sequences with much higher precision than by simply taking all sequences.
Usually, DNA sequences are used (instead of amino acid sequences) since they hold more information and, unlike amino acid sequences, are not drastically altered by single-nucleotide indels, which may otherwise skew the results. For our tutorial, you will need to extract the "Vfamily" information from IMGT's "V.GENE.and.allele" column first; please have a look at the respective tutorial (see above).
For this tutorial, please set the interface as follows: 1) use column "DNA_junction" as the column holding the sequences, 2) enable "Partition data (first column)" and select column "Vfamily", 3) select the "Levenshtein distance" as the distance metric and set the threshold to 0.18 (which is appropriate for heavy chain sequences) and 4) enable "Calculate representative clone member" and "Include amino acids", setting the latter to column "CDR3.IMGT". Now please click on "Calculate" to start the clonotype clustering. The resulting columns, whose names can also be set in the interface, are appended to the end of the table once the calculation is complete.
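Conceptually, the clustering resembles the following R sketch, which uses the "stringdist" package and single-linkage hierarchical clustering. It is only an approximation for one partition of the data and is not necessarily identical to the server's algorithm or its distance normalization:

    library(stringdist)
    one_group <- subset(data, Vfamily == "IGHV1")                       # clustering is performed per partition
    d  <- stringdistmatrix(one_group$DNA_junction, method = "lv")       # pairwise Levenshtein distances
    dn <- as.dist(as.matrix(d) / mean(nchar(one_group$DNA_junction)))   # crude length normalization
    cl <- hclust(dn, method = "single")
    one_group$CloneID <- cutree(cl, h = 0.18)                           # cut at the 0.18 threshold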
Introduction / Upload
Welcome to the tutorial for the data analysis branch of BRepertoire. As usual, the interface to set the options is on the left-hand side and the results are displayed on the right next to it. In this first step, you need to provide the data to be analysed by uploading a CSV file from your local machine. This file can be generated by exporting your data from programs such as Microsoft Excel or LibreOffice Calc, from programming languages such as R, or by using the "Manage IMGT/V-Quest output" function on this server. Most often, the entries in these text files are separated by commas and strings are enclosed in double quotes, but you can adjust these settings according to your file if necessary. By clicking on "Browse", you will be able to select the file of choice from your hard drive. Note that the upload size limit is currently 256 MB, to enable fast server response times. You can find this and other hints in the notes on the respective pages. Also, by moving the mouse cursor over the blue question marks, help texts pop up to provide further explanations. For the sake of this tutorial, we will use an example file that has been prepared using the "Manage IMGT/V-Quest output" and "Calculation" branches of BRepertoire. You can download it by clicking "here", as stated by the respective help text, or load it directly to the server. We will do the latter now, but feel free to have a look at the file to get an idea of the format. Note that whenever you perform an action that might take a while, a progress bar pops up to inform you of the current status and the steps performed. Please wait for these actions to finish before providing further input, to avoid "jumps". After loading and parsing the data, which includes the removal of special characters from column labels to ensure proper data handling afterwards, the data is shown as a table on the right-hand side. The table can be sorted by column values for visual inspection and the bottom of the table offers a navigation interface; this has no effect on the subsequent analysis, although it is usually a good idea to make sure that the columns look as you would expect them to. At the bottom, you see how many rows your dataset contains. In contrast to the fixed column names used by IMGT/V-Quest, the column names are not pre-defined; they only need to be unique. Note that the button "Select" has now been enabled, so when you are satisfied with the upload, please proceed to that step.
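For reference, the parsing and label clean-up roughly correspond to the following R sketch (the file name is hypothetical, and make.names() only approximates the server's handling of special characters):

    data <- read.csv("BRepertoire_example.csv", sep = ",", quote = "\"", check.names = FALSE)
    names(data) <- make.names(names(data), unique = TRUE)  # remove special characters, enforce unique labels
    nrow(data)                                             # number of rows, as reported below the table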
Select
In the first tutorial, you have loaded the repertoire example data - now you need to select the parts of the data that you want to analyse later. There may be much more information available than is currently of interest. For example, the input file we have provided you with has columns for "Sequence ID" and "CloneID", which we will not need in the course of this tutorial. To specify the data necessary to follow this tutorial, you should now select the columns "Patient.ID", "Sample.ID", "Age.Group", "Vfamily", "Jfamily", "PrimaryDfamily", "Pepstats_length" and the ten Kidera factors in the left-hand menu. You can see that the table on the right side is dynamically updated when you add or remove columns. If you want to keep all columns or delete your current selection, there are two buttons at the top facilitating this. It also becomes clear that there are different types of values, namely categorical or string columns (for example, "Vfamily") and numerical columns (for example, the Kidera factors). Some analysis functions later on may accept either one or both types, depending on their input requirements. The data can be downloaded by clicking on the "Download data" button at the usual location. Note that if at least one column is selected, the "Filter" and "Grouping" buttons will be enabled. The "Filter" step is optional and allows you to filter out certain values within the columns; for details please see the filter tutorial. In contrast, data grouping is mandatory and is described in a later tutorial.
Filter
The "Filter" step is optional and allows the selection of certain values for the columns comprising the data set. This function operates on the dataset selected in the previous step, so only the columns you have selected are shown in the left-hand panel. You can set value ranges for multiple columns and only rows that match all requirements will be retained. When you start the filtering, the table on the right-hand side is identical to the one in the selection step because you have not specified any filtering rules yet. You can check the number of entries at any time to monitor how the size of your data set is affected by the filtering. The first step is to select the columns you want to filter. In the course of our tutorial, we will select "Vfamily" and "Pepstats_length", the former being the type of V gene and the latter the number of amino acids in the CDR3 region of the heavy chain. For every selected column, the data type can be specified. Usually, the server guesses the type correctly and displays non-numeric values as a checkbox list of the values (for example, "Vfamily") and numeric values as a range slider (for example, the number of amino acids). In our case, we will deselect the value "IGHV7", since we only want to analyse V families 1 to 6. We also require the number of amino acids to be at least 3. You can see that the number of rows or observations in the data set has been reduced slightly, since the filtering takes effect immediately. The next step is the data grouping.
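Before proceeding, note that in R terms the selection and filtering steps amount to simple subsetting. A sketch, assuming the uploaded table from the previous sketch is in a data frame called data and that the ten Kidera factor columns are named KF1 to KF10 (the actual names may differ in your file):

    keep <- c("Patient.ID", "Sample.ID", "Age.Group", "Vfamily", "Jfamily",
              "PrimaryDfamily", "Pepstats_length", paste0("KF", 1:10))
    selected <- data[, keep]                                                  # "Select" step
    filtered <- subset(selected, Vfamily != "IGHV7" & Pepstats_length >= 3)   # "Filter" step
    nrow(filtered)                                                            # rows remaining after filtering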
Grouping
Prior to any analysis, grouping columns have to be specified to allow comparisons. For example, if you select "Age.Group" and the values are either "Old" or "Young", it will be possible to compare the data specifically between them. In our example, we will select "Patient.ID", "Sample.ID" and "Age.Group". On the right-hand side you will see the number of observations, or rows, per sub-partition of your data. Typically, the columns selected for grouping will not be used for the analysis itself. As you can see, the default order of the post-vaccination days in column "Sample.ID" is "Day 0", "Day 28" and "Day 7". We can set it to a more logical order: in the left panel you can reorder the levels of the grouping columns by moving the values from left to right and vice versa. Make sure that all the levels you want to use are in the respective right-hand field before proceeding. The order of the levels will be used for the plots afterwards but does not affect the results in any other way. Also note that columns with more than 100 levels are not available for grouping, because the data would be partitioned into very small groups and become very sparse. When you are satisfied with the order and selection made, you may proceed to the enabled analysis tabs. The next tutorial will show the "Box-Whisker plot" function.
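As an aside, reordering the levels corresponds to defining a factor with explicit levels in R; continuing the sketch above (with the assumed column names), the sub-partition counts shown on the right can be approximated with table():

    filtered$Sample.ID <- factor(filtered$Sample.ID,
                                 levels = c("Day 0", "Day 7", "Day 28"))  # logical order for the plots
    table(filtered$Age.Group, filtered$Sample.ID)  # observations per Age.Group / Sample.ID combination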
Box-Whisker plot
The "Box-Whisker plot" is a fairly common representation of numerical data. To start, please have a close look at the interface on the left. At the top, the available statistical tests are listed; currently, only the Wilcoxon rank sum test is implemented. For further statistical analyses please have a look at the "Distribution analysis" tab. Below you see the grouping level interface, which allows you to partition the data. You will recognize the three grouping columns we specified in the grouping step before. Your data will be split according to the values in these columns; for example, we have two levels in column "Age.Group", namely "Young" and "Old", and three levels in column "Sample.ID", namely "Day 0", "Day 7" and "Day 28". The data will therefore be split into (2x3=) 6 partitions, or boxplot elements in this case. Remember that only the groups in the right-hand box are considered. In our tutorial, we will use these groups in the order "Sample.ID" and "Age.Group". Most functions require at least one and take a maximum of two grouping levels at the same time, which is stated below the respective selectors. In our example, if we had only used "Age.Group" for grouping, we would get a plot with only two boxplot elements, one for the "Young" and one for the "Old" group. The next step is to select the colours for plotting. Note that if you have specified two grouping columns, the second one in the list is used for colouring, which is "Age.Group" in our case. You can select a colour for every level in the data set; we change the colour for "Young" to blue by clicking and selecting from the palette. Below this you have to specify the property that you want to use in your boxplot. Note that only numerical columns are shown here and that grouping columns are excluded. We will select the number of amino acids in the CDR3 of the heavy chain. Below this you have some graphical options, for example whether you want to add a legend or not. By clicking on "Generate Plot" you start the calculation and plot generation, which you can monitor by following the progress bar. As specified, our plot shows the CDR3 length distributions for the "Old" and "Young" groups for the data recorded at "Day 0", "Day 7" and "Day 28", respectively. The first grouping level, "Sample.ID", splits the x-axis and the second one is used, as mentioned before, for the colouring. You can see immediately that the starting distributions for both "Young" and "Old" are very much alike, while the response to vaccination at "Day 7" is clearly different. Three weeks later the values are similar to those at day 0, indicating the normalization of the repertoires. In our example, this might be interpreted as a hint that older people react differently to this specific vaccination than younger ones in terms of the lengths of the CDR3 of the heavy chains. For all analysis tasks, the appropriate download buttons will be displayed at the top of the right-hand tabs when ready, in this case including p-values. Note that some plots might look a bit different when downloaded because a different engine is used to generate them. Alternatively, you can download the plot you see by a right-click or by taking a screenshot. The next tutorial will deal with the "Bar plot" function.
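Before moving on, the essence of this step can be reproduced with base R; the sketch below (continuing the assumed data frame filtered from earlier) is an illustration, not the server's plotting code:

    boxplot(Pepstats_length ~ Age.Group + Sample.ID, data = filtered,
            col = c("red", "blue"),   # "Old" red, "Young" blue (alphabetical factor order)
            ylab = "CDR3 length (amino acids)")
    # Wilcoxon rank sum test, e.g. "Old" vs "Young" at Day 7
    day7 <- subset(filtered, Sample.ID == "Day 7")
    wilcox.test(Pepstats_length ~ Age.Group, data = day7)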
Bar plot
In the previous tutorial, we examined the differences between old and young people in terms of the response to a vaccine. To this end, we used the length of the CDR3 part of the heavy chain in amino acids. This can also be illustrated using the "Bar plot" function. On the left, we have the grouping level interface and we will set it to use "Sample.ID" and "Age.Group" as before. Remember that you can change the order of the grouping levels in the grouping tab as described above, and that only the groups in the right-hand box are applied. In contrast to other functions, you do not have to use a grouping for the bar plot, in which case the entire dataset would be plotted. However, in our example the first grouping column will again lead to a data split along the x-axis and the second one will be used for colouring - in the very same way as in the previous tutorial for the boxplot. We have set the colours to red and dark green for the "Old" and "Young" groups, respectively. Now, for the selection of the plot property, note that both numerical and non-numerical data can be used. For non-numerical data such as the "Vfamily", BRepertoire internally counts the occurrences of the levels in the data partitions and uses this count for plotting. This can be useful if you want to compare immunoglobulin classes, for example. However, in this tutorial we will again use "Pepstats_length" to proceed with our analysis. Below this checkbox input you can set further options. Firstly, you need to decide whether you want relative or absolute values. While the latter are useful for looking at absolute differences, relative values will probably be preferred in cases where the number of observations in the different groups varies. Although we have a comparable number of observations for both the "Young" and "Old" groups, we will opt for relative values here. We also need to select the property values we want to plot. The number of amino acids in our data set ranges from 3 to 64; remember that we filtered out CDR3s with fewer than 3 amino acids in the filtering step. The most interesting region in this case is the range between seven and twelve amino acids. Since we are now using only a fraction of the data for plotting, we might also use the "In respect to all data" option. This will result in the relative value calculation being performed on all the data, while the plot will only show the selected levels. You can try this out on your own to see the difference later. In contrast, the default setting "by group" normalizes the data per group on the x-axis. Finally, we set the available plotting parameters such as the title of the figure and click on "Generate Plot". You can immediately see the most pronounced difference between the groups, which is a massive expansion of the fractions of lengths eight, nine and ten in the "Young" group. This is likely to be caused by the vaccine, because it happens only in the samples taken one week after the vaccination. You can also see that this pattern is not visible in the "Old" group, probably indicating that the response was less intense. Since these data are relative with respect to the first grouping column, the "Sample.ID", and we selected a calculation of the values considering all data, the visible bars for "Day 0", "Day 7" and "Day 28" do not sum up to 100 percent. The next analysis will be the "Distribution analysis".
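As an aside, the difference between "by group" and "In respect to all data" normalization boils down to which total the counts are divided by. A simplified R sketch for the "Young" group (again using the assumed data frame filtered; not the server's code):

    young  <- subset(filtered, Age.Group == "Young")
    counts <- table(young$Pepstats_length, young$Sample.ID)   # absolute counts per length and day
    by_group <- prop.table(counts, margin = 2)                # "by group": each day sums to 1
    all_data <- counts / sum(counts)                          # "In respect to all data": one global total
    sel <- intersect(rownames(by_group), as.character(7:12))  # lengths 7 to 12, if present
    barplot(by_group[sel, ], beside = TRUE, legend.text = TRUE,
            xlab = "Sample.ID", ylab = "fraction")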
Distribution analysis
If data has been clustered or grouped on the basis of multi-dimensional data, it can be tedious to search for differences in individual numerical properties. To support the user with this task, we provide the "Distribution analysis" tab, which allows fast comparisons of value distributions and the calculation of various statistics. In our example, we will probe whether there are any differences between the CDR3 characteristics of the "Young" group at "Day 0" and "Day 7", respectively. To that end, we first select the appropriate grouping columns, "Age.Group" and "Sample.ID". Note that the dataset selectors are updated and allow us to specify the two data subsets we want to compare very easily. We will set dataset one to hold all entries which are "Young" and taken on "Day 0", and dataset two to hold those which are "Young" and taken one week later. Note that the number of observations in the datasets is shown automatically, excluding those containing NA values. Finally, we set the colour of the second dataset to green and the opacity for the histogram plots to 65% to improve the readability of the plot. This function can process multiple properties simultaneously, so we will select the ten Kidera factors. We can choose several different statistical measures to be calculated, which make different assumptions about our data and have different strengths and weaknesses. For example, the widely used t-test assumes that the data is normally distributed and that the observations are independent of one another. If this is not the case, the calculated p-value might not be very reliable. The same holds for the effect size measures that are supported. To use a stable, non-parametric approach, we will use a permutation test with 9000 iterations and calculate Cliff's delta. You might also want to set some plotting parameters. At this stage, we are ready to start the calculations. Just be aware that the permutation test will take quite a while, which is the major drawback of this method. In the generated plot, you can see the histograms, where every pair of red and green bars covers the same value range to make them comparable. The number of bars differs as it is optimized by the underlying R function "hist". The calculated statistical measures are plotted in the respective data partitions. Note that blue text indicates significance, meaning a value below 0.05 for the p-values and, for the effect sizes, a value according to the respective effect size tables. Using this functionality, a lot of potentially major differences between parts of your data can be screened quickly and efficiently. As a final remark, if you are unsure which statistical test to use, you might want to perform a Kolmogorov-Smirnov (KS) test, either in the standard R implementation or the weighted version we provide for tailed data, which has been implemented by Rand Wilcox.
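For readers who would like to reproduce such a comparison outside the server, the sketch below shows a simple permutation test on the difference of means together with the non-parametric alternatives mentioned above. The Kidera factor column name KF4 is an assumption, as is the use of the CRAN package "effsize" for Cliff's delta; neither is necessarily what the server uses internally.

    d0 <- subset(filtered, Age.Group == "Young" & Sample.ID == "Day 0")$KF4
    d7 <- subset(filtered, Age.Group == "Young" & Sample.ID == "Day 7")$KF4
    # permutation test on the difference of means, 9000 iterations
    obs  <- mean(d0) - mean(d7)
    pool <- c(d0, d7)
    perm <- replicate(9000, {
      idx <- sample(length(pool), length(d0))
      mean(pool[idx]) - mean(pool[-idx])
    })
    p_value <- mean(abs(perm) >= abs(obs))
    # non-parametric alternatives
    ks.test(d0, d7)                   # standard Kolmogorov-Smirnov test
    effsize::cliff.delta(d0, d7)      # Cliff's delta (CRAN package "effsize")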
Principal Component Analysis (PCA)
A standard way to reduce a multi-dimensional space to a 2D or 3D representation is principal component analysis, or PCA. It is a linear transformation based on the eigendecomposition of the covariance matrix of the data, representing the data along orthogonal eigenvectors that are ordered by the amount of variance they explain. For example, the first eigenvector or principal component will always account for at least as much variance as the second one, and so on. In our example, we might ask ourselves whether the data obtained on "Day 7" differs from the other time points, since this has been suggested by the distribution analysis described in the previous tutorial. Firstly, we select the grouping columns, namely "Patient.ID" and "Sample.ID". In this example, using just "Sample.ID" would also be possible, but in order to plot multiple mean values, we will use a second grouping column. We can select different colours for the second grouping column. As the distribution analysis earlier suggested, there might be a difference in Kidera factors 4 and 5, so we will only use these two, to minimise any noise. However, in principle one could use many more dimensions. We will plot the means, which are calculated after the projection, and label them, and we will enable the data ranges or ellipses and the property loadings as well. We have chosen to disable the spread annotation in this case. When we click on "Generate Plot" the calculation starts. Note that BRepertoire notifies you when certain adaptations to the data are made, for example if NA-containing rows are removed prior to the call of the "prcomp" function. As you can see from the plot, "Day 7" indeed has a different spread, whereas "Day 0" and "Day 28" are similar to each other.
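The underlying computation is the standard "prcomp" call; a sketch with the assumed Kidera factor column names KF4 and KF5, continuing the data frame filtered from above:

    pcs <- c("KF4", "KF5")                                   # Kidera factors 4 and 5 (assumed names)
    ok  <- complete.cases(filtered[, pcs])                   # drop NA-containing rows, as the server does
    pca <- prcomp(filtered[ok, pcs], center = TRUE, scale. = TRUE)
    summary(pca)                                             # variance explained per principal component
    plot(pca$x[, 1:2], col = as.factor(filtered$Sample.ID[ok]),
         xlab = "PC1", ylab = "PC2")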
Gene usage frequencies
When it comes to antibody repertoire data, another very common analysis is the monitoring of the V(D)J gene usage for different subsets of the data. The server supports 1D, 2D and 3D plots to show the fractions, but in our tutorial we will generate 3D bubble plots. First, set the grouping column to "Age.Group", which will split the data into "Young" and "Old" and hence result in two plots. We can set the colours as usual and will represent the "Young" group in dark green. Next, we select up to three dimensions for the plot. Note that only non-numeric columns are available here. The number of properties you select determines the dimensionality of the plot. We will select "Vfamily", "Jfamily" and "PrimaryDfamily", for which we are going to generate bubble plots. After clicking on "Generate Plot", the bubble plots are shown on the right-hand side. Note that you might have to scroll up and down to update the rendering. You can rotate the plots by clicking into the plotting area and holding down your mouse button. However, in contrast to 1D or 2D plots, you cannot download these plots directly. After orienting the boxes appropriately, we suggest you take a screenshot and crop the plot afterwards. You may also try out the 1D and 2D plots on your own by de-selecting one or two dimensions on the left.
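The counts behind these plots are plain cross-tabulations of the selected gene columns. A minimal sketch, continuing the assumed data frame filtered, that you could feed into a 3D plotting package of your choice:

    young <- subset(filtered, Age.Group == "Young")
    usage <- table(young$Vfamily, young$Jfamily, young$PrimaryDfamily)  # 3D usage counts
    prop.table(table(young$Vfamily))                                    # 1D case: V-family fractions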
Dendrogram (hierarchical clustering)
Hierarchical clustering, visualized as dendrograms, can give useful insights into relationships. As usual, we select the grouping columns we want to use. Note that if two columns are specified, the data is divided into all combinations of the levels in the two, and the colour is determined by the last column. We will select the columns "Patient.ID" and "Sample.ID". As colours for the levels in "Sample.ID" ("Day 0", "Day 7" and "Day 28") we use red, orange and dark green. For the clustering, we can select any of the numerical columns in our specified dataset; we will use all of them except the length of the CDR3 of the heavy chain. Note that combining values on different scales is possible, because the data can be normalized prior to the calculations. Centring of the data, in order to get rid of the intercept, is also possible. By clicking on "Generate Plot", we calculate the dendrogram. You can also download the distance matrix if you wish, to obtain a quantitative view of the distances.
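A comparable dendrogram can be produced with base R. The sketch below averages the Kidera factors per Patient.ID / Sample.ID combination before scaling and clustering; the column names KF1 to KF10 are assumed, and the server's exact aggregation and linkage may differ:

    kf_cols <- paste0("KF", 1:10)                            # assumed Kidera factor column names
    centres <- aggregate(filtered[, kf_cols],
                         by = list(Patient.ID = filtered$Patient.ID,
                                   Sample.ID  = filtered$Sample.ID), FUN = mean)
    mat <- scale(centres[, kf_cols])                         # normalize and centre the values
    rownames(mat) <- paste(centres$Patient.ID, centres$Sample.ID)
    hc <- hclust(dist(mat))                                  # Euclidean distance, complete linkage by default
    plot(hc)                                                 # the dendrogram
    as.matrix(dist(mat))                                     # the distance matrix, if you want the numbers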
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Besides PCA, there are other options for visualizing high-dimensional data, among which t-SNE plots are a common choice. In contrast to PCA, the t-SNE algorithm implements a non-linear dimensionality reduction. This means that t-SNE plots are inherently more difficult to interpret than PCA plots. The t-SNE algorithm attempts to create a low-dimensional view that preserves, as faithfully as possible, the neighbourhood structure of the high-dimensional data space, i.e. the relationship between the distances seen in the projection and those in the original data. There are multiple hyperparameters one can set, which will affect the result significantly. Nevertheless, t-SNE has been successfully applied in many situations and has become an important alternative to standard techniques. In this tutorial, we will split our data only according to age and set the colours to red for "Old" and blue for "Young". All numerical columns can be used for the calculation and we will use the ten Kidera factors. You can perform a PCA prior to the t-SNE and select only a subset of the principal components to form the input for the algorithm. We will select the first 5 principal components, which will hopefully reduce the noise in our data. A critical parameter for a t-SNE calculation is the number of iterations after which the procedure is stopped. This can easily become time-consuming, and the computational demand is one of the shortcomings of the method. We will set it to 900. The perplexity parameter relates to the standard deviations of the Gaussians used internally in the procedure and tunes the balance between local and global aspects of your data. In general, the higher this value, the more global the separation. We will use a value of 30; commonly used values range from 5 to 50. The parameter theta tunes the speed/accuracy trade-off and will be kept at 0.5 here. Finally, the epsilon parameter tunes the learning rate of the algorithm; we will keep it at 200. You might have to play with the parameters in order to get a useful result. We refer to descriptions on the internet for further information, and you might also have a look at the documentation of the underlying "Rtsne" R package. Now, click on the button "Generate Plot" to start the time-consuming calculation. This step can take up to a few hours, depending on the size of the dataset and the parameters. If you do not close the window, however, it should complete properly.
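For completeness, the corresponding call to the "Rtsne" package with the parameters chosen above looks roughly as follows (again assuming Kidera factor columns named KF1 to KF10 and the data frame filtered from the earlier sketches; in Rtsne the learning rate epsilon is called eta):

    library(Rtsne)
    kf_cols <- paste0("KF", 1:10)                            # assumed Kidera factor column names
    ok <- complete.cases(filtered[, kf_cols])                # t-SNE cannot handle missing values
    ts <- Rtsne(as.matrix(filtered[ok, kf_cols]),
                pca = TRUE, initial_dims = 5,                # run PCA first, keep 5 components
                perplexity = 30, theta = 0.5,
                max_iter = 900, eta = 200,                   # 900 iterations, learning rate (epsilon) 200
                check_duplicates = FALSE)
    plot(ts$Y, col = ifelse(filtered$Age.Group[ok] == "Old", "red", "blue"),
         xlab = "t-SNE 1", ylab = "t-SNE 2")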