- retrieve networks for proteins of interest
- retrieve networks for a disease
- layout and visually style the resulting networks
- import external data and map them onto a network
- perform enrichment analyses and visualize the results
- merge and compare networks
- select proteins by attributes
- identify functional modules through network clustering
To follow the exercises, please make sure that you have the latest version of Cytoscape installed. Then start Cytoscape and go to Apps → App Manager to check for new apps, install them and update the current ones if necessary. The exercises require you to have certain Cytoscape apps installed. Search for the stringApp in the search field; if it is not already installed, select it and press the Install button to install it. Similarly, make sure you have the yFiles Layout Algorithms, enhancedGraphics, and clusterMaker2 apps installed before closing the App Manager.
If you are not already familiar with the STRING and DISEASES databases, we highly recommend that you go through the short STRING exercises to learn about the underlying data before working with them in these exercises.
In this exercise, we will perform some simple queries to retrieve molecular networks based on a protein, a small-molecule compound, a disease, and a topic in PubMed.
1.1 Protein queries
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STRING: protein query as Data Source and type your favorite protein into the Enter protein names or identifiers field (e.g. SORCS2). You can select the appropriate organism by typing the name (e.g. Homo sapiens). The Maximum number of interactors determines how many interaction partners of your protein(s) of interest will be added to the network. By default, if you enter only one protein name, the resulting network will contain 10 additional interactors. If you enter more than one protein name, the network will contain only the interactions among these proteins, unless you explicitly ask for additional proteins.
Unless the name(s) you entered give unambiguous matches, a disambiguation dialog will be shown next. It lists all the matches that the stringApp finds for each ambiguous query term and selects the first one for each. Select the right one(s) you meant and continue by pressing the Import button.
How many nodes are in the resulting network? How does this compare to the maximum number of interactors you specified? What types of information does the Node Table provide?
1.2 Compound queries
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STITCH: protein/compound query as Data Source and type your favorite compound into the Enter protein or compound names or identifiers field (e.g. imatinib). You can select the organism and number of additional interactors just like for the protein query above, and the disambiguation dialog also works the same way.
How is this network different from the protein-only network with respect to node types and the information provided in the Node Table?
1.3 Disease queries
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STRING: disease query as Data Source and type a disease of interest into the Enter disease term field (e.g. Alzheimer's disease). The stringApp will retrieve a STRING network for the top-N proteins (by default 100) associated with the disease.
The next dialog shows all the matches that the stringApp finds for your disease query and selects the first one. Make sure to select the intended disease before pressing the Import button to continue.
Which additional attribute column do you get in the Node Table for a disease query compared to a protein query? Hint: check the last column.
1.4 PubMed queries
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STRING: PubMed query as Data Source and type a query representing a topic of interest into the PubMed Query field (e.g. jet-lag). You can use any query that would work on the PubMed website, but it should be a topic with related genes or proteins. The stringApp will query PubMed for the abstracts, find the top-N proteins (by default 100) associated with these abstracts, and retrieve a STRING network for them.
Which attribute column do you get in the Node Table for a PubMed query compared to a disease query? Hint: check the last columns.
1.5 Using the Cytoscape Search bar
The types of queries described above can alternatively be performed through the Cytoscape Search bar (located at the top of the Network panel in the Control Panel). Click on the drop-down menu with an icon for the different resources. Select one of the four possible STRING queries and directly enter your query in the text field. To change settings such as organism, click the ☰ button next to the text field. Finally, click the 🔍 button to retrieve a STRING network for your query.
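The four query types above can also be scripted. The following is a minimal sketch that only builds the command text; it assumes the stringApp command names (`string protein query`, `string compound query`, `string disease query`, `string pubmed query`) and argument names (`query`, `disease`, `pubmed`, `species`, `limit`, `cutoff`) as recalled from the stringApp automation documentation. Please verify them with `help string` in the Cytoscape Command Line before relying on them. The helper `string_query_command` is a hypothetical name introduced for illustration.

```python
def string_query_command(kind, term, limit=None, species=None, cutoff=None):
    """Build a stringApp command line for one of the four query types."""
    # Argument that carries the search term for each query type
    # (assumed from the stringApp automation documentation).
    term_argument = {
        "protein": "query",
        "compound": "query",
        "disease": "disease",
        "pubmed": "pubmed",
    }
    if kind not in term_argument:
        raise ValueError(f"unknown query type: {kind}")
    parts = [f"string {kind} query", f'{term_argument[kind]}="{term}"']
    if species is not None:
        parts.append(f'species="{species}"')
    if limit is not None:
        parts.append(f"limit={limit}")
    if cutoff is not None:
        parts.append(f"cutoff={cutoff}")
    return " ".join(parts)


# Examples taken from the exercise text above.
print(string_query_command("protein", "SORCS2", limit=10, species="Homo sapiens"))
print(string_query_command("disease", "Alzheimer's disease", limit=100))
print(string_query_command("pubmed", "jet-lag", limit=100))
```

The printed lines can be pasted into the Cytoscape Command Line dialog, or sent to a running Cytoscape instance from a script, for example with the `commands_run` function of py4cytoscape.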
In this exercise, we are going to use the stringApp to query the DISEASES database for proteins associated with epithelial ovarian cancer (EOC), retrieve a STRING network for them, and explore the resulting network.
2.1 Disease network retrieval
Close the current session in Cytoscape from the menu File → Close. Use the menu File → Import → Network from Public Databases and the STRING: disease query option from the Data Source drop-down menu. Insert ovary epithelial cancer into the Enter disease term field, set the Maximum number of proteins option to 250 and press the Import button. Once the network appears, go to the menu View → Always Show Graphics Details to see the individual nodes and edges.
2.2 Work with node attributes
Note that the retrieved network contains a lot of additional information associated with the nodes and edges, such as the protein sequence, tissue expression data, subcellular localization, disease score (Node Table) as well as the confidence scores for the different interaction evidences (Edge Table). In the following, we will explore these data using Cytoscape.
Find the disease score column in the node attributes table (look at the last columns). Sort it by values to see the highest and lowest confidence scores. You can highlight the corresponding nodes by selecting the rows in the table, bringing up the context menu (right-click the selected rows) and choosing the Select nodes from selected rows option. You can also use the Fit Selected icon in the menu bar to zoom into the selected node (View → Fit Selected).
Give an example for a node with the highest and lowest disease score.
2.3 Inspect subcellular localization data
The stringApp automatically retrieves information from the COMPARTMENTS database about which cellular compartments the proteins are located in. We will first take a look at this database to better understand the data.
Go to COMPARTMENTS and enter ARID1A into the search box. The resulting page will show all matches for the query ARID1A.
After selecting the human gene, you will see a schematic of where in the cell it is located and below it tables containing the specific lines of evidence that contribute to the overall score.
What compartments is ARID1A present in with a confidence of 5 (stars)? What source do these associations come from? Hint: you can see what the abbreviations for different evidence types mean here.
2.4 Continuous color mapping
Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map the subcellular localization data for nucleus to the node color.
From the left panel side menu, select Style (located underneath Network and above Filter). Click on the ◀ button to the right of the property you want to change, in this case Fill Color and set Column to the node column containing the data that you want to use (nucleus). Since this is a numeric value, we will use the Continuous Mapping as the Mapping Type, and set a color gradient for how likely each protein is located in the nucleus. The default Cytoscape yellow–purple color gradient already gives a nice visualization of the confidence of being located in this compartment.
Does it look like the network contains many nuclear proteins?
2.5 Select proteins located in the nucleus
Because many proteins are located in the nucleus, we will identify only those with the highest confidence score of 5. One way to do this is to use the COMPARTMENTS sliders in the STRING Results panel on the right side. In the Nodes tab, expand the group of compartment filters by clicking the small triangle and find the slider for nucleus. To hide all nodes with a confidence score below 5, set the low bound to 5.0 by typing the number in the text field and pressing Enter.
Select all remaining nodes in the network view by holding down the modifier key (Shift on Windows, Ctrl or Command on Mac) and then left-clicking and dragging to select multiple nodes. The nodes will turn yellow if they are selected properly. The number of selected nodes is shown in the light grey panel bar on the bottom-right part of the network view panel, just above the Table panel.
How many proteins are found in the nucleus with a confidence score of 5? In mitochondrion? In both nucleus and mitochondrion?
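The counting asked for above can also be done outside Cytoscape on an exported Node Table. The sketch below uses a made-up toy table; the protein names and scores are placeholders, and the column names `nucleus` and `mitochondrion` are assumptions based on the names used in this exercise (the exported column headers may differ).

```python
# Toy node table: names and scores are invented for illustration only.
nodes = [
    {"name": "P1", "nucleus": 5.0, "mitochondrion": 2.1},
    {"name": "P2", "nucleus": 5.0, "mitochondrion": 5.0},
    {"name": "P3", "nucleus": 3.2, "mitochondrion": 5.0},
    {"name": "P4", "nucleus": 4.9, "mitochondrion": 0.0},
]


def names_at_or_above(rows, column, threshold=5.0):
    """Return the names of all rows whose score in `column` reaches the threshold."""
    return {row["name"] for row in rows if row.get(column, 0.0) >= threshold}


nuclear = names_at_or_above(nodes, "nucleus")
mitochondrial = names_at_or_above(nodes, "mitochondrion")
both = nuclear & mitochondrial  # set intersection: present in both compartments

print(len(nuclear), len(mitochondrial), len(both))
```

Using sets makes the "in both compartments" question a simple intersection, which mirrors applying two slider filters at the same time.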
Important: Move the filter back to 0.0 before continuing with the next exercise.
In this exercise, we will work with a list of 541 proteins associated with epithelial ovarian cancer (EOC) as identified by phosphoproteomics in the study by Francavilla et al. An adapted, simplified version of their results table can be downloaded here.
3.1 Protein network retrieval
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STRING: protein query as the Data Source and paste the list of UniProt accession numbers from the UniProt column in the table into the Enter protein names or identifiers field.
Next, the disambiguation dialog shows all query terms that cannot be matched to a unique STRING protein, with the first matching STRING protein for each query term automatically selected. This default is fine for this exercise; click the Import button to continue. Check that View → Always Show Graphics Details is enabled to get a detailed view of the network.
How many nodes and edges are there in the resulting network? Do the proteins all form a connected network? Why?
Cytoscape provides several visualization options under the Layout menu. Try the Degree Sorted Circle Layout, the Prefuse Force Directed Layout with score as edge weight, and yFiles Organic Layout.
Can you find a layout that allows you to easily recognize patterns in the network? What about the Edge-weighted Spring Embedded Layout with the attribute ‘score’, which is the combined STRING interaction score?
3.2 Discrete color mapping
Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map drug target family data from the Pharos database to the node color. This data is contained in the node attribute called target family.
Select Style from the side menu in the left panel (it is between Network and Filter). Click the ◀ button to the right of the property you want to change, in this case Fill Color, and change Column from name to (T) family, which is the node column containing the data that you want to use. The Mapping Type should remain set to Discrete Mapping. This action will remove the rainbow coloring of the nodes and present you with a list of all the different values of the attribute that exist in the network, in this case several protein target families.
To color the proteins in a given target family, first click the field to the right of an attribute value, i.e. GPCR or IC, then click the ⋯ button and choose a color from the color selection dialog. You can also set the default color for all nodes that do not have a target family annotation from Pharos by clicking on the grey square in the first column of the Fill Color row.
How many of the proteins in the network are ion channels (IC) or GPCRs?
There are many kinases in the network. We can avoid counting them manually by creating a selection filter in the Filter tab (located underneath Style). Click the ᐩ button and choose Column filter from the drop-down menu. Then, find and select the attribute (T) Node: family. Write kinase in the text field to select all nodes with this annotation.
How many kinases are in the network?
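As a cross-check of the counts obtained with the column filter, the same tally can be sketched in a few lines on an exported Node Table. The table below is a made-up toy example, `family` stands in for the target family column, and the case-insensitive "contains" match is an assumption about how the filter is configured.

```python
from collections import Counter

# Toy node table: names and annotations are invented for illustration only.
nodes = [
    {"name": "P1", "family": "Kinase"},
    {"name": "P2", "family": "GPCR"},
    {"name": "P3", "family": "Kinase"},
    {"name": "P4", "family": "IC"},
    {"name": "P5", "family": None},  # no target family annotation
]


def count_by_family(rows):
    """Count how many nodes carry each target family annotation."""
    return Counter(row["family"] for row in rows if row["family"])


def select_family(rows, text):
    """Names of nodes whose family contains `text`, ignoring case."""
    needle = text.lower()
    return [
        row["name"]
        for row in rows
        if row["family"] and needle in row["family"].lower()
    ]


print(count_by_family(nodes))
print(select_family(nodes, "kinase"))
```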
3.3 Data import
Network nodes and edges can have additional information associated with them that we can load into Cytoscape and use for visualization. We will import the data from the text file.
To import the node attributes file into Cytoscape, go to File → Import → Table from File. In the resulting dialog entitled Import Columns From Table, use the drop-down menu next to Where to Import Table Data to choose the option To a Network Collection. Next, change the Key Column for Network from shared name to query term and click OK.
Detailed explanation: Understanding Cytoscape's data import
The preview in the bottom of the import dialog will show how the file is interpreted given the current settings and will update automatically when you change them. To change the default interpretation of a column, click the arrow in its column heading. For example, you can decide whether the column is imported or not by changing the Meaning of the column (hover over each symbol with the mouse to see what they mean). This column-specific dialog will also allow you to change the column name and type.
Another important part is that you need to map unique identifiers between the entries in the data and the nodes in the network. The key point of this is to identify which nodes in the network are equivalent to which entries in the table. This enables mapping of data values into visual properties like Fill Color and Shape. This kind of mapping is typically done by comparing the unique identifier for each node (Key Column for Network) with the unique identifier for each data row in the table (marked with key symbol).
The Key Column for Network can be changed using a drop-down menu and allows you to set the node attribute column that is to be used as key to map to. In this case it is query term because this attribute contains the UniProt accession numbers you entered when retrieving the network. You can also change the Key by pressing the key button for the column that is to be used as key for mapping values in the dataset. In this case it is the first column in the table called UniProt, from where you copied the identifiers.
If there is a match between the value of a Key in the dataset and the value of the Key Column for Network field in the network, all attribute–value pairs associated with the element in the dataset are assigned to the matching node in the network. You will find the imported columns at the end of the Node Table.
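The key-based mapping described above is essentially a join between two tables. The following sketch illustrates the idea with toy data; the accession strings and the value are placeholders, and `map_table_onto_nodes` is a hypothetical helper, not part of Cytoscape.

```python
def map_table_onto_nodes(nodes, rows, node_key="query term", table_key="UniProt"):
    """Copy all columns of each table row onto the node with the matching key.

    Returns the number of nodes that received data.
    """
    rows_by_key = {row[table_key]: row for row in rows}
    matched = 0
    for node in nodes:
        row = rows_by_key.get(node.get(node_key))
        if row is None:
            continue  # no table entry for this node: it stays unchanged
        for column, value in row.items():
            if column != table_key:
                node[column] = value
        matched += 1
    return matched


# Toy data: identifiers and values are invented for illustration only.
nodes = [{"query term": "ACC1"}, {"query term": "ACC2"}]
rows = [
    {"UniProt": "ACC1", "EOC vs FTE&OSE": 2.5},
    {"UniProt": "ACC3", "EOC vs FTE&OSE": -1.0},  # not in the network
]

print(map_table_onto_nodes(nodes, rows))
print(nodes)
```

Rows without a matching node are ignored and nodes without a matching row keep empty values, which is why choosing the right pair of key columns matters.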
3.4 Continuous color mapping
Now, we want to color the nodes according to the quantitative phosphorylation data (log ratios) between disease (EOC) and the two healthy tissues distal fallopian tube epithelium (FTE) and ovarian surface epithelium (OSE) for the most significant site for each protein. From the left panel side menu, select Style (it is underneath Network). Then click on the ◀ button to the right of the property you want to change, for example Fill Color. Next, set Column to the node column containing the data that you want to use (EOC vs FTE&OSE). Since this is a numeric value, we will use the Continuous Mapping as the Mapping Type, and set a color gradient for how abundant each protein is. The default Cytoscape color gradient blue–white–red already gives a nice visualization of the log ratio.
Are the up-regulated nodes grouped together? Do you see any issues with the color gradient?
To change the colors, double click on the color gradient in order to bring up the Continuous Mapping Editor window and edit the colors for the continuous mapping. In the mapping editor dialog, the color that will be used for the minimum value is on the left, and the maximum is on the right. Double click on the triangles on the top and sides of the gradient to change the colors. The triangles on the top represent the values at which the data will be clipped; anything above the right triangle will be set to the max value. This is useful if you have a small number of values that are significantly higher than the median. As you move the triangles and change the color, the display in the network pane will automatically update – this is all easier to do than to explain! If at any point it does not seem to work as expected, it is easiest to just delete the mapping and start again.
Can you improve the color mapping such that it is easier to see which nodes have a log ratio below -4 and above 4?
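To make the effect of the handle positions concrete, here is a minimal sketch of a three-point blue–white–red gradient in which values are clipped at -4 and 4. It is an illustration of the principle only, not Cytoscape's implementation; the function names and the exact RGB values are assumptions.

```python
BLUE, WHITE, RED = (0, 0, 255), (255, 255, 255), (255, 0, 0)


def lerp_color(start, end, fraction):
    """Linearly interpolate between two RGB colors (fraction in [0, 1])."""
    return tuple(round(a + (b - a) * fraction) for a, b in zip(start, end))


def log_ratio_to_rgb(value, low=-4.0, high=4.0):
    """Map a log ratio to a color; assumes low < 0 < high.

    Values outside [low, high] are clipped, so they get the end colors.
    """
    clipped = max(low, min(high, value))
    if clipped < 0:
        return lerp_color(BLUE, WHITE, (clipped - low) / (0.0 - low))
    return lerp_color(WHITE, RED, clipped / high)


for ratio in (-10.0, -4.0, 0.0, 4.0, 7.0):
    print(ratio, log_ratio_to_rgb(ratio))
```

Moving the handles closer to zero saturates the colors earlier, which makes moderate changes easier to see at the cost of no longer distinguishing the extreme values from each other.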
3.5 Network clustering
Next, we will use the MCL algorithm to identify clusters of tightly connected proteins within the network. To do that, press the Cluster network (MCL) button in the STRING Results panel on the right side of the network view. Keep the default granularity parameter (inflation value) set to 4 and click OK to start the clustering. The clusterMaker app will now run the algorithm and automatically create a network showing the clusters.
How many clusters have at least 10 nodes?
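For readers curious about what the inflation value controls, the following is a textbook-style sketch of Markov clustering (MCL) on a small unweighted graph. It is not the clusterMaker2 implementation, which works on the weighted STRING network and includes further optimizations; the function names and the toy graph are assumptions made for illustration.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=4.0, iterations=50, eps=1e-6):
    """Cluster an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the matrix into random-walk probabilities.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: two random-walk steps (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalise, which
        # strengthens strong flows and weakens weak ones.
        m = normalise_columns([[value ** inflation for value in row] for row in m])
    # Rows with a non-zero diagonal are attractors; each defines a cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return sorted(sorted(cluster) for cluster in clusters)


# Toy graph: two separate triangles (nodes 0-2 and nodes 3-5).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(two_triangles))
```

A higher inflation value sharpens the contrast between strong and weak flows more aggressively, which in general produces more and smaller clusters.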
We will work with the largest cluster in the network (it should be in the upper left corner). Select the nodes of this cluster by holding down the modifier key (Shift on Windows, Ctrl or Command on Mac) and then left-clicking and dragging to select multiple nodes. The nodes will turn yellow if they are selected properly. Then, create a new network by clicking on the New Network from Selection button and choosing the option From Selected Nodes, All Edges or via the menu item File → New Network → From Selected Nodes, All Edges.
How many nodes and edges are there in this cluster?
The cluster is very dense and almost fully connected, i.e. it has edges representing functional associations between almost all pairs of nodes. Change the network type to physical interactions by navigating to the Edges tab in the STRING Results panel and clicking the Change network type button. Leave the confidence cutoff at the default value, change the network type from full STRING network to physical subnetwork using the drop-down menu, and click OK. To better see the new set of edges, apply a layout of your choosing, e.g. the yFiles Organic Layout.
How many edges does the resulting network contain and why are there now fewer edges?
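The distinction between the full and the physical network also exists in STRING's public REST API. The sketch below only builds a request URL and does not contact the server; it assumes the `network` endpoint with the parameters `identifiers`, `species`, `required_score` and `network_type`, and it is not necessarily how the stringApp itself talks to STRING. The two protein names are simply examples that appear elsewhere in these exercises.

```python
from urllib.parse import urlencode


def string_network_url(identifiers, species=9606, required_score=400,
                       network_type="physical"):
    """Build a STRING API URL for a network among the given proteins.

    required_score is on a 0-1000 scale (400 corresponds to 0.4);
    network_type is "functional" (full network) or "physical".
    """
    params = {
        "identifiers": "\r".join(identifiers),  # one identifier per line
        "species": species,                      # NCBI taxonomy ID, 9606 = human
        "required_score": required_score,
        "network_type": network_type,
    }
    return "https://string-db.org/api/tsv/network?" + urlencode(params)


print(string_network_url(["ARID1A", "TP53"]))
print(string_network_url(["ARID1A", "TP53"], network_type="functional"))
```

Requesting both variants for the same proteins and comparing the number of returned rows gives a quick impression of how many functional associations are not backed by evidence for physical binding.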
3.6 Functional enrichment and enriched publications
Next, we will retrieve functional enrichment for the proteins in our network of the largest cluster. After making sure that no nodes are selected in the network, go to the menu Apps → STRING Enrichment → Retrieve functional enrichment or use the Functional Enrichment button in the Nodes tab of the STRING Results panel on the right side. Then, select the original, unclustered network ‘String Network’ as Background (instead of ‘genome’) and click OK. A new STRING Enrichment tab will appear in the Table Panel at the bottom. It contains a table of enriched terms and corresponding information for each enrichment category. You can see which proteins are annotated with a given term by selecting the term in the STRING Enrichment panel, and you can see the terms annotating a given node by selecting it.
How many statistically significant terms are in the table? Which is the most significant term for each of the categories GO Biological Process, GO Molecular Function, and KEGG Pathways? Hint: Look at the FDR (false discovery rate) value column and use the Filter button to select individual categories.
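To see why the choice of background matters, it helps to look at how an over-representation p-value is typically computed. The sketch below uses a one-sided hypergeometric test, which is a common choice for this kind of analysis; the exact statistics and multiple-testing correction used by STRING may differ in detail, and all numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background_size, background_hits, sample_size, sample_hits):
    """Probability of seeing at least `sample_hits` annotated proteins.

    background_size: number of proteins in the background
    background_hits: background proteins annotated with the term
    sample_size:     number of proteins in the network of interest
    sample_hits:     network proteins annotated with the term
    """
    total = comb(background_size, sample_size)
    upper = min(sample_size, background_hits)
    tail = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, sample_size - i)
        for i in range(sample_hits, upper + 1)
    )
    return tail / total


# Hypothetical example: the same 12 annotated proteins out of 50,
# tested against a small and a large background.
print(enrichment_p_value(500, 40, 50, 12))
print(enrichment_p_value(20000, 40, 50, 12))
```

With a background restricted to the proteins detected in the experiment, a term has to be over-represented relative to that set, not relative to the whole genome, which usually makes the test more conservative.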
Next, we will visualize the top-5 enriched terms in the network using split charts. Click the colorful chart icon to show the terms as charts on the network. You can manually change the layout of the network to improve the visualization. First apply the yFiles Organic Layout and then scale the network to reduce the overlap of the charts using the Layout Tools (Layout → Layout Tools).
To save the list of enriched terms and associated p-values as a text file, go to Apps → STRING Enrichment → Export enrichment results.
To retrieve a list of publications that are enriched for the proteins in the network, go to the menu Apps → STRING Enrichment → Retrieve enriched publications or press the Enriched Publications button. A new tab called STRING Publications will appear in the Table Panel on the bottom. It contains a table of enriched publications and associated information such as how many of the network proteins were mentioned in each publication.
What is the title of the most recent publication?
3.7 Overlap networks
Cytoscape provides functionality to merge two or more networks, building either their union, intersection or difference. We will now merge the EOC network we have from the DISEASES query with the one we have from the data, so that we can identify the overlap between them. Use the Merge tool (Tools → Merge → Networks…) and select the Intersection button. Then, select the two STRING networks from Available Networks list (‘String Network - ovary epithelial cancer’ and ‘String Network’). Click on > to add them to the list of Networks to Merge and click Merge.
How many nodes are in the intersection?
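At the level of node identifiers, the three merge operations correspond to ordinary set operations. The sketch below shows this with two toy node sets; the names are placeholders, and a real Cytoscape merge additionally combines edges and attribute columns, which is not modelled here.

```python
def merge_nodes(first, second, operation):
    """Combine two sets of node names the way the Merge tool combines networks."""
    if operation == "intersection":
        return first & second   # nodes present in both networks
    if operation == "union":
        return first | second   # nodes present in at least one network
    if operation == "difference":
        return first - second   # nodes only present in the first network
    raise ValueError(f"unknown operation: {operation}")


# Toy node sets: names are invented for illustration only.
disease_network = {"P1", "P2", "P3", "P4"}
data_network = {"P3", "P4", "P5"}

for operation in ("intersection", "union", "difference"):
    print(operation, sorted(merge_nodes(disease_network, data_network, operation)))
```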
3.8 Integrate networks
Now we will make the union of the intersection network, which contains the disease scores, and the experimental network. Use the Merge tool again to make the Union of the merged network and ‘String Network’. Make sure that the new merged network has the same number of nodes and edges as ‘String Network’, and that some nodes have a disease score (look for the column with this name and sort it by clicking on the column name).
Now, we can change the visualization of the merged network to look like a STRING network and to make it easy to identify proteins with a high disease confidence score. Specifically, we will change the size of the nodes as a function of their disease score. Select Style in the Control Panel and click on the drop-down menu to change the style from default to STRING style v1.5. Then, click on the Lock node width and height option to enable it so that the nodes have only one attribute Size instead of two attributes Height and Width. Set the default node size to 30 by clicking on the default value 35.0 to the left of the Size attribute and entering the new value. Click on the ◀ button to add a continuous mapping of the Size attribute using the disease score. The mapping should go from 40 for the lowest disease score to 80 for the highest score. To change the mapping values, first double click on the chart and then double click on the square corresponding to the value you want to modify and set the value you want (40 and 80). Remember to show the graphics details as well as to use a layout that allows you to see all nodes in the network (e.g. yFiles Organic Layout).
Which protein has the highest confidence score for association with EOC according to DISEASES? Hint: sort the disease score column or find the largest node in the network view.
In this exercise, we will retrieve virus-host networks for two closely related viruses, merge them into a single network, and then retrieve the functional enrichment for the host proteins in this network.
4.1 Virus queries
Go to the menu File → Import → Network from Public Databases. In the import dialog, choose STRING: protein query as the Data Source. As of version 1.4 of the STRING app, 236 virus species are included in the species dropdown menu. Since most viruses are small (they have a median of 9 proteins in their genomes) it is reasonable to import all proteins of this species for a given virus, so select this checkbox underneath the species dropdown. For this example we will query all proteins of “Human papillomavirus type 16 (HPV 16)”. Simply type HPV 16 and select the species from the resulting shorter dropdown menu.
How many proteins are encoded by this virus? What node information is imported along with the names of the proteins?
4.2 Expand with host interactors
To retrieve interactions with host proteins, go to Apps → STRING → Expand network. In the resulting dialog, enter the number of desired host proteins, and select the host species from Type of interactors to expand network by. All host species for which we have interactions with the currently imported virus genes will be shown in the dropdown menu. The selectivity of interactors can also be specified. We recommend the default value of 0.5, but you can move the slider towards 0 to decrease the number of network-specific interactors or towards 1 to increase it. In this example, we will import 10 human proteins and keep the default selectivity.
The resulting network will be automatically re-styled such that the nodes representing virus proteins are red and host proteins are green-blue. These attributes can be changed from the Cytoscape Style menu.
Which human protein has the highest interaction score to one of the virus proteins? What cellular functions is this protein involved in? (Hint: open the results panel under Apps → STRING → Show results panel.)
Additional viruses or hosts can be added to the network by iterating on this procedure, but this will only add proteins that interact with the proteins that are already in the network. This works fine when adding new hosts, since all virus proteins are already in the network. However, to add new viruses, we recommend merging the expanded networks for each virus.
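The expansion step can in principle be scripted as well. The sketch below only assembles the command text; the command name `string expand` and the argument names `additionalNodes`, `nodeTypes` and `selectivityAlpha` are assumptions recalled from the stringApp automation documentation and should be checked with `help string expand` in the Cytoscape Command Line. The helper `expand_command` is a hypothetical name.

```python
def expand_command(additional_nodes, interactor_type, selectivity=0.5):
    """Build a stringApp command that expands the current network.

    additional_nodes: number of interactors to add
    interactor_type:  species of the interactors, e.g. the host species
    selectivity:      0 favours hub proteins, 1 favours network-specific ones
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return (
        f"string expand additionalNodes={additional_nodes} "
        f'nodeTypes="{interactor_type}" selectivityAlpha={selectivity}'
    )


# The settings used in this exercise: 10 human proteins, default selectivity.
print(expand_command(10, "Homo sapiens"))
```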
4.3 Add specific host proteins
If a specific host protein is desired, it can also be included in the network from the Apps → STRING → Query for additional nodes menu option. In this example, p53 is not one of the proteins that were included in the network in the previous step; however, it is known that the HPV E6 protein mediates ubiquitination of p53. To include this protein, choose “Homo sapiens” for the species (you may have to scroll up in the list), enter “tp53” into the text area box in the dialog, and click Import.
Which HPV proteins does p53 interact with?
Note that p53 will be added to the network in the previous step if more proteins are imported or the selectivity is set to a lower value. Choosing a lower selectivity will include more hub proteins (such as p53) that are connected to many proteins, and that do not interact specifically with proteins in your network. Conversely, choosing a higher selectivity will include more proteins that are more specific to your network, but these interactions will have lower confidence (since any higher confidence hub proteins will be filtered out). Further, be aware that changing the selectivity parameter will change the enrichment results in step 4.5, since different proteins will be included in the host network.
4.4 Merge two host-virus networks
Let us now compare the networks for HPV 16 and HPV 1a. Create a new host-virus network for “Human papillomavirus type 1a (HPV 1a)” by repeating steps 4.1 and 4.2. Merge the two networks using Tools → Merge → Networks. Move both the HPV 16 and HPV 1a networks into the Networks to merge box and otherwise use the defaults for the merge. In the resulting network, use the menu option Apps → STRING → Set as STRING network to manipulate the network as a STRING network again. To show any interactions between host nodes that were present in one source network but not the other, first set the confidence to 1 using Apps → STRING → Change confidence and click OK. Then set the confidence to the desired confidence (0.4) to retrieve any missing interactions.
The resulting network can be styled to give the nodes of each species a distinct color so that the proteins of the two viruses can be distinguished from each other.
How many host proteins interact with E6 from both HPV species?
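The question above can also be answered from an exported edge list by intersecting the neighbours of the two E6 nodes. The sketch below uses a made-up edge list; the node labels (including how the two E6 proteins are told apart) and the host protein names are placeholders, and `partners` is a hypothetical helper.

```python
def partners(edges, node):
    """Return all nodes that share an edge with `node` in an undirected edge list."""
    found = set()
    for source, target in edges:
        if source == node:
            found.add(target)
        elif target == node:
            found.add(source)
    return found


# Toy edge list: labels and interactions are invented for illustration only.
edges = [
    ("HPV16 E6", "HOST_A"),
    ("HPV16 E6", "HOST_B"),
    ("HPV1a E6", "HOST_B"),
    ("HPV1a E6", "HOST_C"),
    ("HOST_A", "HOST_B"),  # host-host edge, irrelevant for the question
]

shared = partners(edges, "HPV16 E6") & partners(edges, "HPV1a E6")
print(sorted(shared))
```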
4.5 Functional enrichment
We will now examine the human proteins to see what pathways are enriched in this network.
To retrieve functional enrichment for the human proteins, go to the menu Apps → STRING Enrichment → Retrieve functional enrichment and keep the default settings. Homo sapiens will be selected by default in the species dropdown. It is currently only possible to retrieve enrichment for host proteins. A new STRING Enrichment tab will appear in the Table Panel. It contains a table of enriched terms and corresponding information for each enrichment category. Use the filter button in the top left of the STRING Enrichment panel to show only KEGG Pathways. Click on the draw charts icon to the right of the filter icon to plot the enrichment values on the network.
Which two KEGG pathways have the lowest p-values? Which host proteins are associated with the KEGG pathway “cell cycle”? (Hint: click on the associated row in the enrichment table to select the proteins with this term.)
The theoretical background for these exercises is covered in these short online lectures:
Doncheva NT, Morris JH, Gorodkin J and Jensen LJ (2019). Cytoscape stringApp: Network analysis and visualization of proteomics data. Journal of Proteome Research, 18:623-632.