PubNet Instructions

Overview

What is PubNet?

PubNet is a utility that accepts as input up to two PubMed queries, and returns as output a network graph (in multiple image formats) based on user-specified node and edge selection properties. Nodes represent data items associated with publications returned by the queries (such as paper ids, author names, and databank ids), and edges represent instances of shared properties. PubNet can be used to visualize a variety of relationships, such as the degree to which two authors collaborate or the MeSH Term relatedness of publications with PDB ids. The visualization is done with the aid of aiSee.

Generating a graph

Type or paste any Entrez-PubMed query into the blue box labeled "Query 1". See below for further details.
(Optional) Type or paste a second query into the yellow textbox.
Choose a Node type and Edge type from the selection boxes at the bottom.
Click Submit.

Interpreting the Graph

Details for interpreting the graph are given below.

Graphs are generated by parsing the XML file returned by a PubMed query.
Each node on the graph represents an entity of the type chosen from the Node selection box on the main page.
An edge is present between two nodes if they share at least one term of the type chosen from the Edge selection box.
Edges are colored and (optionally) weighted according to the number of shared terms (darker and thicker = more shared terms).
Nodes generated from papers only appearing in the first query set are colored blue.
Nodes generated from papers only appearing in the second query set are colored yellow.
Nodes generated from papers appearing in both the first and second query sets are colored green.
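
The rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not PubNet's actual code: the function name build_graph, the input layout (a dict mapping each node to its set of terms), and the sample data are all assumptions made for the example.

```python
from itertools import combinations

def build_graph(node_terms, query1_nodes, query2_nodes):
    """Sketch of the coloring and edge rules (hypothetical helper).

    node_terms:   dict mapping a node id to the set of terms used for the
                  chosen edge type (e.g. author names or MeSH terms).
    query1_nodes: set of node ids produced by the first query.
    query2_nodes: set of node ids produced by the second query.
    """
    # Color each node according to which query set(s) it came from.
    colors = {}
    for node in node_terms:
        in_first = node in query1_nodes
        in_second = node in query2_nodes
        if in_first and in_second:
            colors[node] = "green"
        elif in_first:
            colors[node] = "blue"
        elif in_second:
            colors[node] = "yellow"

    # Two nodes are linked if they share at least one term;
    # the edge weight is the number of shared terms.
    edges = {}
    for a, b in combinations(sorted(node_terms), 2):
        shared = node_terms[a] & node_terms[b]
        if shared:
            edges[(a, b)] = len(shared)
    return colors, edges

# Made-up example: three papers with co-authorship as the edge type.
terms = {
    "paper1": {"J Smith", "K Lee"},
    "paper2": {"K Lee"},
    "paper3": {"W Chen"},
}
colors, edges = build_graph(terms, {"paper1", "paper2"}, {"paper2", "paper3"})
print(colors)
print(edges)
```

Here paper1 and paper2 share one author, so they are joined by an edge of weight 1, while paper3 stays isolated.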

Node Selection

Paper: Each publication (uniquely identified by PMID) returned by the query is represented by a node on the graph.
Author: Each author (identified by "FirstInitial LastName") gets a separate node with this option. A single paper with several authors will be drawn as several nodes.
PDB ID: When available, PDB identifiers are included in the XML output of a PubMed query. When this option is selected, the string "PDB[si]" is appended to the query to ensure only records with PDB ids are returned. Each PDB id is then represented as a separate node (even if a single paper has multiple PDB ids).
GenBank ID: Similar to PDB ID, except the string "GENBANK[si]" is used. GenBank ids are then represented as nodes.
SWISSPROT: Similar to PDB ID, except the string "SWISSPROT[si]" is used. Swiss-Prot accession numbers are represented as nodes.
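
As a rough sketch of how a query might be restricted for the databank node types: the text above only says that the filter string is "appended", so wrapping the original query in parentheses and joining with AND is an assumption here, as are the names DATABANK_FILTERS and restrict_query.

```python
# Filter strings taken from the node descriptions above.
DATABANK_FILTERS = {
    "PDB ID": "PDB[si]",
    "GenBank ID": "GENBANK[si]",
    "SWISSPROT": "SWISSPROT[si]",
}

def restrict_query(query, node_type):
    """Return the query limited to records that carry the chosen databank id.

    Hypothetical helper: node types without a databank filter
    (Paper, Author) leave the query unchanged.
    """
    tag = DATABANK_FILTERS.get(node_type)
    if tag is None:
        return query
    # Assumption: combine with an explicit AND so the filter applies
    # to the whole original query.
    return f"({query}) AND {tag}"

print(restrict_query("kinase inhibitor", "PDB ID"))
print(restrict_query("kinase inhibitor", "Author"))
```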

Edge Types

Co-Authorship: Two nodes are linked by an edge if their respective originating publications have at least one author in common.
Shared MeSH Term: Two nodes are linked by an edge if their respective originating publications have at least one MajorTopic MeSH term in common. A MajorTopic MeSH term is a term xxxxx that appears in the XML output in either of the following forms:

<MeshHeading>
<DescriptorName MajorTopicYN="Y">xxxxx</DescriptorName>
</MeshHeading>

or

<MeshHeading>
<DescriptorName MajorTopicYN="N" />
<QualifierName MajorTopicYN="Y">xxxxx</QualifierName>
</MeshHeading>

In many cases, MeshHeadings will have several QualifierNames where MajorTopicYN = "Y". In each case, the QualifierName is appended to the DescriptorName (separated by a space) and each combination is treated as a separate MeSH term.
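
The MajorTopic rule can be illustrated with Python's standard xml.etree.ElementTree module. This is a sketch under assumptions, not PubNet's own parser: the function name major_topic_terms is hypothetical and the sample headings are made up for the example.

```python
import xml.etree.ElementTree as ET

def major_topic_terms(xml_text):
    """Collect MajorTopic MeSH terms from a fragment of PubMed XML."""
    root = ET.fromstring(xml_text)
    terms = set()
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None:
            continue
        name = (descriptor.text or "").strip()
        # First form: the descriptor itself is flagged as a major topic.
        if descriptor.get("MajorTopicYN") == "Y":
            terms.add(name)
        # Second form: each major-topic qualifier is appended to the
        # descriptor (separated by a space) and counted as its own term.
        for qualifier in heading.findall("QualifierName"):
            if qualifier.get("MajorTopicYN") == "Y":
                qualifier_name = (qualifier.text or "").strip()
                terms.add(f"{name} {qualifier_name}".strip())
    return terms

# Made-up sample record fragment.
sample = """
<MeshHeadingList>
  <MeshHeading>
    <DescriptorName MajorTopicYN="Y">Proteomics</DescriptorName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Proteins</DescriptorName>
    <QualifierName MajorTopicYN="Y">chemistry</QualifierName>
    <QualifierName MajorTopicYN="Y">genetics</QualifierName>
    <QualifierName MajorTopicYN="N">metabolism</QualifierName>
  </MeshHeading>
</MeshHeadingList>
"""
print(sorted(major_topic_terms(sample)))
```

Two publications would then be linked by a Shared MeSH Term edge whenever the sets returned for them have a non-empty intersection.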
Shared Location: Two nodes are linked if identical 5-digit numerical codes appear in their publications' respective <Affiliation> tags. Note that this simple approach really only works for United States addresses at this time. Zip codes are extracted using the following regular expression: /\W(\d{5})\W/. Because US zip codes use a hierarchical convention, the precision can be specified so that locations sharing the first 3 or 4 digits of a 5-digit zip code are grouped together. For example, a 3-digit prefix is extracted using the regular expression: /\W(\d{3})\d{2}\W/. In the happy event that there is demand for support of non-US affiliations, a more sophisticated method may be developed in the future.
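
The zip code regular expressions quoted above translate directly to Python's re module. The sketch below is illustrative only: the function name zip_keys and the affiliation strings are assumptions, and the pattern is built so that it matches the two expressions given in the text for 5 and 3 digits.

```python
import re

def zip_keys(affiliation, digits=5):
    """Extract zip codes (or 3- or 4-digit prefixes) from an affiliation string."""
    if digits not in (3, 4, 5):
        raise ValueError("digits must be 3, 4, or 5")
    # digits=5 gives \W(\d{5})\d{0}\W, equivalent to \W(\d{5})\W
    # digits=3 gives \W(\d{3})\d{2}\W
    pattern = r"\W(\d{%d})\d{%d}\W" % (digits, 5 - digits)
    return set(re.findall(pattern, affiliation))

# Example affiliation strings, written for this illustration.
a = "Department of Molecular Biophysics, Yale University, New Haven, CT 06520, USA."
b = "Department of Genetics, Yale School of Medicine, New Haven, CT 06511, USA."

print(zip_keys(a), zip_keys(b))        # full 5-digit codes
print(zip_keys(a, 3) & zip_keys(b, 3)) # shared 3-digit prefixes
```

With full precision the two affiliations have no code in common, but at 3-digit precision they share a prefix and would be linked. Note that the pattern requires a non-word character on both sides, so a zip code at the very start or end of the string is not matched.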

Tips for Successful Queries

The complexity of a PubNet graph grows quickly with the number of nodes: n nodes can have up to n(n-1)/2 edges, so the number of edges can grow quadratically. It has been my experience that graphs with more than 1000 nodes are difficult to interpret, and they can take a very long time to load, if they load at all. To get better results, try the following:

Keep in mind that PubNet queries are actually PubMed queries, so practice on PubMed first. It's faster than waiting for PubNet to try to process some ridiculous query. The syntax for PubMed is fairly idiosyncratic, so read the PubMed help to learn it. First get a query that returns only a few hundred papers, then switch to PubNet.
Check out this page for additional information.

If you're anything like me, the first thing you'll search for is your own name. So we'll make that our first example. First, you should read the PubMed help section on Author Names. It's surprising how easy it is to retrieve false positives:

Query	Papers returned
Douglas	10461
S Douglas	4051
Douglas S	701
Douglas SM	10
(Douglas SM[Author] NOT (1959[dp] : 2002[dp]))	4
Shawn Douglas	4

It's more useful to search "lastname initials". A middle initial helps, but only if the author uses it. Date ranging is useful if there isn't any temporal overlap with other authors who share the name. The best method is probably to search with the full first and last name.

Limit the date range of your query. Read the help section on Dates and Date Ranging for how to do this.
Don't search for stuff like "cancer" (1,270,000+ papers) or "findings" (5,110,000+ papers). If you want to search for something that returns a ton of hits, you need to limit the dates, or add some additional search terms, or both (see Boolean help).
If you choose a node parameter other than papers, each paper can return many nodes (especially those Science papers with 200 authors). GenBank nodes are particularly difficult to use, since everything seems to have a thousand genes associated with it. Again, specific queries and date ranging are useful here.
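
One way to follow these tips is to check how many papers a query returns before handing it to PubNet. The sketch below only builds the query string and the request URL; it does not contact any server. It assumes NCBI's public E-utilities esearch endpoint, and the helper names with_date_range and count_url are made up for the example. The date syntax follows the form used in the table above.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (not part of PubNet itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def with_date_range(query, start_year, end_year):
    """Limit a PubMed query to a publication-date range."""
    return f"({query}) AND ({start_year}[dp] : {end_year}[dp])"

def count_url(query):
    """Build a URL that asks PubMed only for the number of matching papers."""
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)

query = with_date_range("Douglas SM[Author]", 2003, 2005)
print(query)
print(count_url(query))
```

Opening the printed URL in a browser returns a small XML document containing the hit count. If the count is in the thousands, narrow the dates or add search terms before switching to PubNet.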

Download PubNet Code

The code for PubNet may be downloaded here: PubNet Code