Each of the following graphs represents a different taxonomic organization or stage of processing of the same sequence data. Each square represents an operational taxonomic unit (OTU) and the color of the square indicates the percentage of all sequence reads that matched this OTU for the given sample. Darker squares represent OTUs that occurred in greater proportion within the given sample.
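As a rough illustration of what each square's color encodes, the sketch below converts raw read counts for one sample into percentages of that sample's total reads. The function name `relative_abundance` and the counts are hypothetical, made up for this example, and are not taken from the actual dataset or pipeline.

```python
def relative_abundance(counts):
    """Convert raw read counts per OTU into percentages of one sample's total reads."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no meaningful proportions; report zeros.
        return {otu: 0.0 for otu in counts}
    return {otu: 100.0 * n / total for otu, n in counts.items()}


# Hypothetical read counts for a single sample (illustrative numbers only).
sample = {"OTU_1": 600, "OTU_2": 300, "OTU_3": 100}
percentages = relative_abundance(sample)

for otu, pct in percentages.items():
    print(f"{otu}: {pct:.1f}% of reads")
```

In a heatmap built this way, `OTU_1` would get the darkest square for this sample because it accounts for the largest share of reads.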

Experimental OTUs from Raw Sequence Data

Here, squares represent operational taxonomic units based purely on clustering algorithms that group together reads that are at least 97% identical to one another. I produced this data in my own experimental runs of the standard USEARCH tool. By this stage of those runs, as many as 50% of the original reads from the sequencing machine had already been discarded, either as too likely to contain an error or as suspected chimeras.
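To make the idea of grouping reads at a 97% threshold concrete, here is a toy greedy clustering sketch. It is not USEARCH's actual algorithm: it assumes all reads have the same length and measures identity as the fraction of matching positions, whereas real tools align sequences and also filter for quality and chimeras. The names `identity` and `greedy_cluster` and the example reads are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity only handles equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid becomes the centroid of a new
    cluster. Returns a list of (centroid, members) pairs.
    """
    clusters = []
    for read in reads:
        for centroid, members in clusters:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


# Made-up 100-base reads, for illustration only.
base = "ACGT" * 25
one_mismatch = "T" + base[1:]        # 99% identical to base
four_mismatches = "TTTTT" + base[5:]  # 96% identical to base

clusters = greedy_cluster([base, one_mismatch, four_mismatches])
print(len(clusters), [len(members) for _, members in clusters])
```

Even in this toy version, the result depends on the threshold and on the order in which reads are processed, which hints at why pipeline parameters matter so much.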

> ...

OTUs from the Sequencing Company

Small changes in the parameters of the data processing pipeline can produce wildly different initial OTUs. I produced 8702 OTUs when I ran USEARCH (above), yet the analysis company produced 13856 from the same data and software, though evidently with very different parameters along the way. We do not know what their parameters were; we received only the final-stage output files from them as the presumed "starting point" for our own work.

> ...

OTUs from the Sequencing Company with No Hits Highlighted

Here, all of the blue squares represent OTUs which matched, at some level, sequences of bacteria already in a known database. The orange squares highlight the wide variety of OTUs that will be lumped together in the "No Hit" category for all future analysis because these sequences did not match any sequences already listed in established databases.

> ...

Species Taxonomy from the Sequencing Company

Here we can see that all of the No Hit results have been clustered into a single category.
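The lumping step can be sketched as follows, assuming each OTU either carries a taxonomy label or has none. This is only a guess at the general shape of the operation, not the company's actual procedure; `collapse_by_label`, the OTU names, the labels, and the counts are all hypothetical.

```python
from collections import defaultdict


def collapse_by_label(otu_counts, otu_labels, no_hit_label="No Hit"):
    """Sum read counts per taxonomy label, pooling unlabeled OTUs together."""
    totals = defaultdict(int)
    for otu, count in otu_counts.items():
        # A missing or empty label means the OTU matched nothing in the database.
        label = otu_labels.get(otu) or no_hit_label
        totals[label] += count
    return dict(totals)


# Made-up counts and labels for one sample.
otu_counts = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 15, "OTU_4": 5}
otu_labels = {"OTU_1": "Taxon A", "OTU_2": None, "OTU_3": "Taxon B"}

collapsed = collapse_by_label(otu_counts, otu_labels)
print(collapsed)
```

Here `OTU_2` and `OTU_4` are distinct clusters, but after collapsing they are indistinguishable within the single "No Hit" total.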

> ...

Trimmed Species Taxonomy from the Sequencing Company

It is unclear to me exactly what was trimmed, or how, to create this level of data representation. I think that perhaps OTUs which matched sequences in the queried database but could not be identified as a specific species have been clustered together here. So, for example, we see things that are categorized under a particular Kingdom and Phylum but are simply 'unknown' at any more detailed level. This would suggest that the unknown categories will be over-represented in various visualizations: many diverse things have been clustered together at a higher level, which makes that data point appear larger than others even though it is actually an amalgamation of multiple smaller data points.
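If my guess about the trimming is right, the over-representation effect would look something like the sketch below. It assumes that lineages resolved only to phylum are padded with 'unknown' and then grouped by the resulting label; that is my assumption, not a documented step. The taxon names, counts, and function names are placeholders invented for the example.

```python
from collections import defaultdict

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def species_label(lineage):
    """Pad a possibly truncated lineage out to species level with 'unknown'."""
    padded = list(lineage) + ["unknown"] * (len(RANKS) - len(lineage))
    return ";".join(padded)


def group_by_label(otus):
    """Sum reads over all OTUs that end up with the same padded label."""
    totals = defaultdict(int)
    for lineage, reads in otus:
        totals[species_label(lineage)] += reads
    return dict(totals)


# One fully resolved OTU and three distinct OTUs resolved only to phylum.
otus = [
    (("KingdomA", "PhylumA", "ClassA", "OrderA", "FamilyA", "GenusA", "SpeciesA"), 40),
    (("KingdomA", "PhylumA"), 25),
    (("KingdomA", "PhylumA"), 20),
    (("KingdomA", "PhylumA"), 15),
]

grouped = group_by_label(otus)
unknown_label = species_label(("KingdomA", "PhylumA"))
print(grouped[unknown_label], max(reads for _, reads in otus))
```

The pooled unknown category totals 60 reads and would be drawn as the largest data point, even though no single OTU inside it exceeds 25 reads and the resolved species has 40.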

> ...