What is ScrubChem?
ScrubChem is a digital curation of PubChem Bioassay (694 million bioactivities, 9,000 molecular targets, 2.3 million chemicals), designed to build datasets and enhance research.
ScrubChem expands bioassay concepts, corrects formatting errors, and fills data gaps related to interoperability issues between bioassay records.
The result of cleaning this data is a larger number of retrievable and comparable results usable for building datasets.
Why use ScrubChem?
The PubChem Bioassay database is a non-curated public repository of bioactivity data from many sources, including ChEMBL, BindingDB, DrugBank, Tox21, the NIH Molecular
Libraries Screening Program, and various academic, government, and industrial contributors. Bioactivity datasets built on PubChem data can be the largest and most diverse
available for data-driven discovery & development. However, this data is difficult to use in aggregate form, mainly because of
a lack of interoperability and standardization among its 1.2 million assay records. Extracting this public data into high-quality, computable datasets
usable for predictive and analytical research presents several big-data challenges, for which ScrubChem is being developed as a manageable solution. To learn more,
refer to the
Poster. A manuscript will soon be released with additional details.
How to use ScrubChem?
Currently, this website lets you search records in the ScrubChem database. Searches can be run on Targets, Chemicals, and Bioassay records. A few pre-built datasets are also available to show how queried data can be further organized. Similar datasets can be added upon request.
- Learn How to Interpret Data Tables.
- Learn How to Aggregate Data into Hit-Calls & Datasets.
- Learn more Definitions & Concepts.
- Learn Advanced Features: (1) API access to the database and (2) embedding Tables in your website.
- Request large or custom data (No data scraping).
- Register to download data.
- Cite ScrubChem.org by Jason Bret Harris
- Download ScrubChem Icons
- Last Update: September 27, 2016 (More frequent updates will follow after publication.)
- Please report any issues and send your suggestions to the Developers.
- This is a beta release and additional features are still planned.
2. Embed ScrubChem Tables (using iframe)
Records (rows) in a Data Table
Each row in a Data Table contains a Test ID (TID) and a reported 'Value' for a Substance (SID) that has been tested and recorded in a PubChem BioAssay record (AID and MID).
These TIDs and their 'Values' are joined in this table with additional meta-data about the bioassay, including: the experimental 'Outcome', TID info,
Target info, and Chemical Info. There are over 150 meta-data fields but most fields are only viewable by using the ScrubChem API to download data in bulk.
The Filter Records by 'Justifications' button is a powerful ScrubChem feature that highlights the TIDs containing the 'Values' or measurements that best justify a reported 'Outcome'.
Read How to Aggregate Data to understand the process of making hit-calls and datasets from these filtered Records.
Filtering by 'Justifications'
This filters Data Tables to show only the Records with Test IDs (TIDs) most likely to contain 'Values' used in justifying each experimental 'Outcome'.
This filters out Records with Test IDs (TIDs) that contain 'Values' for meta-data which are not directly related to an 'Outcome'.
A special case ('Unjustifications') exists because some depositors omitted a direct link between some 'Outcomes' and TIDs. In these Records you will see the flag 'UNJUST' (unjustified).
For these Records, no TIDs are made explicitly available for determining the 'Justification', so the first available TID is retained instead; it may or may not be the 'best'
TID to use as a 'Justification'. This retention method is used only to display these Records and their associated meta-data for your review. An additional data
field called 'Justified TIDs' is available, which can be used in many of these cases to re-establish a link using TIDs available for other Substances (SIDs) in the same Bioassay Record (AID).
This approach is most useful when the 'Outcome' is 'Inactive', since analysis of many Records indicates that TIDs are mostly omitted when the measured 'Value' is zero (a low signal response).
Instead of reporting a zero, depositors provided no 'Value' at all (reporting a null), which results in the omission of a link to a relevant TID to use as the 'Justification'.
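The re-linking idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ScrubChem's implementation: the record layout (dicts with the keys `aid`, `sid`, `tid`, `outcome`, `flag`), the function name `relink_unjust`, and the output key `candidate_tids` are all hypothetical. The sketch collects the TIDs used as Justifications for other Substances in the same AID and offers them as candidates for 'UNJUST' Records whose 'Outcome' is 'Inactive'.

```python
from collections import defaultdict

def relink_unjust(records):
    """Attach candidate Justification TIDs to UNJUST records with an Inactive Outcome.

    Each record is a dict with hypothetical keys: aid, sid, tid, outcome, flag.
    """
    # Collect, per AID, the TIDs that justify Outcomes for other Substances.
    justified = defaultdict(set)
    for rec in records:
        if rec["flag"] != "UNJUST":
            justified[rec["aid"]].add(rec["tid"])

    relinked = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if rec["flag"] == "UNJUST" and rec["outcome"] == "Inactive":
            rec["candidate_tids"] = sorted(justified.get(rec["aid"], set()))
        relinked.append(rec)
    return relinked

# Made-up example data: SID 11 has no explicit Justification link.
records = [
    {"aid": 1, "sid": 10, "tid": 2, "outcome": "Active", "flag": "AC"},
    {"aid": 1, "sid": 11, "tid": 1, "outcome": "Inactive", "flag": "UNJUST"},
]
for rec in relink_unjust(records):
    print(rec["sid"], rec.get("candidate_tids"))
```

The candidates are suggestions for review only; whether a given TID is the right Justification for an 'UNJUST' Record still has to be judged per assay.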
Grouping Similar Data into Hit-Calls
Handling of Important Variables (e.g., Modality)
- Modality information is resolved and used to aggregate data from similar studies. Other variables may also be important.
– mode of target being tested (e.g., agonist, antagonist, inhibitor, activator).
– specified by a depositor as active, inactive, inconclusive, or unspecified.
– number of outcomes reported and used in deriving a hit-call.
– number of outcomes supporting a hit-call out of the total (N) reported outcomes.
– decimal representation of the Fraction.
– derived from the ratio of agreement, with a threshold greater than 0.5, for outcomes of a given chemical and tested modality.
– A hit-call made for a unique protein and modality pair.
Combined Modalities Hit-Call
– A hit-call made from combining all protein-modality hit-calls for the same protein (active if observed as active in at least one protein-modality hit-call).
– Number of modalities combined together to make a final hit-call.
– Total evidence (N) used in a hit-call made from combining modalities.
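The aggregation described above can be illustrated with a short Python sketch. It rests on several assumptions that the text does not confirm: only 'active' and 'inactive' outcomes are counted, a ratio that does not exceed 0.5 for either side yields 'inconclusive', and a combined hit-call with no active modality is reported as 'inactive'. The function names and dictionary keys are hypothetical.

```python
def protein_modality_hit_call(outcomes, threshold=0.5):
    """Hit-call for one chemical, protein, and modality.

    Assumption: only 'active' and 'inactive' outcomes are counted.
    """
    counted = [o for o in outcomes if o in ("active", "inactive")]
    n = len(counted)
    if n == 0:
        return {"call": "inconclusive", "n": 0, "fraction": "0/0", "ratio": None}
    n_active = counted.count("active")
    n_inactive = n - n_active
    if n_active / n > threshold:
        call, support = "active", n_active
    elif n_inactive / n > threshold:
        call, support = "inactive", n_inactive
    else:
        # Assumption: no side exceeds the threshold, so no call is made.
        call, support = "inconclusive", max(n_active, n_inactive)
    return {"call": call, "n": n, "fraction": f"{support}/{n}", "ratio": support / n}

def combined_modalities_hit_call(calls_by_modality):
    """Combine protein-modality hit-calls for the same protein.

    Active if at least one modality is active (from the text);
    the 'inactive' fallback is an assumption.
    """
    calls = list(calls_by_modality.values())
    combined = "active" if any(c["call"] == "active" for c in calls) else "inactive"
    return {
        "call": combined,
        "modalities": len(calls),
        "total_n": sum(c["n"] for c in calls),
    }

# Made-up outcomes for one chemical tested against one protein.
agonist = protein_modality_hit_call(["active", "active", "inactive"])
antagonist = protein_modality_hit_call(["inactive", "inactive"])
combined = combined_modalities_hit_call({"agonist": agonist, "antagonist": antagonist})
print(agonist, antagonist, combined)
```

Keeping N and the Fraction alongside each hit-call makes it possible to filter later on the amount of supporting evidence instead of on the call alone.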
Critical Terminology (α NCBI Term, β ScrubChem Term):
- α Assay ID (AID) – An individual assay record ID.
- α Panel Member ID (MID) – An ID for related assays (e.g., counter or toxicity screens) that are nested within the same parent AID.
- α Test ID (TID) – An internal ID used in each AID to provide a link (one to many) between data descriptions and results. Each new description field (e.g., EC50, Activity at 0.5um, …) generates a new TID number (e.g., 1, 2, …) for which many Substances (SIDs) and their results can be mapped if it is applicable to their test design.
- α Substance ID (SID) – An assigned ID for each substance in a depositor’s assay record (many SIDs for each CID).
- α Compound ID (CID) – A PubChem assigned ID for each unique chemical structure (single CID for many SIDs).
- α GenBank ID (GI and Accession) – GenBank reference number for a protein sequence record (many GIs for the same protein whereas Accessions are one-to-one by using an appended number for version control). *ScrubChem converts GIs to Accessions since GenBank deprecated the use of GIs as of September 2016.
- α Taxonomic ID (TaxID) – A numerical ID for each unique organism.
- α, β Outcome – Depositor specified result assigned to each tested substance: 0 = Inactive, 1 = Active, 2 = Inconclusive, 3 = Unspecified, 4 = Probe (very active).
- β Justifications – Standardized annotation tags for substances and their TIDs which describe measurements used to justify the reported Outcome (e.g., AC, TC_single, TC_min, TC_max, PubVal, AC_fix).
- β Best Justification – Highest ranked Justification tag when more than one occur (e.g., AC > TC).
- β WorthyTag – Standardized annotation tag for TIDs describing measurements (Bioactivities) worthy to be considered as putative Justifications for the reported Outcome.
- β Justified TIDs – Record of the final TIDs used as Justifications in each assay.
- β UNJUST – Flag used when a substance’s Outcome does not link to a TID determined to be a Best Justification for the assay. This flag is placed on any single available TID for reference purposes so that an alternative TID link can be considered to describe the Outcome’s Justification.
- α Active Concentration (AC) – Flag for a TID referring to a single summary value from a dose concentration response curve (e.g., EC50).
- α Test Concentration (TC) – Flag for a TID referring to an individual test dose concentration (e.g., Activity at 0.5um).
- β TC_min/TC_max/TC_mid – Flag for a TID referring to either a minimum, maximum, or middle-range dose concentration.
- β PubVal – Flag for a TID referring to a literature-published measurement (usually from the source ChEMBL).
- β AC_FIX – Flag for a TID identified as needing an AC or TC mark that was not given by the depositor (e.g., a TID describing an EC50 that was not flagged as an Active Concentration “AC”).
- β Bioactivity – A reported measurement or endpoint for an assayed substance which is given a WorthyTag.
- β Hit-Call – Summary activity determined from aggregated data on the same chemical and target.
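Two of the definitions above lend themselves to a small Python sketch: decoding Outcome codes and picking a Best Justification. The Outcome labels are taken from the list above. The ranking is an assumption: the text only states "AC > TC", so this sketch ranks tags by family (for example, `TC_min` counts as `TC`, `AC_fix` as `AC`) and ignores tags such as `PubVal` whose rank is not given. All names are hypothetical.

```python
# Outcome codes as listed in the terminology above.
OUTCOME_LABELS = {
    0: "Inactive",
    1: "Active",
    2: "Inconclusive",
    3: "Unspecified",
    4: "Probe",
}

# Assumed ranking: only "AC > TC" is stated in the text.
JUSTIFICATION_RANK = ["AC", "TC"]

def tag_family(tag):
    """Map a tag such as 'TC_min' or 'AC_fix' to its family ('TC' or 'AC')."""
    return tag.split("_")[0].upper()

def best_justification(tags):
    """Return the highest-ranked tag, or None if no tag has a known rank."""
    ranked = [t for t in tags if tag_family(t) in JUSTIFICATION_RANK]
    if not ranked:
        return None
    return min(ranked, key=lambda t: JUSTIFICATION_RANK.index(tag_family(t)))

print(OUTCOME_LABELS[1])
print(best_justification(["TC_max", "AC", "TC_min"]))
print(best_justification(["PubVal"]))
```

If the full ranking of Justification tags is known, extending `JUSTIFICATION_RANK` is the only change needed.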
There is to be no scraping or programmatic access of information on this website without express permission. ScrubChem and its agents cannot guarantee the accuracy
of any information provided and do not advise the use of this data for medical decisions.
This data and software are provided “as is”, “where is” and without any express or implied warranties, including, but not limited to, any implied
warranties of merchantability and/or fitness for a particular purpose, or any warranties that use will not infringe any third party patents,
copyrights, trademarks or other rights. In no event shall ScrubChem, nor its agents, employers
or representatives be liable for any direct, indirect, incidental, special, exemplary, or consequential damages however caused and on any
theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way or form out of
the use of this data or software, even if advised of the possibility of such damage.
All materials on these pages are copyrighted and all rights reserved unless otherwise indicated.
No part of these pages in any form, either text or image, may be used for any purpose other than personal use.
Restrictions include but are not limited to derivative works, reproduction, modification, storage in a retrieval
system, retransmission by any means, electronic, mechanical or otherwise.