More challenges for machine learning protein-protein interactions

All test set and cross-validation splits used in our analyses are available here. Corresponding sequences and evolutionary profiles can be downloaded for human (fasta, profiles) and yeast (fasta, profiles).


The data are available under the Academic Free License v3.0 (AFL-3.0).

Folder structure and naming conventions

  • The datasets used in the paper "Evolutionary profiles improve protein-protein interaction prediction from sequence" can be found in subfolder "redundancyTest/nonRed/<C1,C2,C3>/<yeastCV,humanCV,humanNew>/".
  • Folders C1, C2 and C3 always refer to the respective difficulty of the prediction problem.
  • In folder "negativeSimTest", sequence similarity between negative training and test PPIs has either been allowed (subfolder "nonRed") or not (subfolder "nonRed_noSim").
  • In folder "redundancyTest" the three subfolders "nonRed", "iaRed" and "seqRed" correspond to the three kinds of redundancy amongst training PPIs.
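The path convention for the paper datasets can be sketched in Python as follows. This is only an illustration: the function name paper_dataset_dir and the root argument (the location where the archive was unpacked) are hypothetical and not part of the data release; the folder names are taken from the list above.

```python
from pathlib import Path

# Folder names as given in the naming conventions above.
DIFFICULTIES = ("C1", "C2", "C3")
DATASETS = ("yeastCV", "humanCV", "humanNew")


def paper_dataset_dir(root, difficulty, dataset):
    """Build the path redundancyTest/nonRed/<difficulty>/<dataset> under root.

    `root` is an assumption: wherever the downloaded data was unpacked.
    """
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"unknown difficulty folder: {difficulty!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset folder: {dataset!r}")
    return Path(root) / "redundancyTest" / "nonRed" / difficulty / dataset
```

For example, paper_dataset_dir("data", "C1", "yeastCV") points at the yeast cross-validation sets of difficulty C1 below a local "data" folder.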

Every analysis was repeated 10 times from scratch, so each subfolder mentioned above contains 10 "split_" folders. Each "split_" folder holds either 10 training sets (for cross-validation) or a single training set (for tests on new data), and every training set has exactly one test set. Positive and negative PPIs are always stored in separate files.
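A quick way to check a downloaded copy against this layout is to list the "split_" folders of one subfolder. The sketch below is a minimal example, not an official loader: the function names are hypothetical, and because the text above does not specify what follows "split_" or how the training and test files are named, the code only matches the "split_" prefix and counts files without interpreting them.

```python
from pathlib import Path


def list_split_dirs(dataset_dir):
    """Return all subfolders of dataset_dir whose name starts with "split_"."""
    return sorted(
        p for p in Path(dataset_dir).iterdir()
        if p.is_dir() and p.name.startswith("split_")
    )


def count_files_per_split(dataset_dir):
    """Map each "split_" folder name to the number of files directly inside it.

    File names are not interpreted, since their naming scheme is not
    described in the text above.
    """
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in list_split_dirs(dataset_dir)
    }
```

According to the description above, list_split_dirs should return 10 folders for every subfolder of the data.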


