Data Sharing and the Digital Science Commons
by: Mustafa Ünlü, Associate Editor, MTTLR
I. The curiously chameleonic properties of data ownership
Image: Protein models by Alan Wolf. Used under a Creative Commons BY-NC-CA 2.0 license.
Data1 is both the primary output and the most vital input of the scientific process. In fact, data sharing performs such a key role2 that without a commons based on publicly shared data, scientific progress would surely suffer.3 In addition, data forms the foundation for downstream commercial applications aimed at privatizing the fruits of the scientific enterprise.4 Yet, despite this importance, data ownership rules are subject to a unique, inchoate IP regime that is neither copyright, patent, nor trademark. Moreover, these rules change over time, depending on whether the data has been published. Prior to publication, most data is treated as proprietary and secret.5 At this early stage, data sharing is governed by informal norms, which are enforced, if at all, under a minimal, liability rule-based legal infrastructure.6 After publication, data loses its protected status and becomes a part of the public domain. At this later stage, data sharing comes under a default rule of open and free access.
The Supreme Court has confirmed that copyright does not, and was not meant to, protect published data.7 The Court's rationale rests on principles that uphold the commons. “The very object of publishing a book on science or the useful arts is to communicate to the world the useful knowledge which it contains. But this object would be frustrated if the knowledge could not be used without incurring the guilt of piracy of the book.”8 In spite of the commitment to open access after publication, post-publication privatization inevitably leads to interactions between upstream data sharing and exclusive IP rights.9
II. Data sharing under stress
This unique data ownership IP regime, which has arguably been in place for centuries,10 is coming under increasing pressure in two separate ways. The rate at which the commons is becoming privatized is decreasing the incentives to share, while the data itself, due to the growing size and complexity of outputs, is becoming harder to disseminate. Scholars have noted that accelerating commercialization of downstream inventions has undermined informal sharing norms for unpublished data.11 At the same time, technological advances have caused “fundamental shifts in the practices and structures of scholarly communication”12 as data has “become more complex, more extensive, more elaborate [and] more community-based.”13 These disparate sources of stress have combined to bring about a “sea change” in the “nature of biological inquiry” and scientific norms related to data sharing.14 As a result, the science commons has not benefited from the Internet-enabled efficiency gains which have brought about tremendous advances in the applied technology and commercial spheres such as those attained by Google in its search engine implementation.15
This post limits itself to analyzing the liability ramifications of a technological solution to the second problem: burgeoning datasets of increasing size and complexity (“BDISCs”) as obstacles to scientific progress. A digital infrastructure that allows widespread sharing of BDISCs throughout the scientific community may shape the future commons in ways that go beyond fixing the problem at hand, by causing the scientific community to reconsider and revamp the rules of data ownership in both the pre- and post-publication stages. That subject, however, is better left for exploration at a later time.
III. Tranche: A peer-to-peer (P2P) data sharing solution to the BDISC problem
Though certain specialized disciplines have already implemented norms of data sharing, enforced by either journal editors or policy guidelines,16 technological solutions for enabling access to and propagating BDISCs have lagged behind. The BDISC problem is most severe at the cutting edge of research, in areas such as genomics and proteomics. Such projects are growing more complex and interdisciplinary and are generating increasingly large and rich datasets.17 It is here that the scientific community's need to share, access, and annotate data is the greatest.
In the proteomics arena, one technological solution to this problem has been to deploy Tranche, a novel, P2P-based network for data sharing.18 Tranche combines the sharing efficiency and scalability of a BitTorrent network with a secure, encrypted storage system that allows data owners to retain control of disclosure.19 As a free, open source tool, Tranche has gained acceptance in the community and, as of this writing, is hosting several thousand proteomics-related datasets.20 Though Tranche was developed to address problems of data sharing in the proteomics context, the solution it embodies should be generally applicable across all scientific disciplines.
Tranche promises to enhance and change the manner in which the science commons is constructed. It enables temporal persistence of large-scale data and its associated identifiers and annotations, a very desirable improvement that is otherwise challenging to implement under traditional print-based data sharing systems. It allows instant and widespread data sharing and gives the data owner the ability to choose from various licenses under which data is shared.21 Content owners can further control access by selectively distributing decryption keys, allowing access to either the entire community or designated groups or individuals. Once granted, access privileges can also be amended over time. In this way, Tranche permits the continued operation of the pre-publication, informal rules of data sharing as well as the post-publication commitment to the public domain. Fully exploring the impact of an infrastructure that allows access to data by the entire scientific community in an immediate, efficient, near-zero cost manner at all stages of publication is, as mentioned, beyond the scope of this post. I will instead end with a brief examination of the liability ramifications of deploying a free P2P network for data sharing.
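To make the access model concrete, the following is a minimal Python sketch of selective key distribution as described above. It is an illustration only, not Tranche's actual code or API: the class SharedDataset, its method names, the license label, and the toy XOR cipher (which stands in for real encryption and is not secure) are all assumptions introduced here.

```python
import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    """Derive a byte stream from the key (toy construction, NOT secure)."""
    stream = bytearray()
    counter = 0
    while len(stream) < length:
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(stream[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class SharedDataset:
    """Hypothetical model: an encrypted dataset whose owner controls the key."""

    def __init__(self, owner: str, plaintext: bytes, license_name: str):
        self.owner = owner
        self.license_name = license_name  # license chosen by the data owner
        self._key = secrets.token_bytes(32)
        # Any peer may store the ciphertext; it is unreadable without the key.
        self.ciphertext = toy_cipher(self._key, plaintext)
        # Identifier derived from the stored bytes, so it stays stable over time.
        self.dataset_id = hashlib.sha256(self.ciphertext).hexdigest()
        self._key_holders = {owner}
        self._public = False

    def grant(self, user: str) -> None:
        """Pre-publication: hand the key to a chosen group or individual."""
        self._key_holders.add(user)

    def publish(self) -> None:
        """Post-publication: release the key to the entire community."""
        self._public = True

    def read(self, user: str) -> bytes:
        """Return the decrypted data if the user has access to the key."""
        if not (self._public or user in self._key_holders):
            raise PermissionError(f"{user} does not hold the decryption key")
        return toy_cipher(self._key, self.ciphertext)


# Usage: share privately first, then open the data on publication.
dataset = SharedDataset("lab_a", b"spectrum 1: 512.3 m/z", "CC0")
dataset.grant("collaborator")
dataset.publish()
```

In this sketch, a data owner would share the key with collaborators before publication by calling grant, and then call publish to open the dataset to everyone. Revoking a key that has already been handed out is deliberately not modeled, since that would require re-encrypting the data.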
IV. Liability ramifications of widespread use of P2P scientific data sharing networks
Due to their involvement in several high-profile copyright infringement cases (where the operators lost), P2P networks do not currently enjoy a good reputation with content owners.22 One obstacle Tranche faces in gaining widespread acceptance could therefore lie in the perception that it might enable the sharing of infringing content. Liability for Tranche operators would be based on a theory of inducement to infringe, as articulated by the Supreme Court in its landmark P2P decision.23 Tranche has several features that undercut such an inducement theory. First, unlike the defendants in Grokster, Tranche operators neither advertise nor otherwise encourage infringing activity. In fact, the opposite is true: Tranche is first and foremost a tool to share large datasets. Any sharing activity not related to this primary goal would be unwelcome since it would degrade the network's performance. Tranche is also well-suited to removing infringing content and users (provided that they can be identified) since uploading privileges are granted at the discretion of the operator and may be revoked.24 Finally, in a long line of cases involving technological tools capable of being used in an impermissible manner, courts have recognized that liability does not attach to the developer of the tool “if the product is widely used for legitimate, unobjectionable purposes. Indeed, it need merely be capable of substantial noninfringing uses.”25 Tranche is not merely capable of such use; a substantial portion of its actual use is for legitimate purposes. Therefore, any liability concerns for deploying or using Tranche based on its P2P nature should be minimal.
V. A digitalized science commons
The manner in which scientists share data has a vitally important influence on the shape and scope of the science commons. In other words, the commons is shaped both by the rules under which it operates and by the technology that enables it. Though existing rules and technology have stayed fairly constant for a very long time, both are under pressure to change from various quarters. Tranche is a P2P solution that utilizes Internet technology to modernize data sharing at a fundamental level. As either Tranche or a similar tool gains widespread acceptance and use in the community, the scientific commons will take a big step towards becoming entirely digital. Even though Tranche fits in with and allows the continued operation of existing rules of data sharing, it also provides more options and flexibility to both producers and consumers of the commons. This, in turn, will almost certainly require a re-thinking of the rules governing data sharing. Tranche's technological capabilities should allow the community to move with equal ease towards a more market-based model favoring privatization, as advocated by at least one scholar,26 or to stick with and expand upon the commons ideal.
1. For the purposes of this post, “Data” may be defined as “experimental observations, results and related research methodologies.” See also California Institute for Regenerative Medicine, Intellectual Property Policy for Non-Profit Organizations, 2-3 (defining “Data” and “Biological Materials”).
2. Comm. on Responsibilities of Authorship in the Biological Scis., Nat'l Research Council, Sharing Publication-Related Data and Materials, 1, 21 (2003), available at http://books.nap.edu/openbook.php?record_id=10613 [hereinafter Sharing] (“The publication of experimental results and sharing of research materials related to those results have long been key elements of the life sciences.”); John Wilbanks, Cyberinfrastructure For Knowledge Sharing, Ctwatch Quarterly, Aug. 2007 [hereinafter Cyberinfrastructure] (“Knowledge sharing is at the root of scholarship and science.”).
3. Sir Isaac Newton's aphorism “If I have seen further, it is by standing on ye sholders of Giants,” is often quoted as embodying this principle, but the origins of the concept precede him. See Robert K. Merton, On the Shoulders of Giants, 9 (Univ. of Chi. Press 1993), available publicly at http://books.google.com/books?id=o90uC4jMw1EC.
4. Richard R. Nelson, The market economy, and the scientific commons, 33 Res. Pol'y. 455, 457-59 (2004).
5. The exception is large-scale government funded projects with formalized data sharing goals. J.H. Reichman & Paul F. Uhlir, A Contractually Reconstructed Commons for Scientific Data in a Highly Protectionist Intellectual Property Environment, 66-SPG L. & Contemp. Probs. 315, 333-36 (2003).
6. See id. at 349-51 (discussing the legal regime that governs the zone of informal data exchange amongst scientists).
7. See Feist Publ'ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 349-350 (1991) (“[R]aw facts may be copied at will. This result is neither unfair nor unfortunate. It is the means by which copyright advances the progress of science and art.”).
8. Id. at 350, quoting Baker v. Selden, 101 U.S. 99, 103 (1880).
9. David W. Opderbeck, The Penguin's Genome, or Coase and Open Source Biotechnology, 18 Harv. J. L. & Tech. 167, 173-87 (2004) (discussing the various ways recent changes in IP protections affect the biotechnology commons).
10. Sharing, supra note 2, at 27; Cyberinfrastructure, supra note 2 (“[T]his system has served science extraordinarily well over the more than three hundred years since scholarly journals were birthed.”).
11. Rebecca Eisenberg, Proprietary Rights and the Norms of Science in Biotechnology Research, 97 Yale L.J. 177, 177 (1987); Nelson, supra note 4, at 455.
12. Clifford Lynch, The Shape of the Scientific Article in The Developing Cyberinfrastructure, Ctwatch Quarterly, Aug. 2007, http://www.ctwatch.org/quarterly/articles/2007/08/the-shape-of-the-scientific-article-in-the-developing-cyberinfrastructure/
13. Id.
14. Nat'l Research Council, Reaping the Benefits of Genomic and Proteomic Research, 1 (2006), available at http://books.nap.edu/openbook.php?record_id=11487 [hereinafter Reaping].
15. Cyberinfrastructure, supra note 2 (“The materials that underpin [data], are 'dark' to the Web, invisible, and not subject to the efficiency gains we take for granted in the consumer world.”).
16. See, e.g., Sharing, supra note 2, at 4; Lynch, supra note 12 (“[S]pecific communities . . . have established norms, enforced by the editorial policies of their journals, which call for deposit of specific types of data within an international system of data repositories.”).
17. Reaping, supra note 14, at 42.
18. Tranche Project Homepage - Secure Scientific Data Dissemination, http://tranche.proteomecommons.org/ (last visited Aug. 28, 2008).
19. Id.
20. Id.
21. Tranche Project - Quick Start: Uploading Data, http://tranche.proteomecommons.org/users/quickstart-upload.html (last visited Aug. 28, 2008).
22. Mark G. Tratos, The Impact of the Internet & Digital Media on the Entertainment Industry, 896 Prac. L. Inst. 133, 234 (2007) (“[W]here peer-to-peer filing sharing companies were once ignoring (and in some cases, promoting) illegal file-sharing, these same companies are scrambling to establish a reputation as friends and advocates of copyright holders.”).
23. Metro-Goldwyn-Mayer Studios Inc. v. Grokster, Ltd., 545 U.S. 913, 936-937 (2005) (“[O]ne who distributes a device with the object of promoting its use to infringe copyright, as shown by clear expression or other affirmative steps taken to foster infringement, is liable for the resulting acts of infringement by third parties.”).
24. http://tranche.proteomecommons.org/users/quickstart-upload.html (last visited Aug. 28, 2008).
25. Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417, 442 (1984).
26. Opderbeck, supra note 9, at 218 (“[A]n open source biotechnology model likely will do little to facilitate long term, significant innovation.”).