I copied and pasted this title directly from another document. Actually, it’s the one about digital plagiarism from the ORI scholarly integrity website that I want to discuss. Protecting words and ideas in the digital age is like suffocating them. The problem lies almost as much in the protection itself as in the failure to reference. As long as one attributes the words and the ideas and the links and the titles to their original source, I think stealing should be celebrated. Borrowing with a link back is what drives Google’s PageRank algorithm and the whole field of search engine optimization. We live in an inherently connected world. Not borrowing ideas and inspiration is ludicrous. Researchers must embrace the burden of tracing the path back to (what they best conceive as) the original source of information. It can quickly become a rabbit hole, but one we have no choice but to dive down headfirst.
But where were we? Let’s break down a paragraph from Dr. Albert Teich’s session:
Computer word processors with such peripherals as scanners and laser printers make it substantially easier to appropriate another person's text, to manipulate that text, to publish and distribute that text, to use the text in unpublished documents, such as a proposal, or to manipulate or alter graphics or even photographic images.
First of all, scanners are no longer necessary with apps like Genius Scan, let’s be honest. Printing has its value when dealing with government bureaucracies, but most documents can now live out their entire life cycle as pixels stored on a server, waiting for their chance to be loaded onto a display. As far as the appropriation of another person’s text goes, I’ve done it now in about three different ways (sorry Albert!). First, in the title, with no immediate citation except a link to the original document in the first line of text. Second, in the quote above, with direct reference to the author who originally wrote the words. And third, in my photo, where the title and the quote are both displayed on my computer screen and on my phone without any reference at all. Virginia Tech now encourages software such as iThenticate for identifying plagiarism and even requires it for graduate-level theses and dissertations. But the software only scans official publications from the abstract to the appendices and only looks for matches of nine words or more. It also excludes all text explicitly quoted from other work. If Dr. Teich’s session review was (or is) published anywhere, my blog title would be the only thing to raise a red flag, because it is eleven words long and repeated in the first original paragraph of text. All the stolen text in my picture and my quote live to lie another day.
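To make that loophole concrete, here is a minimal sketch of the kind of check described above: drop anything inside quotation marks, then flag any run of nine consecutive words that also appears in a known source. This is not how iThenticate is actually implemented; the function names, the word-splitting rule, and the quote handling are all my own assumptions for illustration.

```python
import re

MIN_RUN = 9  # match length mentioned above; the real tool's setting may differ


def strip_quotes(text):
    """Remove anything inside straight or curly double quotes (assumed rule)."""
    return re.sub(r'"[^"]*"|“[^”]*”', " ", text)


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, n=MIN_RUN):
    """Return every run of n consecutive words as a set of tuples."""
    w = words(text)
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}


def flagged_runs(submission, source, n=MIN_RUN):
    """Runs of n words shared by the source and the unquoted submission text."""
    return shingles(strip_quotes(submission), n) & shingles(source, n)
```

Under these assumptions, a copied sentence is flagged when it appears bare, but the very same sentence wrapped in quotation marks passes untouched, and text that only exists inside an image never reaches the comparison at all.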
Now to the role of computers in both creating and detecting these types of blatant plagiarism. Natural language processing is the algorithmic foundation of most programs like iThenticate. Software engineers and computer scientists, working with sociologists and neuroscientists who study the perception of language (computational linguistics), have figured out how to teach computers to read written text. Once the text under consideration and the text of every published article that the program’s database includes have been read, computers do what they do best: compare. The explicit text part is easy. It is the graphics that present more of a challenge.
Computer vision is now helping programs to interpret text captured in a photo, but it still has a long way to go. That is why verification software like CAPTCHA asks you to interpret distorted text almost every time you try to make a transaction or send a message online. It wants to make sure you are not a bot. Bots still struggle to read distorted graphical text like the quote or the keyboard in my photo, at least for now.
All of this can be done from virtually anywhere in the world, without leaving a trace of the previous version of the document.
Sorry Albert, the datedness of your session is showing once again. When you wrote it, I am sure blockchain did not exist. Hate on cryptocurrencies and the Bitcoin bubble all you’d like, but at the end of the day blockchain is about securing a digital identity. It requires a private key in order to carry out any transaction, and as long as you keep that key secure, no original content can get posted or manipulated without your consent. The Virginia Tech alumnus and CEO of Blacksburg’s Block.one always uses the example of Elon Musk’s Twitter account. If Twitter ran on a blockchain, no one could post on Elon’s account unless they had his private key. All this to say: yes, graphic and textual content can be manipulated, but the same type of technological breakthroughs that made those alterations possible are now allowing us to track and store all manipulations of content on a secure and immutable ledger. Even though people will continue to manipulate words and ideas (and graphics), at least now we’ll know who does the manipulating and when they did it.
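For the curious, the two ideas in that paragraph (a key that authorizes every post, and a ledger where each entry is chained to the one before it) can be sketched in a few lines. This is a toy under loud assumptions: Python’s standard library has no public-key signatures, so HMAC with a secret key stands in for a real digital signature, there is no network or consensus, and every function name here is hypothetical rather than any real blockchain’s API.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def sign(private_key, payload):
    """Stand-in signature: real chains use public-key signatures, not HMAC."""
    return hmac.new(private_key, payload.encode(), hashlib.sha256).hexdigest()


def entry_hash(entry):
    """Hash a whole entry so the next entry can point back at it."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def append(ledger, author, content, private_key):
    """Add a post that is signed by the author and chained to the previous one."""
    prev = entry_hash(ledger[-1]) if ledger else GENESIS
    payload = f"{prev}|{author}|{content}"
    ledger.append({"prev": prev, "author": author, "content": content,
                   "sig": sign(private_key, payload)})


def verify(ledger, keys):
    """Check every link in the chain and every signature against known keys."""
    prev = GENESIS
    for e in ledger:
        if e["author"] not in keys or e["prev"] != prev:
            return False
        payload = f"{e['prev']}|{e['author']}|{e['content']}"
        if not hmac.compare_digest(e["sig"], sign(keys[e["author"]], payload)):
            return False
        prev = entry_hash(e)
    return True
```

In this sketch, a post made without the right key fails verification, and quietly editing an old entry breaks its signature and the link to the entry that follows it, which is the sense in which the ledger shows that a manipulation happened.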
And lastly, let’s bring it back to ethics. As researchers, of course we have to be honest about where our words and ideas and everything else that inspires us come from. Of course we have to be honest about our data sources and the statistics we create to represent them. Of course there will still be those who try to game the system and cheat and steal their way to fame and fortune. We already have the tools to start protecting ourselves from them. What we don’t have is a good enough understanding of the tools themselves and the bias that we code into them. In this digital day and age I have to appreciate the work of FAT/ML even more than Dr. Teich’s sessions at ORI. We have to break down the black box and prevent our own biases from fueling a new era of fierce computational inequality. With all this machine learning going on, it seems like the computers are now studying more skewed information than the human academics their algorithms are trying to protect.