The modern digital landscape that facilitates real-time collaboration has also created an enormous attack surface. Today, documents containing malicious constructs are a leading cause of data breaches in the government and private sectors.
Warnings about the dangers of suspicious files abound, yet people’s jobs depend on documents – reading and reacting to them, filling out forms, etc. Even if one can authenticate the immediate provider of the document, its contents may originate from an untrusted source and may contain malicious data payloads.
Researchers with DARPA’s Safe Documents (SafeDocs) program have developed new methods and tools that allow people to confidently open documents and trust what they see on their screens.
Launched in 2018, SafeDocs set out to improve the security of electronic communication, particularly in sensitive or critical applications such as military or government operations. Since then, SafeDocs research and development have reduced the complexity of document formats, which are the rules that documents must obey so that software can open them. In addition, teams radically improved software’s ability to reject invalid and malicious data without impacting the core functionality of new and existing electronic data formats. SafeDocs tools have also helped preserve electronic document history and keep feature-rich electronic documents viable.
“Today, electronic data is the attack surface,” said Dr. Sergey Bratus, DARPA’s SafeDocs program manager in the Information Innovation Office. “Attackers abuse excessive complexity and ambiguity of document format rules to sneak in malicious payloads past the scanners. SafeDocs’ formal methods approach helps uncover and eliminate the dark corners where the attackers love to hide. Resulting technologies make trusting incoming data via documents viable for many industries, including those dealing with critical infrastructure.”
DOCUMENT SECURITY 101 – IF YOU CAN’T DEFINE IT, YOU CAN’T DEFEND IT
Document formats are quite complex. One might think of a document as an inert piece of digital paper, but it comprises many technical features. Those features interact with the complexities within the software that interprets the document and renders it on the screen.
The complexities in file formats create opportunities for attackers to hide. Current software that processes digital data such as documents, messages, and data streams is error-prone and vulnerable to exploitation by malicious inputs.
Complexity can also lead to ambiguity and misunderstanding, presenting opportunities to attackers who can manipulate data in complex and confusing data formats. For instance, the widely used Portable Document Format (PDF) specification is 1,000 pages of English text with over 70 normative references to other documents, many of which have voluminous, normative references of their own.
The PDF specification’s sheer size and complexity can lead, and has led, to varying interpretations. Research suggests that despite official standards, most implementations follow de facto standards defined by file malformations deemed benign and supported by other permissive software. According to a series of recent papers, a PDF file containing encrypted data could be manipulated to exfiltrate the data to a specified location when the user interacts with the document. Even cryptographically signed PDF files could be manipulated to make a fake signature appear valid or a tampered file appear intact. Furthermore, malicious payloads included in PDF files could be hidden from security scanning software.
SafeDocs performers developed methodologies and tools for capturing and defining human-intelligible, machine-readable descriptions of electronic data formats to address the ambiguity and complexity of file formats. Performing teams also created automated software construction kits for building secure, verified scanners using the simplified format subsets where the existing format’s inherent complexity or ambiguity had been reduced for safety.
According to Bratus, this approach strikes at the root cause of scanner vulnerabilities: the programmers’ errors due to misreading the format’s rules or failing to check them correctly. To correctly implement a scanner for a modern document format such as PDF, a programmer must understand the thousands of rules and their interactions and ensure the code checks them all, an impossible task even for the most careful programmer.
“Acting on an unchecked assumption is the recipe for code vulnerability,” said Bratus. “SafeDocs helps the programmer avoid implementation errors due to misunderstanding or accidental omission by generating the code automatically.”
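The idea of deriving a strict scanner from a machine-readable format description can be illustrated with a minimal sketch. The toy binary format below (a `TOY1` magic value, a version byte, a two-byte length, then a payload) is hypothetical and has nothing to do with PDF, and the names `Field`, `HEADER_SPEC`, `parse`, and `RejectedInput` are assumptions made for illustration. It is not a SafeDocs tool, only an example of a parser whose every check comes from a declared rule rather than from a programmer's memory of the specification.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Optional


class RejectedInput(ValueError):
    """Raised when input violates the declared format rules."""


@dataclass(frozen=True)
class Field:
    name: str
    fmt: str  # struct format string for one fixed-size value
    check: Optional[Callable[[object], bool]] = None  # rule the value must satisfy


# Hypothetical toy format description (NOT PDF): every rule is stated as data.
HEADER_SPEC = [
    Field("magic", ">4s", lambda v: v == b"TOY1"),
    Field("version", ">B", lambda v: v in (1, 2)),
    Field("length", ">H", lambda v: v <= 1024),
]


def parse(data: bytes, spec=HEADER_SPEC) -> dict:
    """Parse `data` by walking the spec; reject anything the spec does not allow."""
    offset = 0
    record = {}
    for field in spec:
        size = struct.calcsize(field.fmt)
        if offset + size > len(data):
            raise RejectedInput(f"truncated before field {field.name!r}")
        (value,) = struct.unpack_from(field.fmt, data, offset)
        if field.check is not None and not field.check(value):
            raise RejectedInput(f"field {field.name!r} has illegal value {value!r}")
        record[field.name] = value
        offset += size

    # The payload must be exactly as long as declared: no missing or trailing bytes.
    payload = data[offset:]
    if len(payload) != record["length"]:
        raise RejectedInput("payload size does not match declared length")
    record["payload"] = payload
    return record
```

Calling `parse(b"TOY1\x01\x00\x03abc")` returns a record whose payload is `b"abc"`, while an input with a wrong magic value, an unknown version, or a single extra trailing byte raises `RejectedInput`. Because the loop applies every rule in the description, adding a rule to `HEADER_SPEC` adds the corresponding check with no hand-written parsing code to forget.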
PDF’s widespread usage, complexity, occasional ambiguity, and the diversity of implementations prompted DARPA to engage the PDF Association. The PDF Association is the umbrella organization representing the PDF technology ecosystem, including companies such as Adobe, DocuSign and Foxit, stakeholders such as Boeing, free software projects such as LaTeX, and government agencies such as the U.S. National Archives and the U.S. Library of Congress. DARPA sought to use the format as a test and demonstration vehicle for SafeDocs performers to create the systems, tools and specifications to help enhance the security of PDF and other digital document formats.
Together, the PDF Association and other SafeDocs performers took on a critical challenge – create unambiguous definitions that help computers reason about document formats and use automatically generated scanners to reject malicious input and avoid confusion caused by ambiguity. As a result, they have accomplished the following:
- Filed 117 disambiguating edits to the international standard for PDF (ISO 32000-2, also known as PDF 2.0), 88 of which have been fully resolved and approved by ISO with solutions publicly available;
- Developed the Arlington PDF Model, the first vendor-neutral, open-source, specification-derived, machine- and human-readable definition of PDF data objects;
- Completed a security audit of the International Color Consortium’s (ICC) color profile format used in PDF and many image formats, resulting in updates to the ICC specifications and a move to incorporate machine-readable data descriptions to assist implementers. ICC color profiles are integral to the accurate rendering of images and can be used for malicious purposes, as River Loop Security and the PDF Association describe in their analysis;
- Identified the need and directed the curation of a new PDF file corpus, CC-MAIN-2021-31-PDF-UNTRUNCATED, to support research and format awareness; and
- Automatically generated tests and parsers to address human coding error, reducing work time from three years to one day.
“DARPA and the PDF Association are helping standards organizations redefine software specifications and even standards development processes that could help mitigate billions of dollars in terms of loss of productivity caused by data breaches,” said Bratus. “Through our collaborative efforts, we’ve shown the ability to eliminate the root cause of ambiguity, the place for the attackers to hide within the complexity of modern documents.”
THE FUTURE OF SAFEDOCS GOES BEYOND ‘DOCS’
Bratus envisions expanding SafeDocs solutions beyond documents to other file formats, such as those used for operating cars and military systems, streaming video and beyond.
“If every data format could be designed with SafeDocs tools, we’d significantly reduce systems’ vulnerabilities to crafted malicious data attacks,” he said.
As such, DARPA is in the process of transitioning tools to government partners.
As part of its third phase, the Open Group Sensor Open Systems Architecture (SOSA) Consortium has been exploring SafeDocs data modeling technologies for incorporation within the SOSA standard. The SOSA approach establishes guidelines for Command, Control, Communications, Computers, Cyber, Intelligence, Surveillance and Reconnaissance (C5ISR) systems. The objective is to allow flexibility in the selection and acquisition of sensors and subsystems that provide sensor data collection, processing, exploitation, communication, and related functions over the full life cycle of the C5ISR system.
The Electronic Records Processing Branch at the National Archives and Records Administration (NARA) has also benefited from SafeDocs. As a SafeDocs performer, the NASA Jet Propulsion Laboratory (JPL) improved one of NARA’s tools, Apache Tika, which automatically identifies embedded and corrupt files and extracts text and critical metadata from PDFs to understand file features and provenance – an essential function for digital file preservation. According to one senior IT specialist at NARA, using the improved Apache Tika toolkit helps them accomplish tasks more efficiently and safely. In addition, the specialist said their team is successfully using the updated tool to expedite the processing of large record sets and find new ways to process records more efficiently.
Furthermore, the PDF Association, DARPA’s Embedded Entrepreneur Initiative performer Galois, and other performers continue to focus on transitioning SafeDocs format insights and approaches to industry and international standards bodies. The agency also encourages industry to adopt its solutions, as seen in one industry example, in which a company describes its application of the Arlington model to improve its PDF creation software.
SAFEDOCS TOOLS AVAILABLE TODAY
The following tools can help software developers and cybersecurity/privacy researchers improve their organizations’ security posture in handling electronic documents. The tools range in functionality and specificity to suit a variety of uses. Check each description and follow the tool’s links for additional information.
Resources for the Portable Document Format (PDF):
Programmer resources for describing data formats and auto-generating parsing code:
Tools for understanding document collections and format rules:
Tools to understand behavior of existing parser code: