Recent geopolitical events in Ukraine and the Middle East have highlighted the growing volatility of today's world. The emergence of states like Brazil, China and Russia underlines how the world is becoming a more competitive place, with power more evenly distributed on a global scale. This structural shift in power away from consolidation in the West has been described as the move from a unipolar to a multipolar world.
Along another dimension, though, threats to governments and private sector organizations are increasingly fragmenting away from states and the traditional contours of sovereignty, and into the realm of entrepreneurial terrorist organizations.
Both of these shifts have implications for intelligence gathering in the private and public sectors alike.
Set against this changing threat landscape is the opportunity presented by new technology to gain more predictive intelligence about emerging threats to geopolitical stability. The recent tendency for regional conflagrations to spring up and surprise organizations raises the question of how many of these events are now predictable with the advent of Big Data.
Traditionally, risk identification and analysis have been mostly qualitative, performed by expert analysts covering a particular region who collate information themselves and then interpret and disseminate their findings. This is typically a three-part intelligence process encompassing data collection, analysis and dissemination.
Investments in analytic technology
The intelligence failings exposed in the aftermath of 9/11, and again during the Arab Spring, centered on deficiencies in the analysis stage of this three-stage methodology. The hypothesis was that because independent datasets were heavily siloed, it was hard to see connections between different types of data, research themes and regions. The failure to co-mingle different types of data meant that connections remained latent rather than visible, ultimately resulting in negative surprises.
To address this issue, investments in data fusion technology were launched: technologies that could sit on top of various data stores and draw connections between events and entities through link and network analysis to, for example, identify possible terrorist cells from transactional data. Taking advantage of newly swelled defense budgets following 9/11, companies like i2, PredPol and Palantir built analytic systems to address this problem. By assembling the analytic architecture to support an iterative intelligence cycle, the idea was that more connections and patterns could be seen in the data and more insight therefore derived.
New data, new opportunities
However, while the investment in flexible analytic technology made the connections between data points more visible, it did not address a growing informational deficiency: surfacing hard-to-find, low-visibility information to show what is happening now and what might happen in the future. As more and more devices and platforms pump out situational information on a second-by-second basis, this information remains largely untapped, to the detriment of the intelligence gathering process.
At a macro level, the decline of newspapers and the emergence of peer-to-peer information sharing platforms have fundamentally reconfigured where intelligence is situated and how knowledge is exchanged. Information now moves at a lightning-fast pace, with social media platforms out-sprinting publishing organizations in the production and dissemination of reports. The result is that the open web has become both a reservoir of insight and a fossil layer of all content ever generated. We now require new ways to surface and explore this data at scale.
Until now, collecting this type of data was an extremely difficult and time-consuming process, involving the manual aggregation of hundreds of news articles every day by human event handlers and analysts to spot new developments. The simultaneous proliferation and fragmentation of textual content means there is both more information to wade through and greater variation in that content. Analysts must therefore spend longer on data collection, leaving less time for analysis, interpretation and forming their point of view.
A recent example demonstrates the problem: a predictive tweet posted by an Islamic State of Iraq and Syria (ISIS) activist, picked up by no one, may have given public warning that ISIS sympathizers were preparing an attack on the Saudi border with Yemen. A few hashtags began circulating in early June relating to Saudi security efforts targeting Al Qaeda in the region of Sharurah. Using one of these hashtags, one Twitter account posted: “In Sharurah [we have] our greatest knights and suicide bombers. They will commit a suicide attack in the police investigation building with the help of God.”
Two technical problems involved with this type of Open Source Intelligence (OSINT) data are worth highlighting. The first is identifying the relevant items of information and collecting the data from its original source. The second is presenting the data in a way that allows analytical investigations to yield insightful results on an ongoing, dynamic basis: providing data that can be queried in a way that is malleable, reusable and extensible.
In terms of the first challenge, while it can be costly to collect and store data, recent advances in data storage and database technology mean this is now less of an issue. Indeed, the recent disclosures by Edward Snowden suggest that bringing in targeted data streams at scale has already been undertaken by governments with relative ease.
The significantly more challenging and valuable problem is extracting the vital fields of information from unstructured text that can yield insight: in effect, removing the noise and secondary data and preserving only the vital parts (such as location, threat classification, date and actors). Essentially, this means transforming unstructured textual data into coherent data formats that can be organized and queried along multiple dimensions.
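To make this concrete, here is a minimal sketch of turning a raw news sentence into a structured record. The field names, keyword lists and gazetteer are invented for illustration; a production pipeline would use trained statistical models rather than keyword and regex matching.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative keyword scheme; real systems use far richer taxonomies.
EVENT_KEYWORDS = {
    "bomb attack": ["bomb", "ied", "explosion"],
    "protest": ["protest", "demonstration", "riot"],
}

@dataclass
class EventRecord:
    event_type: Optional[str]
    location: Optional[str]
    date: Optional[str]

def extract_event(text: str, known_locations: list) -> EventRecord:
    """Extract event type, location and ISO date from one sentence."""
    lowered = text.lower()
    event_type = next(
        (etype for etype, kws in EVENT_KEYWORDS.items()
         if any(kw in lowered for kw in kws)),
        None,
    )
    location = next(
        (loc for loc in known_locations if loc.lower() in lowered), None
    )
    date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return EventRecord(
        event_type, location, date_match.group(0) if date_match else None
    )

record = extract_event(
    "A bomb exploded near a market in Algiers on 2014-06-03.",
    known_locations=["Algiers", "Sharurah"],
)
print(record)
```

The payoff is that each sentence becomes a row with queryable columns instead of free text.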
The clear advantage of this type of data is its reusability: traditional qualitative analysis can be used once to answer a single question, whereas big data can be switched around multiple times to answer different types of questions iteratively: show me all terrorist attacks in Algeria; show me whether this is more or less than the regional norm; now show me attacks using improvised explosive devices in Algeria; and so on.
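That iterative querying can be sketched against a toy in-memory dataset. The rows and field names below are invented; the point is that the same structured data answers successive, narrowing questions without re-collection.

```python
# Hypothetical extracted event rows; a real store would have many more fields.
events = [
    {"country": "Algeria", "type": "bomb attack", "weapon": "IED"},
    {"country": "Algeria", "type": "protest",     "weapon": None},
    {"country": "Tunisia", "type": "bomb attack", "weapon": "firearm"},
    {"country": "Algeria", "type": "bomb attack", "weapon": "firearm"},
]

def query(rows, **criteria):
    """Return rows matching every key=value criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

# Ask successively narrower questions of the same dataset.
attacks_algeria = query(events, country="Algeria", type="bomb attack")
ied_algeria = query(events, country="Algeria", type="bomb attack", weapon="IED")
print(len(attacks_algeria), len(ied_algeria))
```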
A new algorithmic technique that addresses this problem is event extraction using natural language processing. It involves algorithms discovering particular items of information from unstructured text, such as certain risk events (protests, insurgency, strikes, bomb attacks) combined with locational and temporal context.
Context can be provided by different types of extraction: geo-extraction (identifying locations in unstructured text), time extraction (identifying times and dates), event extraction (identifying different types of events) and actor extraction (identifying the actors involved).
Natural language processing works by identifying specific words (often verbs) in unstructured text that conform to a classification scheme. For instance, “protest,” “demonstrate,” “boycott,” “riot,” “strike” and their variants all signify events relating to civil disorder. With statistical machine translation, these verbs can be identified in languages ranging from Arabic to Mandarin, giving global coverage of civil disorder events.
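A stripped-down sketch of this keyword-to-category mapping follows. The word list is a toy version of the kind of classification scheme described above, not an actual production taxonomy, and a real system would add stemming, negation handling and multilingual support.

```python
from typing import Optional

# Toy classification scheme: surface forms signalling civil disorder.
CIVIL_DISORDER = {
    "protest", "protests", "protested",
    "demonstrate", "demonstrated",
    "boycott", "boycotted",
    "riot", "rioted",
    "strike", "struck",
}

def classify(sentence: str) -> Optional[str]:
    """Label a sentence 'civil_disorder' if it contains a trigger word."""
    tokens = sentence.lower().replace(",", " ").split()
    return "civil_disorder" if CIVIL_DISORDER & set(tokens) else None

print(classify("Workers demonstrated outside the ministry"))  # civil_disorder
print(classify("The minister opened a new hospital"))          # None
```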
The clear advantage of this approach is a real-time way to discover threat events hidden within the open web that are relevant to particular intelligence products and correspond to pre-defined parameters. Rather than personally monitoring a host of websites and data feeds on a 24/7 basis, intelligence analysts can set the parameters that are relevant to them and use algorithms to discover, extract and understand the events.
The monitoring is performed by algorithms, allowing analysts to focus on the analysis side of the equation — saving them time and allowing them to deploy their resources toward more high value pursuits. Augmenting the analytic capability of analysts by delivering real-time data in a quantifiable and organized environment is the objective. This gives organizations early warning about low visibility threats, affording them time to conceive proactive mitigation strategies.
Furthermore, given the verbosity and density of text, it is extremely difficult for human analysts to wade through articles and link events to times, dates, locations and actors. At scale, this is best achieved using algorithms that can, for instance, identify all the possible dates relating to a specific event in an article, then choose the most likely one based on a set of predefined rules constructed algorithmically and refined using machine learning, a technique by which algorithms learn and improve based on past performance.
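The date-resolution step can be sketched as follows. The two rules here (prefer the date mentioned nearest the event, fall back to the publication date) are illustrative stand-ins for the learned rules described above.

```python
import re
from datetime import date

def resolve_event_date(text: str, event_keyword: str, published: date) -> date:
    """Pick the most likely date for an event mentioned in an article."""
    event_pos = text.lower().find(event_keyword.lower())
    # Collect every ISO-format date candidate with its character position.
    candidates = [(m.start(), date.fromisoformat(m.group(0)))
                  for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]
    if not candidates or event_pos == -1:
        return published  # rule: no in-text date, assume publication date
    # Rule: choose the candidate closest to the event mention.
    return min(candidates, key=lambda c: abs(c[0] - event_pos))[1]

article = ("On 2014-05-30 the region was calm. A protest erupted in Algiers "
           "on 2014-06-02, police said.")
print(resolve_event_date(article, "protest", published=date(2014, 6, 3)))
```

A machine-learned version would replace the hand-written distance rule with weights fitted on labelled articles.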
Disaggregating events into different buckets (location, time, type, actor) enables precise and surgical queries to be run, for example, recent incidents of protest in northern Algeria within a short period of time. Because this data is in a quantitative format, it can also be exported to visualization tools such as Tableau, CartoDB and TIBCO Spotfire to show trends and patterns in the data. A recent case study we performed with clients at Cytora looked at the spatial spread of Boko Haram activity from 2012 to 2014.
By running advanced queries, we were able to limit the data to events relating to Boko Haram in Nigeria and classify them into different types, such as attacks against civilians and attacks against the military. This type of analysis, enabled by the malleability of the data, revealed subtle tactical changes in Boko Haram’s activity.
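The shape of that analysis can be sketched with a grouped count over the event buckets. The rows below are invented placeholders, not Cytora's actual Boko Haram data; the technique is simply filtering to one actor and counting by year and event type to expose tactical shifts.

```python
from collections import Counter

# Invented rows in the spirit of the case study described above.
events = [
    {"group": "Boko Haram", "year": 2012, "type": "attack_on_civilians"},
    {"group": "Boko Haram", "year": 2013, "type": "attack_on_military"},
    {"group": "Boko Haram", "year": 2013, "type": "attack_on_civilians"},
    {"group": "Boko Haram", "year": 2014, "type": "attack_on_military"},
    {"group": "Boko Haram", "year": 2014, "type": "attack_on_military"},
]

# Filter to one actor, then count by (year, type) to surface tactical change.
by_year_type = Counter(
    (e["year"], e["type"]) for e in events if e["group"] == "Boko Haram"
)
for (year, etype), n in sorted(by_year_type.items()):
    print(year, etype, n)
```

The same grouped counts are what a tool like Tableau or CartoDB would render spatially or over time.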
Beyond the time saved and redeployed elsewhere, event extraction built on natural language processing can surface events that are hard to find, latent or buried in irregular news sources that only periodically contain new information. Quite simply, a human analyst can only cover a certain number of sources, and it makes sense to cover regular reporting outlets where the informational frequency and replenishment is high. This creates a bias against longer-tail online sources (such as Facebook accounts used by the Mali Police Force, or websites reporting on troop deployments in Russia), which publish less frequently but provide low-visibility, potentially high-impact events.
The advantage of algorithmic event extraction here is its inherent scalability and extensibility: the cost of monitoring new sources is far lower, and there is no trade-off of the kind a human analyst faces when taking on additional sources.
Once these discrete events are extracted and organized, it is possible to find valuable insights, such as that the number of bomb attacks in northern Algeria has increased 30 percent in the last month, or that the number of protests involving farmers in Burma has increased by 50 percent over the last three months. The value of this type of quantitative analysis is clear for spotting surges of instability in countries and identifying unusual changes in activity that diverge from historical norms. For instance, our analytics platform picked up a surge in ISIS activity in Syria and Iraq weeks before mainstream media became aware of it, or, indeed, even knew that ISIS was a threat.
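A simple version of surge detection compares the latest count against the historical norm. The threshold (two standard deviations above the historical mean) and the monthly counts below are invented for illustration; production anomaly detection would be more sophisticated.

```python
from statistics import mean, stdev

def is_surge(counts, threshold_sd: float = 2.0) -> bool:
    """Flag the latest count if it exceeds the historical mean
    by more than threshold_sd standard deviations."""
    history, current = counts[:-1], counts[-1]
    return current > mean(history) + threshold_sd * stdev(history)

# Hypothetical monthly bomb-attack counts; the final value is the current month.
monthly_bomb_attacks = [4, 5, 3, 6, 4, 5, 4, 12]
print(is_surge(monthly_bomb_attacks))
```

Run over thousands of (region, event type) series at once, a check like this is what turns extracted events into early warning.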
The way forward
Open source data provides, at least theoretically, a record of recent history: what has happened across a period of time and how change has occurred. It forms a bedrock for understanding why events have happened, informing us of the critical drivers and mechanisms that brought them into being.
Piping this open source intelligence into the right algorithmic environment in real-time can yield insight that would require hundreds of analysts to emulate in terms of physical data collection. In light of the speed, scale and flux of online information, it makes sense for both private organizations and governments to use this type of technology to augment the capabilities of their analysts.
Richard Hartley is co-founder and head of products at Cytora, where he works on product strategy and design, and closely collaborates with customers to define requirements and usability. He previously worked in product management at eBaoTech, a Chinese software company based in Shanghai. Richard has spoken at various conferences about the applications of new technology to risk methodologies.