Article

MODEL TREES WITH TOPIC MODEL PREPROCESSING: AN APPROACH FOR DATA JOURNALISM ILLUSTRATED WITH THE WIKILEAKS AFGHANISTAN WAR LOGS

Journal

ANNALS OF APPLIED STATISTICS
Volume 7, Issue 2, Pages 613-639

Publisher

INST MATHEMATICAL STATISTICS-IMS
DOI: 10.1214/12-AOAS618

Keywords

Afghanistan; count data; database data; latent Dirichlet allocation; model-based recursive partitioning; WikiLeaks; computational social science; tree stability; tree validation; text mining

Abstract

The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of incidents in the US-led Afghanistan war, covering the period from January 2004 to December 2009. The recent growth of data on complex social systems, and the potential to derive stories from them, has shifted journalistic and scientific attention increasingly toward data-driven journalism and computational social science. In this paper we advocate the use of modern statistical methods for problems of data journalism and beyond, which may help journalistic and scientific work and lead to additional insight. Using the WikiLeaks Afghanistan war logs for illustration, we present an approach that builds intelligible statistical models for interpretable segments in the data, in this case to explore the fatality rates associated with different circumstances in the Afghanistan war. Our approach combines preprocessing by Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process the natural language information contained in each report summary by estimating latent topics and assigning each report to one of them. Together with other variables, these topic assignments serve as splitting variables for finding segments in the data, and a local statistical model for the reported number of fatalities is fitted to each segment. Segmentation and fitting are carried out by recursive partitioning of negative binomial distributions. We identify segments with different fatality rates that correspond to a small number of topics and other variables as well as their interactions. Furthermore, we work out the similarities between segments and connect them to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan and serves as an example of how data journalism, computational social science and other areas with interest in database data can benefit from modern statistical techniques.
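The segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each report already carries a topic label (as LDA preprocessing would supply), each segment gets an intercept-only negative binomial model, the dispersion is chosen from a coarse grid, and a single split is picked greedily by log-likelihood gain. The abstract does not state the splitting criterion; model-based recursive partitioning normally selects splits with parameter instability tests, so the greedy rule here is a simplified stand-in. The names `fit_nb`, `best_split`, `THETA_GRID` and the toy records are hypothetical and made up for illustration; they are not war-log data.

```python
import math

# Coarse grid of candidate dispersion ("size") values; an assumption of this sketch.
THETA_GRID = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0]


def nb_loglik(counts, mu, theta):
    """Log-likelihood of counts under a negative binomial with mean mu and size theta."""
    if mu == 0.0:
        # Degenerate case: a zero mean only explains all-zero counts.
        return 0.0 if all(y == 0 for y in counts) else float("-inf")
    ll = 0.0
    for y in counts:
        ll += (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
               + theta * math.log(theta / (theta + mu))
               + y * math.log(mu / (theta + mu)))
    return ll


def fit_nb(counts):
    """Intercept-only fit: mean = sample mean, size = best value on the grid."""
    mu = sum(counts) / len(counts)
    theta = max(THETA_GRID, key=lambda t: nb_loglik(counts, mu, t))
    return mu, theta, nb_loglik(counts, mu, theta)


def best_split(rows, target, split_vars, min_size=3):
    """Return (variable, level, gain) for the best 'level vs. rest' binary split."""
    parent_ll = fit_nb([r[target] for r in rows])[2]
    best = None
    for var in split_vars:
        for level in sorted({r[var] for r in rows}):
            left = [r[target] for r in rows if r[var] == level]
            right = [r[target] for r in rows if r[var] != level]
            if len(left) < min_size or len(right) < min_size:
                continue  # skip splits that leave a segment too small to fit
            gain = fit_nb(left)[2] + fit_nb(right)[2] - parent_ll
            if best is None or gain > best[2]:
                best = (var, level, gain)
    return best


# Made-up toy records: a topic label, a region, and a fatality count per report.
killed = {
    "topic_a": [5, 7, 6, 8, 4, 9],
    "topic_b": [0, 1, 0, 0, 1, 0],
    "topic_c": [0, 0, 1, 0, 0, 1],
}
rows = [
    {"topic": t, "region": "north" if i % 2 == 0 else "south", "killed": y}
    for t, ys in killed.items()
    for i, y in enumerate(ys)
]

var, level, gain = best_split(rows, "killed", ["topic", "region"])
print(var, level, round(gain, 2))
```

Applying `best_split` recursively to each resulting segment would grow a tree whose leaves each hold their own fatality model. A fuller treatment would add a stopping rule and a proper split-selection test.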
