Measuring human capital with social media data and machine learning.
Journal:
Scientific Reports
Published Date:
Jun 4, 2026
Abstract
Timely data on educational attainment at granular geographic levels remains scarce in many countries, limiting evidence-based policy-making. Recent advances in machine learning have enabled the use of non-traditional data sources such as satellite imagery and mobile phone records to measure development indicators. While these approaches have been successful in predicting outcomes such as wealth, poverty, or population density, previous attempts to predict educational attainment have achieved only modest accuracy. Here we show that language patterns and user behavior on social media can explain up to 70 percent of the variance in regional educational attainment. Our machine learning framework leverages linguistic features, user behavior, and network characteristics from 25 million geolocated tweets from the United States and Mexico. It performs particularly well in predicting higher education levels and maintains good performance even when data are collected over short periods. These results show that digital communication patterns can serve as reliable proxies for human capital. In light of the rapid expansion of social media use around the globe, this represents a promising approach to tracking educational outcomes in regions lacking granular and timely survey data.
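The "percent of variance explained" figure in the abstract is conventionally the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below shows how such a score could be computed from observed and predicted regional values. It is an illustration only: the function name `r_squared` and the numbers are hypothetical and do not come from the study, and the abstract does not state how the authors computed or cross-validated their metric.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted` (R^2)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from the study):
# average years of schooling per region, observed vs. model-predicted.
observed = [8.0, 10.0, 12.0, 14.0]
predicted = [9.0, 10.0, 12.0, 13.0]
score = r_squared(observed, predicted)
```

A score of 0.70 on held-out regions would correspond to the "70 percent of the variance" reading; a model that always predicts the regional mean scores 0.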