Every year since 1946, the United Nations General Assembly has begun with the General Debate.
In the General Debate, the leaders of all UN member states deliver an annual address in which they emphasize the issues in global politics they regard as most important, reveal their positions on these issues, and seek to persuade other states of the merits of their perspective. These annual statements are therefore an invaluable source of multifaceted information for scholars of international relations, comparable across countries and over time.
However, these texts are often stored as poor-quality images, which prevents researchers from applying natural language processing and data science methods to them. In this project, we create the UN General Debate Corpus (UNGDC), which covers the entire 1946-2022 period. The dataset is updated annually and at present contains over 10,000 speeches from 202 countries, including historical countries, making it a unique, ready-to-use resource and the most comprehensive collection of global political speeches.