About


Original Tweet Count: 2,996,979

Final Tweet Count: 511,675

Date Selected: July 1st, 2021


Methodology

Our project aimed to characterize public opinion regarding the COVID-19 pandemic by applying machine learning to COVID-related tweets. Our methodology is detailed below:

  1. We queried the pre-curated dataset of COVID-related tweets published by Chen et al. in JMIR for those tweets posted on July 1st, 2021. A total of 2,996,979 tweets were identified.

  2. We filtered this initial dataset to tweets written in English that were not retweets (i.e., original content). The resulting 540,642 tweets were hydrated in Python using the Twitter API.

  3. The 511,675 successfully hydrated tweets were parsed from JSON/HTML and cleaned in R, followed by feature extraction (e.g., hashtags, URLs, replies, retweets, and location).

  4. Finally, we used natural language processing tools including structural topic modeling to derive aggregate features from our dataset.

Analyses were performed in Python and R. All code is available via our GitHub repository.
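For illustration, a minimal sketch of the feature-extraction step (step 3) in R is shown below. The file name and JSON field names are assumptions for illustration and may not match our exact schema; see the GitHub repository for the actual code.

```r
# Minimal sketch: parse hydrated tweets and extract tweet-level features.
# File name and field names are illustrative assumptions.
library(jsonlite)
library(stringr)

# Line-delimited JSON produced by the hydration step
tweets <- stream_in(file("hydrated_tweets_2021-07-01.jsonl"), verbose = FALSE)

features <- data.frame(
  id         = tweets$id_str,
  text       = tweets$full_text,
  n_hashtags = str_count(tweets$full_text, "#\\w+"),
  n_urls     = str_count(tweets$full_text, "https?://\\S+"),
  is_reply   = !is.na(tweets$in_reply_to_status_id_str),
  retweets   = tweets$retweet_count,
  favorites  = tweets$favorite_count
)
```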

Topic Modeling

Tweet preprocessing was performed using a wrapper around the tm package. Briefly, extra white space was stripped; numbers, stop words, punctuation, and low-frequency terms were removed; and words were stemmed using Snowball stemmers.
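In the stm package, this preprocessing is typically done with textProcessor(), a wrapper around tm. A minimal sketch follows; the input objects and frequency threshold are illustrative assumptions rather than our exact settings.

```r
library(stm)

# textProcessor wraps tm: lowercases, strips extra white space, removes
# numbers, stop words, and punctuation; stem = TRUE applies Snowball stemming.
processed <- textProcessor(
  documents         = features$text,   # tweet text (illustrative input)
  metadata          = features,        # tweet-level covariates
  lowercase         = TRUE,
  removestopwords   = TRUE,
  removenumbers     = TRUE,
  removepunctuation = TRUE,
  stem              = TRUE
)

# prepDocuments builds the vocabulary index and drops low-frequency terms
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 5)   # illustrative threshold
```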

After constructing the tweet-term matrix and the vocabulary index of words in the corpus, we used the stm package to estimate a structural topic model (STM) via semi-collapsed variational EM. STMs permit the study of interactions between tweet-level covariates (from feature extraction) and topical prevalence and/or content. We used spectral initialization and applied the algorithm of Lee and Mimno (2014) to estimate the number of topics. A maximum of 100 EM iterations was permitted; if convergence had not been reached by that point, the model was discarded.
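A sketch of the model fit: in the stm package, K = 0 combined with spectral initialization invokes the Lee and Mimno (2014) procedure to select the number of topics. The prevalence formula below is an illustrative placeholder for our tweet-level covariates, not our exact specification.

```r
# Fit the structural topic model with spectral initialization; K = 0 lets
# the Lee and Mimno (2014) algorithm choose the number of topics.
fit <- stm(
  documents  = out$documents,
  vocab      = out$vocab,
  K          = 0,
  prevalence = ~ retweets + favorites,  # illustrative covariates
  data       = out$meta,
  init.type  = "Spectral",
  max.em.its = 100   # models not converged within 100 iterations were discarded
)
```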

The resulting topics were examined, and topics of interest were selected for further analysis. For each topic, the top tweets were identified by ranking on the MAP estimate of the topic's theta value, i.e., the modal estimate of the proportion of word tokens assigned to that topic under the model. Representative tweets are displayed here.
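In stm, representative documents for a topic can be retrieved with findThoughts(), which ranks documents by the MAP estimate of theta as described above; the topic number and n below are illustrative.

```r
# Top tweets for a topic of interest, ranked by the MAP estimate of theta
thoughts <- findThoughts(fit, texts = out$meta$text, topics = 3, n = 5)
thoughts$docs[[1]]   # the five most representative tweets for topic 3
```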


Retweeted Tweets

Favorited Tweets

Topics

Topic 3: mask, wear, still, social, distanc, peopl, keep



Topic 3 pertains to masking and social distancing. Representative tweets are displayed here. Please note that tweets have not been filtered for objectionable content, and presentation here does not imply endorsement.

Word profiles for Topic 3 are reported below based on several metrics, including highest probability, FREX (weighted by overall frequency and topic exclusivity), lift (weighted by dividing by word frequency in other topics), and score (log frequency of the word in the topic divided by the log frequency of the word in other topics).

Highest Probability: mask, wear, still, social, distanc, peopl, keep

FREX: thomaskain, wear, mask, swampwitchleigh, taal, sassykadik, mask…

Lift: risingradio, sosupersam, buymapa, sbxatinbentesaryo, thatid, markzli, maskimo

Score: mask, wear, still, distanc, social, mandat, indoor
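Word profiles such as these correspond to the output of labelTopics() in the stm package, which reports the highest-probability, FREX, lift, and score words for each topic. A minimal sketch, using the topic numbers discussed here:

```r
# Highest-probability, FREX, lift, and score words for the selected topics
labels <- labelTopics(fit, topics = c(3, 5, 9, 14, 34), n = 7)
labels$frex   # FREX words, one row per selected topic
```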

Topic 5: vaccin, avail, sign, appoint, near, dose, covid-



Topic 5 pertains to vaccination; specifically, vaccine scheduling and availability. Representative tweets are displayed here. Please note that tweets have not been filtered for objectionable content, and presentation here does not imply endorsement.

Word profiles for Topic 5 are reported below based on several metrics, including highest probability, FREX (weighted by overall frequency and topic exclusivity), lift (weighted by dividing by word frequency in other topics), and score (log frequency of the word in the topic divided by the log frequency of the word in other topics).

Highest Probability: vaccin, avail, sign, appoint, near, dose, covid-

FREX: walgreensduan, walgreen, zip, appoint, cvs, berwyn, skoki

Lift: 𝐔𝐏𝐃𝐀𝐓𝐄, 𝐕𝐚𝐜𝐜𝐢𝐧𝐚𝐭𝐢𝐨𝐧, ઉપર, વર્ષ, વર્ષથી, activenland, alnwickgazett

Score: appoint, avail, dose, sign, vaccin, near, jul

Topic 9: trump, american, money, biden, vote, presid, cost



Topic 9 contains tweets discussing politics and COVID-19. Representative tweets are displayed here. Please note that tweets have not been filtered for objectionable content, and presentation here does not imply endorsement.

Word profiles for Topic 9 are reported below based on several metrics, including highest probability, FREX (weighted by overall frequency and topic exclusivity), lift (weighted by dividing by word frequency in other topics), and score (log frequency of the word in the topic divided by the log frequency of the word in other topics).

Highest Probability: trump, american, money, biden, vote, presid, cost

FREX: capitol, deficit, tax, republican, incit, cost, infrastructur

Lift: andymoor, cjuliasm, dinosaurtrix, gonnaeeeno, scotsmistt, deathstrump, gridsinfrastructur

Score: trump, vote, american, biden, tax, cost, republican

Topic 14: doctor, thank, pandem, doctorsday, medic, fight, day



Topic 14 contains tweets expressing gratitude to doctors and frontline healthcare workers. Representative tweets are displayed here. Please note that tweets have not been filtered for objectionable content, and presentation here does not imply endorsement.

Word profiles for Topic 14 are reported below based on several metrics, including highest probability, FREX (weighted by overall frequency and topic exclusivity), lift (weighted by dividing by word frequency in other topics), and score (log frequency of the word in the topic divided by the log frequency of the word in other topics).

Highest Probability: doctor, thank, pandem, doctorsday, medic, fight, day

FREX: fmgereduceto, help🙏, aashiwini, “medicines💊, fmge, doctordoctor, salutedoctor

Lift: ‘mankind’, “medicines💊, 𝐷𝑎𝑦👨‍⚕️🥰💓, 𝐷𝑜𝑐𝑡𝑜𝑟𝑠, 𝐻𝑎𝑝𝑝𝑦, 𝗗𝗼𝗰𝘁𝗼𝗿’𝘀, 👍drharshvardhan

Score: doctor, doctorsday, nationaldoctorsday, thank, salut, happydoctorsday, fight

Topic 34: variant, peopl, delta, vaccin, risk, cdc, covid



Topic 34 pertains to the COVID-19 variants, including and especially the Delta variant. Representative tweets are displayed here. Please note that tweets have not been filtered for objectionable content, and presentation here does not imply endorsement.

Word profiles for Topic 34 are reported below based on several metrics, including highest probability, FREX (weighted by overall frequency and topic exclusivity), lift (weighted by dividing by word frequency in other topics), and score (log frequency of the word in the topic divided by the log frequency of the word in other topics).

Highest Probability: variant, peopl, delta, vaccin, risk, cdc, covid

FREX: hypertransmiss, walenski, cnet, edyong, unvaccin, trvrb, delta

Lift: ampquotthi, youampquot, shelbymccowan, bannedword, cdcword, doubleplusgoodspeak, trumphasdementia

Score: variant, delta, cdc, risk, vaccin, peopl, unvaccin

References


  1. Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013). "The Structural Topic Model and Applied Social Science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.

  2. Roberts, M., Stewart, B., and Airoldi, E. (2016). "A Model of Text for Experimentation in the Social Sciences." Journal of the American Statistical Association.

  3. Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural Topic Models for Open-Ended Survey Responses." American Journal of Political Science, 58(4), 1064-1082.

  4. Roberts, M., Stewart, B., and Tingley, D. (2016). "Navigating the Local Modes of Big Data: The Case of Topic Models." In Data Analytics in Social Science, Government, and Industry. New York: Cambridge University Press.

  5. Lee, M. and Mimno, D. (2014). "Low-Dimensional Embeddings for Interpretable Anchor-Based Topic Inference." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1319-1328.

  6. Taddy, M. (2013). "Multinomial Inverse Regression for Text Analysis." Journal of the American Statistical Association, 108(503), 755-770.