The website Footnote 2 was used as a means to get tweet-ids Footnote 3; this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). [...] (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and the complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as tweets from users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
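Criteria (1) and (3) can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and it is not the authors' code. The record layout (a dict with a `user` key), the function names, and the fixed `max_tweets` cutoff are assumptions made for illustration; the study derived its cutoff from the top 0.5 percentile of user activity.

```python
from collections import Counter


def drop_most_active(tweets, max_tweets):
    """Remove all tweets from users with more than max_tweets tweets.

    max_tweets is a hypothetical fixed cutoff standing in for the
    percentile-based threshold described in the text.
    """
    counts = Counter(t["user"] for t in tweets)
    return [t for t in tweets if counts[t["user"]] <= max_tweets]


def restrict_to_shared_users(pre_tweets, post_tweets):
    """Keep only tweets from users present in both datasets."""
    shared = {t["user"] for t in pre_tweets} & {t["user"] for t in post_tweets}
    pre_kept = [t for t in pre_tweets if t["user"] in shared]
    post_kept = [t for t in post_tweets if t["user"] in shared]
    return pre_kept, post_kept
```

Restricting both datasets to the shared set of users is what makes the comparison a within-group design: every remaining user contributes tweets before and after the change.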
The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count if they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users experienced, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
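The text-cleaning steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the study. The regular expressions and the function names `contains_url` and `clean_tweet` are hypothetical. Removing tweet headers is not covered, and separating media URLs from other URLs would require tweet metadata that this sketch does not model.

```python
import re
import unicodedata

# Assumed patterns: a URL starts with http(s)://, a mention with "@".
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def contains_url(text):
    """Return True if the tweet text contains a URL (exclusion check)."""
    return URL_RE.search(text) is not None


def clean_tweet(text):
    """Convert to ASCII, drop screen-name mentions, and remove line breaks."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    no_mentions = MENTION_RE.sub(" ", ascii_text)
    # split() with no arguments also splits on line breaks.
    return " ".join(no_mentions.split())
```

A possible usage is to first drop every tweet for which `contains_url` returns `True` and then apply `clean_tweet` to the remaining texts.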
Token and bigram analysis
The R package Footnote 5 ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC, and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
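Because Eq. (1) and (2) are not reproduced in this section, the following is only a sketch of one common form of such a T-score, written in Python for illustration. It assumes that the variance of each relative frequency can be approximated by the relative frequency itself, which is reasonable for rare tokens; the exact equation used in the study may differ. The function and parameter names are hypothetical.

```python
import math


def t_score(f_pre, f_post, n_pre, n_post):
    """T-score for the change in a token's relative frequency.

    f_pre, f_post: token counts before and after the change.
    n_pre, n_post: total number of tokens in each dataset.
    Assumption: variance of a relative frequency P is approximated by P.
    """
    p_pre = f_pre / n_pre
    p_post = f_post / n_post
    standard_error = math.sqrt(p_post / n_post + p_pre / n_pre)
    if standard_error == 0:
        # Token absent from both datasets: no measurable difference.
        return 0.0
    return (p_post - p_pre) / standard_error
```

With this sign convention, a token that is relatively more frequent post-CLC receives a positive score and a token that is relatively more frequent pre-CLC receives a negative score, matching the interpretation given above.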
Part-of-speech (POS) analysis
The R package Footnote 6 ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and [...]). The openNLP POS model has been reported to reach an accuracy of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, the same analyses were performed for the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Therefore, we assume there are no systematic confounds.