The data contains the IDs of posts and comments from Reddit submissions, taken from the Reddit API. We used Hadoop to process the raw JSON dump from https://files.pushshift.io/reddit/

For privacy reasons, the user IDs that we publish do not correspond to those in the Reddit API.

The data is stored in the following format:

predicate_name \t user_id \t attribute_value_1:::attribute_value_2:::...:::attribute_value_n \t message_id \n

\t is the field separator; ":::" separates multiple attribute values.

An example entry:

64427 paramedic:::firefighter dgnk3pe

To retrieve the actual contents of the submissions, use the script provided in https://github.com/Anna146/HiddenAttributeModels/tree/master/prepare_data/hadoop
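
The entry format described above can be read with a few lines of Python. This is a minimal sketch and not part of the published scripts: the names Entry and parse_entry are hypothetical, and the predicate name "profession" in the demo is an invented placeholder. The example entry above shows only three fields, so the sketch also accepts lines without a leading predicate name; that fallback is an assumption about the files, not a documented guarantee.

```python
from typing import List, NamedTuple, Optional

FIELD_SEP = "\t"    # separator between fields
VALUE_SEP = ":::"   # separator between multiple attribute values


class Entry(NamedTuple):
    """One line of the data file (hypothetical container, for illustration)."""
    predicate_name: Optional[str]  # None when the line has no predicate field
    user_id: str                   # anonymized ID, not a Reddit user ID
    attribute_values: List[str]
    message_id: str


def parse_entry(line: str) -> Entry:
    """Split one tab-separated line into its fields."""
    fields = line.rstrip("\r\n").split(FIELD_SEP)
    if len(fields) == 4:
        predicate, user_id, values, message_id = fields
    elif len(fields) == 3:
        # Assumption: a three-field line omits the predicate name,
        # as in the example entry in the description.
        predicate = None
        user_id, values, message_id = fields
    else:
        raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")
    return Entry(predicate, user_id, values.split(VALUE_SEP), message_id)


if __name__ == "__main__":
    # Three-field line, matching the example entry.
    print(parse_entry("64427\tparamedic:::firefighter\tdgnk3pe\n"))
    # Four-field line; "profession" is a made-up predicate name.
    print(parse_entry("profession\t64427\tparamedic:::firefighter\tdgnk3pe\n"))
```

The user_id and message_id fields are kept as strings so that the values are preserved exactly as they appear in the file.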