The data contains the ids of posts and comments from Reddit submissions, taken from the Reddit API. We used Hadoop to process the raw JSON dump from https://files.pushshift.io/reddit/
The user ids that we publish do not correspond to those in the Reddit API, for privacy reasons.
The data is stored in the following format:

predicate_name \t user_id \t attribute_value_1:::attribute_value_2:::...:::attribute_value_n \t message_id \n

\t separates the fields; ":::" separates multiple attribute values. An example entry:

64427   paramedic:::firefighter dgnk3pe
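
As an illustration of the format above, here is a minimal Python sketch of a line parser. It is not part of the published scripts: the names Entry and parse_entry are hypothetical, and the handling of three-field lines (no predicate_name, as in the example entry above) is an assumption based on that example.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """One record of the data file (hypothetical container, for illustration)."""
    predicate_name: Optional[str]   # None when the line has no predicate field
    user_id: str                    # anonymized id, not the Reddit API user id
    attribute_values: List[str]     # values that were joined with ":::"
    message_id: str


def parse_entry(line: str) -> Entry:
    """Split one tab-separated line into its fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) == 4:
        predicate_name, user_id, values, message_id = fields
    elif len(fields) == 3:
        # Assumption: lines like the example entry omit predicate_name.
        predicate_name = None
        user_id, values, message_id = fields
    else:
        raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")
    return Entry(predicate_name, user_id, values.split(":::"), message_id)


if __name__ == "__main__":
    # The example entry from above, with the fields separated by tabs.
    print(parse_entry("64427\tparamedic:::firefighter\tdgnk3pe\n"))
```

Splitting on "\t" only, rather than on any whitespace, keeps attribute values that contain spaces intact.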

To retrieve the actual contents of the submissions, use the script provided in
https://github.com/Anna146/HiddenAttributeModels/tree/master/prepare_data/hadoop