pyconversations.reader
The pyconversations.reader sub-module contains classes that read data from disk into a universal conversation format. While the ConvoReader can and should be used (especially after conversion into the universal format), other readers are provided as examples of how to extend the basic function of a reader so that it can read other file formats one might obtain from an API.
- class pyconversations.reader.BNCReader
A custom Reddit Reader generated for the data format from “Before Name-calling: Dynamics and Triggers of Ad Hominem Fallacies in Web Argumentation” (Habernal et al., 2018).
Notes
See: https://www.aclweb.org/anthology/N18-1036/
- static iter_read(path_pattern, ld=True, rd=False)
Function for creating a conversation reading iterator. Will read and parse part of a file/directory, yielding segments as queried.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether to activate language detection (Default: True)
rd (bool) – Whether to use the secondary Reddit parser (RedditPost.parse_rd) or not (RedditPost.parse_raw) (Default: False)
- Raises
NotImplementedError –
- static read(path_pattern, ld=True)
Reads the entire archive of posts from this dataset. Posts that violate rule 2 of the r/ChangeMyView sub-reddit are tagged with the AH=1 tag; otherwise, posts are tagged with AH=0.
- Parameters
path_pattern (str) – The path to the directory containing the data
ld (bool) – Whether to activate language detection (Default: True)
- Returns
list(Conversation) – A list of all parsed and segmented disjoint Conversations within this dataset
- class pyconversations.reader.BaseReader
Abstract Reader class. Defines the two functions, read and iter_read, that Readers may implement to read from disk.
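The pattern described above can be sketched as follows. This is a minimal stand-in, not the library's source: SketchBaseReader and LineReader are hypothetical names, and the assumption that an unimplemented method raises NotImplementedError is inferred from the Raises entries elsewhere on this page.

```python
import os
import tempfile


class SketchBaseReader:
    """Hypothetical stand-in for an abstract reader (not the library class)."""

    @staticmethod
    def read(path_pattern):
        # Eager variant: would return everything at once.
        raise NotImplementedError

    @staticmethod
    def iter_read(path_pattern):
        # Lazy variant: would yield items one at a time.
        raise NotImplementedError


class LineReader(SketchBaseReader):
    """Hypothetical subclass that implements only the lazy variant."""

    @staticmethod
    def iter_read(path_pattern):
        # Yield each non-empty line of a single text file.
        with open(path_pattern, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def demo():
    # Write a small file, then read it back through the lazy interface.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "posts.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("first\n\nsecond\n")
        return list(LineReader.iter_read(path))
```

A subclass that overrides only one of the two methods, as LineReader does here, leaves the other raising NotImplementedError, which matches how several readers on this page document one of their methods.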
- class pyconversations.reader.ChanReader
Reader class for reading and converting raw 4chan data
- static iter_read(path_pattern, ld=True)
Function for iteratively reading an entire file/directory of conversations. Currently expects a path_pattern that points to a directory of JSON files enumerated from 00 to 99.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether or not language detection should be activated. (Default: True)
- Yields
2-tuple(int, Conversation) – A tuple containing which chunk (in 0..99) this Conversation originated from as well as a Conversation segment.
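Because each yielded item pairs a chunk index with a Conversation, a caller can regroup the stream by chunk. The sketch below uses a hypothetical stand-in generator (fake_iter_read) with strings in place of Conversation objects, so it illustrates only the shape of the yielded tuples, not the reader itself.

```python
from collections import defaultdict


def fake_iter_read():
    """Hypothetical stand-in yielding (chunk index, conversation) pairs."""
    yield 0, "conversation-a"
    yield 0, "conversation-b"
    yield 7, "conversation-c"


def group_by_chunk(pairs):
    """Collect conversations under the chunk index they came from."""
    grouped = defaultdict(list)
    for chunk, conversation in pairs:
        grouped[chunk].append(conversation)
    return dict(grouped)


by_chunk = group_by_chunk(fake_iter_read())
```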
- static read(path_pattern, ld=True)
Function for reading an entire file/directory of conversations.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether or not language detection should be activated. (Default: True)
- Raises
NotImplementedError –
- class pyconversations.reader.ConvoReader
Universal Conversation reader. After parsing raw files into the universal format, one can save them to disk and re-load them using this Reader class.
- static iter_read(path_pattern)
Function for creating a conversation reading iterator. Will read and parse part of a file/directory, yielding conversations as queried.
- Parameters
path_pattern (str) – The path to a directory containing Conversation data. This path will be appended with the pattern *.json.
- Yields
Conversation – A conversation, read from disk.
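The note that the path is appended with *.json can be illustrated with the standard glob module. The sketch below assumes plain string concatenation, which is not confirmed by this page; list_json_files is a hypothetical helper, not part of the library. Under that assumption, a directory path without a trailing separator produces a pattern that matches nothing inside the directory.

```python
import glob
import os
import tempfile


def list_json_files(path_pattern):
    """Hypothetical helper: append '*.json' to the path and glob the result."""
    return sorted(glob.glob(path_pattern + "*.json"))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = os.path.join(tmp, "data")
        os.mkdir(data_dir)
        for name in ("a.json", "b.json", "notes.txt"):
            with open(os.path.join(data_dir, name), "w", encoding="utf-8") as handle:
                handle.write("{}")
        # With a trailing separator the pattern is '<tmp>/data/*.json'.
        with_sep = [os.path.basename(p) for p in list_json_files(data_dir + os.sep)]
        # Without it the pattern is '<tmp>/data*.json', which looks in <tmp> instead.
        without_sep = list_json_files(data_dir)
        return with_sep, without_sep
```

If the reader joins the path and the pattern in some other way, the trailing separator may not matter; checking with a small directory first is a cheap way to find out.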
- class pyconversations.reader.QuoteReader
A reader specifically designed to read JSONs of Quote tweet archives.
- static iter_read(path_pattern, ld=True)
Function for creating a conversation reading iterator. Will read and parse part of a file/directory, yielding segments as queried.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether to activate language detection (Default: True)
- Raises
NotImplementedError –
- static read(path_pattern, ld=True)
Reads an entire directory of quote tweet JSONLine files, segments them into disjoint conversations, and returns the conversations.
- Parameters
path_pattern (str) – The path to the directory
ld (bool) – Whether to activate language detection (Default: True)
- Returns
list(Conversation) – A list of disjoint conversations
- class pyconversations.reader.RawFBReader
Reader for raw FB data
- static iter_read(path_pattern, ld=True)
Given a path_pattern that points to a directory containing raw FB data in the form of path_pattern/PAGES/RAW_DATA.json, this function will iteratively read the files and produce Conversational data.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether or not language detection should be activated. (Default: True)
- Yields
2-tuple(str, Conversation) – The name of the page (as parsed) and an associated Conversation from that page
- Raises
ValueError – If a JSON file is encountered that isn’t named as one of: post, comments, replies, attach, react, scrape
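The ValueError condition above can be pictured as a name check run before parsing. The sketch below is an assumption about how such a check might look: check_json_name is a hypothetical helper, and treating the allowed name as the file stem (the part before .json) is a guess, since this page does not say whether names must match exactly.

```python
import os

# Names taken from the ValueError description above.
ALLOWED_NAMES = {"post", "comments", "replies", "attach", "react", "scrape"}


def check_json_name(path):
    """Hypothetical check: return the file stem, or raise for an unknown JSON name."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext == ".json" and stem not in ALLOWED_NAMES:
        raise ValueError(f"Unexpected JSON file: {path}")
    return stem
```

Running a check like this over a directory before calling the reader can surface stray JSON files early rather than partway through a long read.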
- class pyconversations.reader.RedditReader
General Reddit raw data reader.
- static iter_read(path_pattern, ld=True, rd=False)
This iterative reading function assumes that the path it will be pointed towards contains raw Reddit comments and submissions, sorted/chunked by the month they were created.
- Parameters
path_pattern (str) – The path to the directory containing the data
ld (bool) – Whether to activate language detection (Default: True)
rd (bool) – Whether to use the secondary Reddit parser (RedditPost.parse_rd) or not (RedditPost.parse_raw) (Default: False)
- Yields
list(Conversation) – A chunk of Conversations, as parsed
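Since each yielded item is a list of Conversations rather than a single one, callers that want a flat stream need to flatten the chunks. The sketch below uses a hypothetical stand-in generator (fake_iter_read) with strings in place of Conversation objects; only the shape of the output is taken from the description above.

```python
from itertools import chain


def fake_iter_read():
    """Hypothetical stand-in yielding one list of conversations per chunk."""
    yield ["2019-01 conversation a", "2019-01 conversation b"]
    yield []  # a chunk may plausibly be empty
    yield ["2019-02 conversation c"]


def flatten(chunks):
    """Lazily flatten an iterable of lists into a single iterator."""
    return chain.from_iterable(chunks)


conversations = list(flatten(fake_iter_read()))
```

chain.from_iterable keeps the iteration lazy, so only one chunk needs to be held in memory at a time until the result is materialized with list.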
- class pyconversations.reader.ThreadsReader
This is a custom Twitter “Threads” Reader. It may be deprecated in favor of newer Twitter reply functionality.
- static iter_read(path_pattern, ld=True)
Function for creating a conversation reading iterator. Will read and parse part of a file/directory, yielding segments as queried.
- Parameters
path_pattern (str) – The path to file or directory containing Conversation data
ld (bool) – Whether to activate language detection (Default: True)
- Yields
2-tuple(str, list(Conversation)) – The string ID of the threaded discussion and a list of the disjoint Conversations identified within it