Florida_Voter_File_refactoring.py
  1. #%% Introduction
  2. # -*- coding: utf-8 -*-
  3. '''
  4. put .py file in its own folder (maybe on its separate drive, try usb drive)
  5.  
  6.  
  7.  
  8. '''
  9.  
  10. #%% TO-DO LIST
  11. '''
  12. TO DO:
  13. 13) build executable/GUI using 'gooey' library (Or using 'Python GTK', maybe also with Glade)
  14. 16) HIGHER PRIORITY: Along with a set of csv files fit for SurveyMonkey, ALSO output a companion file, using the SAME entries, but including ALL columns. This allows for a DIY rematching in an elegant way for even basic users.
  15. 16a) This will require significant changes to the donow() function, where we SAMPLE BEFORE cutting out extraneous columns.
  16. 16b) The process will go something like (after importing):
  17. 16b.01) Delete entries without emails
  18. 16b.02) Recode county codes to county names
  19. 16b.03) Recode race codes to race names
  20. 16b.04) Trim whitespace and NaNs
  21. 16b.05) Calculate age
  22. 16b.051) (Delete the line to drop Birth_Date. Should be unnecessary if we drop extraneous columns with columns_keep last.)
  23. 16b.06) Cut sample
  24. 16b.07) Output all-column version
  25. 16b.08) Cull extraneous columns
  26. 16b.09) THEN output slices using the current loop programming
  27. 16b.10) End the process
  28. 17) Low priority: Adjust the use of times in progress feedback. Maybe only use time elapsed ONCE at the end?
  29. 18) For end-product application, I want to add more lines to the settings file. Like:
  30. 18.1) The settings file should include the following categories:
  31. 18.1.1) Settings, original - the original "factory-default" settings for the application. This is so that if the settings ever become broken, this remains a failsafe.
  32. 18.1.2) Settings, user defaults - this is to allow ease-of-use on a regular day-to-day
  33. 18.1.3) Settings, for current project - this is for special cases
  34. 18.2) The settings base file should be easy for the GUI package to interpret, making it easy for a user to know what settings they are changing
  35. 18.3) The settings base file probably needs, for each 'setting':
  36. 18.3.1) The associated variable name
  37. 18.3.2) the associated variable type
  38. 18.3.3) The gui text describing the setting
  39. 18.3.4) A long descripiton (for the help file) explaining each setting and how to use it
  40. 18.3.5) The actual value, string, or list that the setting variable contains
  41. 18.4) columns_keep should be an option in the settings
  42. 18.5) counties_filter: A list of strings to allow the user to select which counties they want to have in their sample
  43. 18.6) races_filter: A list of strings to allow the user to select which races they want to have in their sample
  44. 18.7) political_party_filter: A list of strings to allow the user to select which political parties they want to have in their sample
  45. 18.8) optional_filter_1_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
  46. 18.9) optional_filter_1_list: A list of strings to allow the user to select which values in optional filter #1 they want to have in their sample
  47. 18.10) optional_filter_2_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
  48. 18.11) optional_filter_2_list: A list of strings to allow the user to select which values in optional filter #2 they want to have in their sample
  49. 18.12) optional_filter_3_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
  50. 18.13) optional_filter_3_list: A list of strings to allow the user to select which values in optional filter #3 they want to have in their sample
  51. ZZ) LOWEST PRIORITY: Is there a way to EXTREMELY reduce the memory requirements?
  52. ZZa) Maybe count the number of lines per file,
  53. building an index to ALL rows available in all files
  54. (or maybe all rows with valid emails),
  55. then drawing a sample from that index,
  56. then drawing / cleaning / appending line by line to new files.
  57.  
  58.  
  59.  
  60. # in the final product, it would be cool to auto-detect counties, races, ages, zip codes, and maybe some other things
  61. 4) Find a way to turn the code into a standalone program, that works in a VERY user friendly way, such that even an intern can extract a voter file sample.
  62. 4-) SEE THIS: https://docs.python-guide.org/shipping/freezing/
  63. 4a) http://www.py2exe.org/index.cgi/Tutorial
  64. 4b) Detecting the presence of the voter files, and building a path to them?
  65. 4c) Customized sample/slice sizing?
  66. 4d) GUI? https://www.tutorialspoint.com/python/python_gui_programming.htm
  67. 4e) Folder selection dialog: https://stackoverflow.com/questions/11295917/how-to-select-a-directory-and-store-the-location-using-tkinter-in-python
  68. 4f) Help file: using the app, troubleshooting, using the voter file
  69. 4g) os.startfile(os.path.join('data', 'voter_file_extract_layout.txt')) opens up a text file copy of the voter file extract layout document
  70. 4h) OTHER GUI OPTION: 'Python GTK': https://python-gtk-3-tutorial.readthedocs.io/en/latest/introduction.html
  71. 4h.1) https://stackoverflow.com/questions/17698138/python-gui-programming-using-drag-and-drop-also-incorporating-stdout-redirect
  72. 4h.2) https://zetcode.com/python/gtk/
  73. 4i) Glade: https://glade.gnome.org/
  74. 12) an option to save settings. For example, the default 'settings' should be the user-defined source and output directory, as well as the standard set of columns and sample/slice sizes. But it would be nice to load other settings.
  75.  
  76.  
  77. DONE (success) ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
  78. 1) Try bulk importing without regard to memory usage. Try to overload the computer.
  79. 1a) Loop to iterate through voter files and concatenate them.
  80. 1b) Even consider concatenating whole-files, and not just rows with emails, just to stress-test. Performance in this regard sets the stage for 538-type voter-history analysis.
  81. 2) Build random-sample extractor
  82. 3) Currently, the email-sensor only eliminates blank and '*' emails. Can we check for and eliminate improperly formatted emails?
  83. 3a) searchfor : #EMAIL SENSOR
  84. RESPONSE: As it is, it's probably optimized. I do not capture anything but email addresses (even if improperly formatted) but cleaning improperly formatted email addresses is WAY out of scope.
  85. 5) Is garbage collection necessary? https://stackoverflow.com/a/39101287 NOT NECESSARY IN THIS PROGRAM
  86. X) when compliling to an EXE, consider storing in an end-user editable file, like a JSON to future proof against column name
  87. DONE! See 'data' folder
  88. X) in final product, would it be possible to auto-detect column headers?
  89. No, because the voter file has no headers.
  90. 6) check read_into_dictionary for \r\n compatability
  91. 7) slice & sample output loop needs refactoring and cleaning
  92. 7a) ( for start in range(0, allvoterfiles_sample.shape[0], slice_size): )
  93. X) all_voter_files progress tracker: see the python file: 'progress tracker example [see lines 128 to 150].py'
  94. 8) import_all_voter_files needs refactoring
  95. 8a) consider cleaning / or better defining how import_all_voter_files() uses globals or 'allvoterfiles' variable
  96. 8b) the goal is to load all voter files into a single dataframe for editing and for setting up the donow() algorithm
  97. 9) clean_voter_file - does this need to to be factored out?
  98. 10) date_voterfile & path_voterfile & output dir should be treated the same as the lookup dictionaries, so that the final GUI app can ask the user to define them on first launch, then treat the user-defined variables as defaults from then on.
  99. 11) Clean up use of single- vs double-quotes in strings - DONE
  100. 14) fix import_all_voter_files() log outputs so that it works when executing the donow() function
  101. 15) add output feedback on each county being imported
  102. '''
  103.  
  104.  
  105.  
  106. ''' This program is intended to extract a random-sample of emails from the Florida voter file for use in email surveys. '''
  107.  
  108. #%% Importing packages, declaring variables and functions
  109. import datetime
  110. import os
  111. import pandas as pd
  112. import glob
  113. import numpy as np
  114. import math
  115. pd.set_option('display.max_columns', None) #allows better error-checking
  116.  
  117. # initializing lists and dictionaries. Please see these files in the 'data' folder in case the column names change.
  118. running_as_console = False
  119. columns_keep = []
  120. header_voterfile = []
  121. lookup_sachsregions = {}
  122. lookup_county_code_to_name = {}
  123. lookup_race = {}
  124. #is this 'settings' variable necessary with the settings file?
  125. settings = {'path_voterfile':'C:/Users/JohnEvans/Documents/20220412_VoterDetail/',
  126. 'outputdir':'C:/Users/JohnEvans/Documents/Voterfile_Samples/',
  127. 'sample_size':'20000',
  128. 'slice_size':'5000'} #see settings file for sample_size, slice_size, and directory paths
  129. sample_size = 20000
  130. slice_size = 5000
  131. application_log = []
  132. allvoterfiles = pd.DataFrame({})
  133. time_start = datetime.datetime.now()
  134.  
  135.  
  136. def timeNow():
  137. return str(datetime.datetime.now().strftime('%Y-%m-%d T%H;%M;%S'))
  138.  
  139. def timeElapsed():
  140. global time_start
  141. time_since_start = datetime.datetime.now() - time_start
  142. application_log_builder('Time end: ' + timeNow() + '\n' + 'Time elapsed: ' + str(time_since_start))
  143.  
  144.  
  145. def application_log_builder(string):
  146. global application_log
  147. application_log.append(string)
  148. if running_as_console:
  149. print(string)
  150. else:
  151. print(string) # update tkinter widget when this program moves to UI-based app
  152.  
  153. def save_application_log():
  154. reportFilename = 'Sample Extractor Report '+timeNow()+'.txt'
  155. with open(os.path.join(settings['outputdir'],reportFilename),'a') as text:
  156. text.write('\n'.join(application_log))
  157.  
  158.  
  159. def read_into_dictionary(path): #this is to allow custom data to be stored externally
  160. temp_dict = {}
  161. with open(path,'r') as text:
  162. for line in text:
  163. delimiter = line.index(':')
  164. key = line.rstrip('\r\n')[:delimiter]
  165. value = line.rstrip('\r\n')[delimiter+1:]
  166. temp_dict[key] = value
  167. return temp_dict
  168.  
  169.  
  170. def load_variables(): #loading external data
  171. global columns_keep
  172. global header_voterfile
  173. global lookup_sachsregions
  174. global lookup_county_code_to_name
  175. global lookup_race
  176. global settings
  177. with open(os.path.join('data', 'columns_keep.txt')) as f:
  178. columns_keep = [word.rstrip() for word in f]
  179.  
  180. with open(os.path.join('data', 'header_voterfile.txt')) as f:
  181. header_voterfile = [word.rstrip() for word in f]
  182.  
  183. lookup_sachsregions = read_into_dictionary(os.path.join('data', 'lookup_sachsregions.txt'))
  184. lookup_county_code_to_name = read_into_dictionary(os.path.join('data', 'lookup_county_code_to_name.txt'))
  185. lookup_race = read_into_dictionary(os.path.join('data', 'lookup_race.txt'))
  186. settings = read_into_dictionary(os.path.join('data', 'settings.txt'))
  187. sample_size = int(settings['sample_size'])
  188. slice_size = int(settings['slice_size'])
  189.  
  190. def show_vf_layout(): #opening a text copy of the voter file extract layout (data dictionary) in the installed text editor.
  191. os.startfile(os.path.join('data', 'voter_file_extract_layout.txt'))
  192.  
  193. def filter(df, column, filter_by): # this is for filtering a sample by a set of counties, races, etc.
  194. # for continuous intervals, see: https://stackoverflow.com/questions/30624599/pandas-isin-function-for-continuous-intervals
  195. if type(filter_by) is not list:
  196. filter_by = [filter_by]
  197. filteredDF = df.loc[df[column].isin(filter_by)]
  198. return filteredDF
  199.  
  200. def import_all_voter_files(): # https://stackoverflow.com/a/36416258
  201. #Note, to get counts by category, use: allvoterfiles['Party_Affiliation'].value_counts()
  202. global allvoterfiles
  203.  
  204. load_variables()
  205. application_log_builder('Reading all voter files from directory: \n' + settings['path_voterfile'] + '\n' + 'Time start: ' + timeNow())
  206.  
  207. path = settings['path_voterfile']
  208. all_files = glob.glob(os.path.join(path, '*.txt')) # all_files is a list of all files
  209.  
  210. # looping through all_files to add each county's voterfile as an element in a list - a list of dataframes
  211. #on second thought - I think I want an message output signalling each state's file being worked through. It should just be a "print(f)" put in the right place.
  212. #df_from_each_file = (pd.read_csv(f,
  213. # sep='\t',
  214. # header=None,
  215. # names=header_voterfile,
  216. # dtype=str) for f in all_files) # Is there any advantage to making Voter_ID the index?
  217.  
  218.  
  219. df_from_each_file = []
  220. file_sizes = []
  221. progressbar_value = 0
  222.  
  223.  
  224. for s in all_files:
  225. file_sizes.append(os.path.getsize(s))
  226.  
  227. print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*100))+"% beginning collection") #sanitycheck
  228.  
  229. # collecting all voter files
  230. for f in all_files:
  231. df_from_each_file.append(pd.read_csv(f,
  232. sep='\t',
  233. header=None,
  234. names=header_voterfile,
  235. dtype=str))
  236.  
  237. progressbar_value = sum(file_sizes[0:all_files.index(f)+1])
  238. print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*10000)/100)+"% " + "{: >8.2f}".format(file_sizes[all_files.index(f)]/1048576) + "MB " + f) #sanitycheck
  239.  
  240.  
  241.  
  242.  
  243. allvoterfiles = pd.concat(df_from_each_file, ignore_index=True)
  244.  
  245. allvoterfiles.info() #SANITYCHECK
  246.  
  247. timeElapsed()
  248.  
  249.  
  250. #QUESTION: Why does the calculate_age() function throw an error PRIOR to dropping columns using df[columns_keep]? Why doesn't it work for the start?
  251. def calculate_age(df):
  252. application_log_builder('Calculating ages...')
  253. #df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='ignore')
  254. df["Birth_Date"] = pd.to_datetime(df["Birth_Date"]) #This doesn't seem to work when used in this context. When used on its own it seems to work. Why does it not work in this context? Why the inconsistency? How to make work in this context?
  255. #NEW VERSION? df["Birth_Date"] = pd.to_datetime(df["Birth_Date"], errors='coerce') # errors = 'coerce' is meant to push past any entries that do not resemble dates.
  256. today = datetime.datetime.now()
  257. df['Age'] = today - df['Birth_Date']
  258. df['Age'] = df['Age'].dt.days
  259. df['Age'] = np.floor(df['Age']/365.2425).astype('int')
  260. timeElapsed()
  261. return df
  262.  
  263. #def calculate_age():
  264. # try:
  265. # return dt.datetime.strptime(x, '%m/%d/%Y')
  266. # except:
  267. # return pd.NaT
  268. ##then use: df['Birth_Date'].apply(calculate_age)
  269.  
  270. #def calculate_age(df):
  271. # df.loc[df['Birth_Date'] !=float("NaN"), 'Birth_Date'] = pd.to_datetime(df['Birth_Date'])
  272.  
  273. def clean_voter_file(df):
  274. application_log_builder('Cleaning voter file . . .')
  275. application_log_builder('Cleaning up NaNs . . .')
  276. df = df.replace(np.nan, '', regex=True)
  277. application_log_builder('Cleaning up *s . . .')
  278. df = df.replace('*', '', regex=False)
  279. application_log_builder('Trimming blank spaces from email addresses . . .')
  280. df['Email_Address'] = df['Email_Address'].str.strip()
  281. timeElapsed()
  282. return df
  283.  
  284.  
  285. def delete_entries_without_emails(df):
  286. application_log_builder('Deleting entries that do not have email addresses . . .')
  287. df = df[df['Email_Address'].str.len()>2]
  288. timeElapsed()
  289. return df
  290.  
  291.  
  292. #%% MAIN sample-extraction function
  293. def donow():
  294. global time_start
  295.  
  296. global allvoterfiles
  297.  
  298. import_all_voter_files()
  299.  
  300. allvoterfiles = delete_entries_without_emails(allvoterfiles)
  301. #allvoterfiles = allvoterfiles[allvoterfiles['Email_Address'].str.len()>2] #replace with delete_entries_without_emails()?
  302. allvoterfiles = allvoterfiles[columns_keep] # eliminating extraneous columns
  303. allvoterfiles = allvoterfiles.replace({'County_Code':lookup_county_code_to_name}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
  304. #allvoterfiles = allvoterfiles.replace({'County_Code':lookup_region}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
  305. allvoterfiles = allvoterfiles.replace({'Race':lookup_race}) # re-coding Race codes as verbose racial labels using a lookup-dictionary
  306. allvoterfiles = clean_voter_file(allvoterfiles) # trimming whitespace and NaNs
  307.  
  308. allvoterfiles = calculate_age(allvoterfiles)
  309. allvoterfiles = allvoterfiles.drop(columns=['Birth_Date'])
  310.  
  311. application_log_builder('Cutting a sample of: ' + settings['sample_size'] +' entries')
  312.  
  313. allvoterfiles_sample = allvoterfiles.sample(n=sample_size)
  314. timeSampleExtracted = timeNow()
  315. timeElapsed()
  316.  
  317. application_log_builder('Exporting sample slices...')
  318. # I think I should turn the following for-loop into a function - ADD TO TO-DO?
  319. # I also think I should generalize this donow() function so that ALL counties or just one county can be extracted - ADD TO TO-DO?
  320. for start in range(0, allvoterfiles_sample.shape[0], slice_size): #can we do allvoterfiles_sample[columns_keep] instead? So that we can export BOTH a SurveyMonkey-ready sample with 8 columns, and another version of the sample sample with ALL columns (for rematching if necessary)?
  321.  
  322. subset = allvoterfiles_sample.iloc[start:start + slice_size]
  323.  
  324. #creating numerical labels for each slice
  325. section = str(int(start/slice_size)+1)
  326. sectionTotal = str(int(math.ceil(sample_size/slice_size)))
  327.  
  328. #building filename string
  329. startingEntry = str(start+1).zfill(5)
  330. endingEntry = str(int(min(sample_size, start+slice_size))).zfill(5)
  331. outputFilename = timeSampleExtracted + ' ' + startingEntry + '-' + endingEntry + ' sample of ' + settings['sample_size'] + '.csv'
  332.  
  333. #output message
  334. application_log_builder('...exporting section ' + section + ' of ' + sectionTotal)
  335.  
  336. subset.to_csv(os.path.join(settings['outputdir'],outputFilename), header=False, index=False)
  337.  
  338. application_log_builder('DONE!\n\nFor your files, look in directory: \n\t'+str(os.path.join(settings['outputdir'])))
  339. timeElapsed()
  340.  
  341.  
  342.  
  343.  
  344.  
  345. if __name__ == "__main__":
  346. running_as_console = True
  347. application_log_builder('Welcome to the SMG Florida Voter-file handler!'+'\n'+'The time is: ' + timeNow())
  348. load_variables()
  349.  
  350.  
  351. ##ASK FOR INPUT
  352. ## DO MAGIC STUFF
  353. ## Set variable path_voterfile to whatever the user input
  354. ##FINALLY CALL DONOW()