florida_voter_file_sample_extractor
Florida_Voter_File_refactoring.py
#%% Introduction
# -*- coding: utf-8 -*-
'''
Put the .py file in its own folder (maybe on a separate drive; try a USB drive).
'''
#%% TO-DO LIST
'''
TO DO:
13) build executable/GUI using 'gooey' library (or using 'Python GTK', maybe also with Glade)
16) HIGHER PRIORITY: Along with a set of csv files fit for SurveyMonkey, ALSO output a companion file, using the SAME entries, but including ALL columns. This allows for a DIY rematching in an elegant way for even basic users.
16a) This will require significant changes to the donow() function, where we SAMPLE BEFORE cutting out extraneous columns.
16b) The process will go something like (after importing):
16b.01) Delete entries without emails
16b.02) Recode county codes to county names
16b.03) Recode race codes to race names
16b.04) Trim whitespace and NaNs
16b.05) Calculate age
16b.051) (Delete the line to drop Birth_Date. Should be unnecessary if we drop extraneous columns with columns_keep last.)
16b.06) Cut sample
16b.07) Output all-column version
16b.08) Cull extraneous columns
16b.09) THEN output slices using the current loop programming
16b.10) End the process
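A minimal sketch of the 16b ordering, using a toy DataFrame (the columns and values here are illustrative, not the real voter-file layout): sample first, keep an all-column companion copy, then cull to columns_keep for the SurveyMonkey-ready version.

```python
# Toy demonstration of 16b: sample BEFORE culling columns, so the SAME
# sampled entries can be written once with all columns (companion file)
# and once with only columns_keep.
import pandas as pd

df = pd.DataFrame({
    'Voter_ID': ['1', '2', '3', '4'],
    'Email_Address': ['a@x.com', '', 'c@x.com', 'd@x.com'],
    'County_Code': ['ALA', 'BAY', 'DAD', 'DUV'],
})
columns_keep = ['Voter_ID', 'Email_Address']

df = df[df['Email_Address'].str.len() > 2]   # 16b.01: drop blank emails
sample = df.sample(n=2, random_state=0)      # 16b.06: cut the sample first
companion = sample.copy()                    # 16b.07: all-column version
survey = sample[columns_keep]                # 16b.08: cull extraneous columns
```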
17) Low priority: Adjust the use of times in progress feedback. Maybe only use time elapsed ONCE at the end?
18) For the end-product application, I want to add more lines to the settings file. Like:
18.1) The settings file should include the following categories:
18.1.1) Settings, original - the original "factory-default" settings for the application. This is so that if the settings ever become broken, this remains a failsafe.
18.1.2) Settings, user defaults - this is to allow ease of use on a regular day-to-day basis
18.1.3) Settings, for current project - this is for special cases
18.2) The settings base file should be easy for the GUI package to interpret, making it easy for a user to know what settings they are changing
18.3) The settings base file probably needs, for each 'setting':
18.3.1) The associated variable name
18.3.2) The associated variable type
18.3.3) The GUI text describing the setting
18.3.4) A long description (for the help file) explaining each setting and how to use it
18.3.5) The actual value, string, or list that the setting variable contains
18.4) columns_keep should be an option in the settings
18.5) counties_filter: A list of strings to allow the user to select which counties they want to have in their sample
18.6) races_filter: A list of strings to allow the user to select which races they want to have in their sample
18.7) political_party_filter: A list of strings to allow the user to select which political parties they want to have in their sample
18.8) optional_filter_1_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
18.9) optional_filter_1_list: A list of strings to allow the user to select which values in optional filter #1 they want to have in their sample
18.10) optional_filter_2_column_name: A string to allow the user to choose a column for optional filter #2. Needs an associated list of strings
18.11) optional_filter_2_list: A list of strings to allow the user to select which values in optional filter #2 they want to have in their sample
18.12) optional_filter_3_column_name: A string to allow the user to choose a column for optional filter #3. Needs an associated list of strings
18.13) optional_filter_3_list: A list of strings to allow the user to select which values in optional filter #3 they want to have in their sample
ZZ) LOWEST PRIORITY: Is there a way to EXTREMELY reduce the memory requirements?
ZZa) Maybe count the number of lines per file,
     building an index to ALL rows available in all files
     (or maybe all rows with valid emails),
     then drawing a sample from that index,
     then drawing / cleaning / appending line by line to new files.
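The ZZa idea could be sketched as a two-pass, line-indexed sampler. Everything below is an assumption-laden illustration: the helper name and the email column position are placeholders, not the real file layout.

```python
# Sketch of ZZa: index rows across files WITHOUT loading them into a
# dataframe, sample row numbers, then re-read only the chosen lines.
# email_col is a placeholder for the email field's position.
import random

def sample_rows(paths, k, email_col, sep='\t'):
    # Pass 1: record (path, line_number) for every row with a plausible email.
    index = []
    for path in paths:
        with open(path, encoding='utf-8', errors='replace') as f:
            for lineno, line in enumerate(f):
                fields = line.rstrip('\r\n').split(sep)
                if len(fields) > email_col and len(fields[email_col].strip()) > 2:
                    index.append((path, lineno))
    # Pass 2: draw the sample, then copy only the sampled lines.
    chosen = set(random.sample(index, min(k, len(index))))
    rows = []
    for path in paths:
        wanted = {n for p, n in chosen if p == path}
        with open(path, encoding='utf-8', errors='replace') as f:
            for lineno, line in enumerate(f):
                if lineno in wanted:
                    rows.append(line)
    return rows
```

Memory use stays proportional to the index (one tuple per valid row), not to the file contents, at the cost of reading each file twice.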
# in the final product, it would be cool to auto-detect counties, races, ages, zip codes, and maybe some other things
4) Find a way to turn the code into a standalone program that works in a VERY user-friendly way, such that even an intern can extract a voter file sample.
4-) SEE THIS: https://docs.python-guide.org/shipping/freezing/
4a) http://www.py2exe.org/index.cgi/Tutorial
4b) Detecting the presence of the voter files, and building a path to them?
4c) Customized sample/slice sizing?
4d) GUI? https://www.tutorialspoint.com/python/python_gui_programming.htm
4e) Folder selection dialog: https://stackoverflow.com/questions/11295917/how-to-select-a-directory-and-store-the-location-using-tkinter-in-python
4f) Help file: using the app, troubleshooting, using the voter file
4g) os.startfile(os.path.join('data', 'voter_file_extract_layout.txt')) opens up a text file copy of the voter file extract layout document
4h) OTHER GUI OPTION: 'Python GTK': https://python-gtk-3-tutorial.readthedocs.io/en/latest/introduction.html
4h.1) https://stackoverflow.com/questions/17698138/python-gui-programming-using-drag-and-drop-also-incorporating-stdout-redirect
4h.2) https://zetcode.com/python/gtk/
4i) Glade: https://glade.gnome.org/
12) An option to save settings. For example, the default 'settings' should be the user-defined source and output directory, as well as the standard set of columns and sample/slice sizes. But it would be nice to load other settings.
DONE (success) ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1) Try bulk importing without regard to memory usage. Try to overload the computer.
1a) Loop to iterate through voter files and concatenate them.
1b) Even consider concatenating whole files, and not just rows with emails, just to stress-test. Performance in this regard sets the stage for 538-type voter-history analysis.
2) Build random-sample extractor
3) Currently, the email sensor only eliminates blank and '*' emails. Can we check for and eliminate improperly formatted emails?
3a) search for: #EMAIL SENSOR
    RESPONSE: As it is, it's probably optimized. I do not capture anything but email addresses (even if improperly formatted), but cleaning improperly formatted email addresses is WAY out of scope.
5) Is garbage collection necessary? https://stackoverflow.com/a/39101287 NOT NECESSARY IN THIS PROGRAM
X) When compiling to an EXE, consider storing column names in an end-user-editable file, like a JSON, to future-proof against column name changes
   DONE! See 'data' folder
X) In the final product, would it be possible to auto-detect column headers?
   No, because the voter file has no headers.
6) check read_into_dictionary for \r\n compatibility
7) slice & sample output loop needs refactoring and cleaning
7a) ( for start in range(0, allvoterfiles_sample.shape[0], slice_size): )
X) all_voter_files progress tracker: see the python file: 'progress tracker example [see lines 128 to 150].py'
8) import_all_voter_files needs refactoring
8a) consider cleaning up / better defining how import_all_voter_files() uses globals or the 'allvoterfiles' variable
8b) the goal is to load all voter files into a single dataframe for editing and for setting up the donow() algorithm
9) clean_voter_file - does this need to be factored out?
10) date_voterfile & path_voterfile & output dir should be treated the same as the lookup dictionaries, so that the final GUI app can ask the user to define them on first launch, then treat the user-defined variables as defaults from then on.
11) Clean up use of single- vs double-quotes in strings - DONE
14) fix import_all_voter_files() log outputs so that it works when executing the donow() function
15) add output feedback on each county being imported
'''
''' This program is intended to extract a random sample of emails from the Florida voter file for use in email surveys. '''
#%% Importing packages, declaring variables and functions
import datetime
import os
import pandas as pd
import glob
import numpy as np
import math
pd.set_option('display.max_columns', None) #allows better error-checking
# initializing lists and dictionaries. Please see these files in the 'data' folder in case the column names change.
running_as_console = False
columns_keep = []
header_voterfile = []
lookup_sachsregions = {}
lookup_county_code_to_name = {}
lookup_race = {}
#is this 'settings' variable necessary with the settings file?
settings = {'path_voterfile': 'C:/Users/JohnEvans/Documents/20220412_VoterDetail/',
            'outputdir': 'C:/Users/JohnEvans/Documents/Voterfile_Samples/',
            'sample_size': '20000',
            'slice_size': '5000'} #see settings file for sample_size, slice_size, and directory paths
sample_size = 20000
slice_size = 5000
application_log = []
allvoterfiles = pd.DataFrame({})
time_start = datetime.datetime.now()
def timeNow():
    return str(datetime.datetime.now().strftime('%Y-%m-%d T%H;%M;%S'))

def timeElapsed():
    global time_start
    time_since_start = datetime.datetime.now() - time_start
    application_log_builder('Time end: ' + timeNow() + '\n' + 'Time elapsed: ' + str(time_since_start))

def application_log_builder(string):
    global application_log
    application_log.append(string)
    if running_as_console:
        print(string)
    else:
        print(string) # update tkinter widget when this program moves to UI-based app

def save_application_log():
    reportFilename = 'Sample Extractor Report ' + timeNow() + '.txt'
    with open(os.path.join(settings['outputdir'], reportFilename), 'a') as text:
        text.write('\n'.join(application_log))

def read_into_dictionary(path): #this is to allow custom data to be stored externally
    temp_dict = {}
    with open(path, 'r') as text:
        for line in text:
            if ':' not in line: #skip blank or malformed lines instead of raising ValueError
                continue
            delimiter = line.index(':')
            key = line.rstrip('\r\n')[:delimiter]
            value = line.rstrip('\r\n')[delimiter+1:]
            temp_dict[key] = value
    return temp_dict
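For reference, an illustrative fragment of the key:value format read_into_dictionary() expects. The helper below mirrors its parsing logic in a standalone form; the example keys match the settings dict, but the file contents shown are hypothetical.

```python
# parse_lines() mirrors read_into_dictionary()'s parsing: split on the
# FIRST ':' only, so Windows drive-letter colons inside values survive.
def parse_lines(lines):
    d = {}
    for line in lines:
        if ':' not in line:
            continue
        i = line.index(':')
        d[line.rstrip('\r\n')[:i]] = line.rstrip('\r\n')[i + 1:]
    return d

example = [
    'sample_size:20000\n',
    'path_voterfile:C:/Users/JohnEvans/Documents/20220412_VoterDetail/\n',
]
parsed = parse_lines(example)
```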
def load_variables(): #loading external data
    global columns_keep
    global header_voterfile
    global lookup_sachsregions
    global lookup_county_code_to_name
    global lookup_race
    global settings
    global sample_size #without these two globals, the int conversions below
    global slice_size  #only assigned locals and never updated the module-level values
    with open(os.path.join('data', 'columns_keep.txt')) as f:
        columns_keep = [word.rstrip() for word in f]
    with open(os.path.join('data', 'header_voterfile.txt')) as f:
        header_voterfile = [word.rstrip() for word in f]
    lookup_sachsregions = read_into_dictionary(os.path.join('data', 'lookup_sachsregions.txt'))
    lookup_county_code_to_name = read_into_dictionary(os.path.join('data', 'lookup_county_code_to_name.txt'))
    lookup_race = read_into_dictionary(os.path.join('data', 'lookup_race.txt'))
    settings = read_into_dictionary(os.path.join('data', 'settings.txt'))
    sample_size = int(settings['sample_size'])
    slice_size = int(settings['slice_size'])
def show_vf_layout(): #opening a text copy of the voter file extract layout (data dictionary) in the installed text editor
    os.startfile(os.path.join('data', 'voter_file_extract_layout.txt'))

def filter(df, column, filter_by): # filtering a sample by a set of counties, races, etc. NOTE: this shadows the built-in filter()
    # for continuous intervals, see: https://stackoverflow.com/questions/30624599/pandas-isin-function-for-continuous-intervals
    if type(filter_by) is not list:
        filter_by = [filter_by]
    filteredDF = df.loc[df[column].isin(filter_by)]
    return filteredDF
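A usage sketch for the filter above, with toy data. The function is renamed filter_rows here only so the standalone snippet avoids shadowing the built-in; the logic is the same.

```python
# Keep only rows whose column value appears in filter_by; a scalar
# filter_by is wrapped into a one-element list first.
import pandas as pd

def filter_rows(df, column, filter_by):
    if type(filter_by) is not list:
        filter_by = [filter_by]
    return df.loc[df[column].isin(filter_by)]

df = pd.DataFrame({'County_Code': ['ALA', 'BAY', 'DAD'],
                   'Voter_ID': ['1', '2', '3']})
subset = filter_rows(df, 'County_Code', ['ALA', 'DAD'])
```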
def import_all_voter_files(): # https://stackoverflow.com/a/36416258
    #Note: to get counts by category, use: allvoterfiles['Party_Affiliation'].value_counts()
    global allvoterfiles
    load_variables()
    application_log_builder('Reading all voter files from directory: \n' + settings['path_voterfile'] + '\n' + 'Time start: ' + timeNow())
    path = settings['path_voterfile']
    all_files = glob.glob(os.path.join(path, '*.txt')) # all_files is a list of all files
    # looping through all_files to add each county's voter file as an element in a list of dataframes
    #on second thought - I think I want a message output signalling each county's file being worked through. It should just be a "print(f)" put in the right place.
    #df_from_each_file = (pd.read_csv(f,
    #                                 sep='\t',
    #                                 header=None,
    #                                 names=header_voterfile,
    #                                 dtype=str) for f in all_files) # Is there any advantage to making Voter_ID the index?
    df_from_each_file = []
    file_sizes = []
    progressbar_value = 0
    for s in all_files:
        file_sizes.append(os.path.getsize(s))
    print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*100)) + "% beginning collection") #sanitycheck
    # collecting all voter files
    for f in all_files:
        df_from_each_file.append(pd.read_csv(f,
                                             sep='\t',
                                             header=None,
                                             names=header_voterfile,
                                             dtype=str))
        progressbar_value = sum(file_sizes[0:all_files.index(f)+1])
        print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*10000)/100) + "% " + "{: >8.2f}".format(file_sizes[all_files.index(f)]/1048576) + "MB " + f) #sanitycheck
    allvoterfiles = pd.concat(df_from_each_file, ignore_index=True)
    allvoterfiles.info() #SANITYCHECK
    timeElapsed()
#QUESTION: Why does the calculate_age() function throw an error PRIOR to dropping columns using df[columns_keep]? Why doesn't it work from the start?
def calculate_age(df):
    application_log_builder('Calculating ages...')
    #df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='ignore')
    df["Birth_Date"] = pd.to_datetime(df["Birth_Date"]) #This doesn't seem to work in this context, though it works on its own. Why the inconsistency? How to make it work here?
    #NEW VERSION? df["Birth_Date"] = pd.to_datetime(df["Birth_Date"], errors='coerce') # errors='coerce' is meant to push past any entries that do not resemble dates.
    today = datetime.datetime.now()
    df['Age'] = today - df['Birth_Date']
    df['Age'] = df['Age'].dt.days
    df['Age'] = np.floor(df['Age']/365.2425).astype('int')
    timeElapsed()
    return df

#def calculate_age(x):
#    try:
#        return datetime.datetime.strptime(x, '%m/%d/%Y')
#    except ValueError:
#        return pd.NaT
##then use: df['Birth_Date'].apply(calculate_age)
#def calculate_age(df):
#    df.loc[df['Birth_Date'] != float("NaN"), 'Birth_Date'] = pd.to_datetime(df['Birth_Date'])
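A sketch of the "NEW VERSION?" idea above: with errors='coerce', unparseable birth dates become NaT instead of raising, and the nullable Int64 dtype lets the resulting missing ages survive the integer cast as <NA>. This is a hedged variant, not the version the script currently runs.

```python
# Coerce-based age calculation: bad dates -> NaT -> <NA> ages, instead
# of an exception partway through the dataframe.
import datetime
import numpy as np
import pandas as pd

def calculate_age_coerce(df):
    df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='coerce')
    days = (datetime.datetime.now() - df['Birth_Date']).dt.days  # NaN for NaT rows
    df['Age'] = np.floor(days / 365.2425).astype('Int64')        # nullable ints keep <NA>
    return df

demo = pd.DataFrame({'Birth_Date': ['01/01/1980', 'not a date']})
demo = calculate_age_coerce(demo)
```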
def clean_voter_file(df):
    application_log_builder('Cleaning voter file . . .')
    application_log_builder('Cleaning up NaNs . . .')
    df = df.replace(np.nan, '', regex=True)
    application_log_builder('Cleaning up *s . . .')
    df = df.replace('*', '', regex=False)
    application_log_builder('Trimming blank spaces from email addresses . . .')
    df['Email_Address'] = df['Email_Address'].str.strip()
    timeElapsed()
    return df

def delete_entries_without_emails(df):
    application_log_builder('Deleting entries that do not have email addresses . . .')
    df = df[df['Email_Address'].str.len() > 2]
    timeElapsed()
    return df
#%% MAIN sample-extraction function
def donow():
    global time_start
    global allvoterfiles
    import_all_voter_files()
    allvoterfiles = delete_entries_without_emails(allvoterfiles)
    #allvoterfiles = allvoterfiles[allvoterfiles['Email_Address'].str.len()>2] #replace with delete_entries_without_emails()?
    allvoterfiles = allvoterfiles[columns_keep] # eliminating extraneous columns
    allvoterfiles = allvoterfiles.replace({'County_Code': lookup_county_code_to_name}) # re-coding County_Code as county names using a lookup dictionary
    #allvoterfiles = allvoterfiles.replace({'County_Code': lookup_region}) # re-coding County_Code as Sachs Regions using a lookup dictionary
    allvoterfiles = allvoterfiles.replace({'Race': lookup_race}) # re-coding Race codes as verbose racial labels using a lookup dictionary
    allvoterfiles = clean_voter_file(allvoterfiles) # trimming whitespace and NaNs
    allvoterfiles = calculate_age(allvoterfiles)
    allvoterfiles = allvoterfiles.drop(columns=['Birth_Date'])
    application_log_builder('Cutting a sample of: ' + settings['sample_size'] + ' entries')
    allvoterfiles_sample = allvoterfiles.sample(n=sample_size)
    timeSampleExtracted = timeNow()
    timeElapsed()
    application_log_builder('Exporting sample slices...')
    # I think I should turn the following for-loop into a function - ADD TO TO-DO?
    # I also think I should generalize this donow() function so that ALL counties or just one county can be extracted - ADD TO TO-DO?
    for start in range(0, allvoterfiles_sample.shape[0], slice_size): #can we do allvoterfiles_sample[columns_keep] instead? So that we can export BOTH a SurveyMonkey-ready sample with 8 columns, and another version of the sample with ALL columns (for rematching if necessary)?
        subset = allvoterfiles_sample.iloc[start:start + slice_size]
        #creating numerical labels for each slice
        section = str(int(start/slice_size) + 1)
        sectionTotal = str(int(math.ceil(sample_size/slice_size)))
        #building filename string
        startingEntry = str(start + 1).zfill(5)
        endingEntry = str(int(min(sample_size, start + slice_size))).zfill(5)
        outputFilename = timeSampleExtracted + ' ' + startingEntry + '-' + endingEntry + ' sample of ' + settings['sample_size'] + '.csv'
        #output message
        application_log_builder('...exporting section ' + section + ' of ' + sectionTotal)
        subset.to_csv(os.path.join(settings['outputdir'], outputFilename), header=False, index=False)
    application_log_builder('DONE!\n\nFor your files, look in directory: \n\t' + str(os.path.join(settings['outputdir'])))
    timeElapsed()
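The slice-export loop in donow() could be factored out roughly as below, taking its inputs as parameters instead of globals. The name export_slices and its signature are illustrative, not existing code; the filename arithmetic mirrors the loop above (here keyed to the actual sample length, and with a to_csv switch so the naming can be exercised without writing files).

```python
# Hypothetical refactor of the slice-export loop: same slicing and
# zero-padded filename arithmetic, parameterized and returning the
# generated filenames.
import math
import os

def export_slices(sample_df, slice_size, outputdir, stamp, to_csv=True):
    total = sample_df.shape[0]
    filenames = []
    for start in range(0, total, slice_size):
        subset = sample_df.iloc[start:start + slice_size]
        first = str(start + 1).zfill(5)
        last = str(min(total, start + slice_size)).zfill(5)
        name = f'{stamp} {first}-{last} sample of {total}.csv'
        if to_csv:
            subset.to_csv(os.path.join(outputdir, name), header=False, index=False)
        filenames.append(name)
    return filenames
```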
if __name__ == "__main__":
    running_as_console = True
    application_log_builder('Welcome to the SMG Florida Voter-file handler!' + '\n' + 'The time is: ' + timeNow())
    load_variables()
    ##ASK FOR INPUT
    ## DO MAGIC STUFF
    ## Set variable path_voterfile to whatever the user input
    ##FINALLY CALL DONOW()
florida_voter_file_sample_extractor.txt · Last modified: 2022/04/30 18:50 by john