florida_voter_file_sample

Florida_Voter_File_refactoring.py
#%% Introduction
# -*- coding: utf-8 -*-
'''
put .py file in its own folder (maybe on its separate drive, try usb drive)
 
 
 
'''
 
#%% TO-DO LIST
'''
TO DO:
    13) build executable/GUI using 'gooey' library (Or using 'Python GTK', maybe also with Glade)
    16) HIGHER PRIORITY: Along with a set of csv files fit for SurveyMonkey, ALSO output a companion file, using the SAME entries, but including ALL columns. This allows for a DIY rematching in an elegant way for even basic users.
        16a) This will require significant changes to the donow() function, where we SAMPLE BEFORE cutting out extraneous columns.
        16b) The process will go something like (after importing): 
            16b.01) Delete entries without emails
            16b.02) Recode county codes to county names
            16b.03) Recode race codes to race names
            16b.04) Trim whitespace and NaNs
            16b.05) Calculate age
            16b.051) (Delete the line to drop Birth_Date. Should be unnecessary if we drop extraneous columns with columns_keep last.)
            16b.06) Cut sample
            16b.07) Output all-column version
            16b.08) Cull extraneous columns
            16b.09) THEN output slices using the current loop programming
            16b.10) End the process
    17) Low priority: Adjust the use of times in progress feedback. Maybe only use time elapsed ONCE at the end?
    18) For end-product application, I want to add more lines to the settings file. Like:
        18.1) The settings file should include the following categories:
            18.1.1) Settings, original - the original "factory-default" settings for the application. This is so that if the settings ever become broken, this remains a failsafe.
            18.1.2) Settings, user defaults - this is to allow ease-of-use on a regular day-to-day
            18.1.3) Settings, for current project - this is for special cases
        18.2) The settings base file should be easy for the GUI package to interpret, making it easy for a user to know what settings they are changing
        18.3) The settings base file probably needs, for each 'setting':
            18.3.1) The associated variable name
            18.3.2) the associated variable type
            18.3.3) The gui text describing the setting
            18.3.4) A long descripiton (for the help file) explaining each setting and how to use it
            18.3.5) The actual value, string, or list that the setting variable contains
        18.4) columns_keep should be an option in the settings
        18.5) counties_filter: A list of strings to allow the user to select which counties they want to have in their sample
        18.6) races_filter: A list of strings to allow the user to select which races they want to have in their sample
        18.7) political_party_filter: A list of strings to allow the user to select which political parties they want to have in their sample
        18.8) optional_filter_1_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
        18.9) optional_filter_1_list: A list of strings to allow the user to select which values in optional filter #1 they want to have in their sample
        18.10) optional_filter_2_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
        18.11) optional_filter_2_list: A list of strings to allow the user to select which values in optional filter #2 they want to have in their sample
        18.12) optional_filter_3_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
        18.13) optional_filter_3_list: A list of strings to allow the user to select which values in optional filter #3 they want to have in their sample
    ZZ) LOWEST PRIORITY: Is there a way to EXTREMELY reduce the memory requirements?
        ZZa) Maybe count the number of lines per file, 
            building an index to ALL rows available in all files 
            (or maybe all rows with valid emails), 
            then drawing a sample from that index, 
            then drawing / cleaning / appending line by line to new files.
 
 
 
# in the final product, it would be cool to auto-detect counties, races, ages, zip codes, and maybe some other things
    4) Find a way to turn the code into a standalone program, that works in a VERY user friendly way, such that even an intern can extract a voter file sample.
        4-) SEE THIS: https://docs.python-guide.org/shipping/freezing/
        4a) http://www.py2exe.org/index.cgi/Tutorial
        4b) Detecting the presence of the voter files, and building a path to them?
        4c) Customized sample/slice sizing?  
        4d) GUI? https://www.tutorialspoint.com/python/python_gui_programming.htm
        4e) Folder selection dialog: https://stackoverflow.com/questions/11295917/how-to-select-a-directory-and-store-the-location-using-tkinter-in-python
        4f) Help file: using the app, troubleshooting, using the voter file
        4g) os.startfile(os.path.join('data', 'voter_file_extract_layout.txt')) opens up a text file copy of the voter file extract layout document
        4h) OTHER GUI OPTION: 'Python GTK': https://python-gtk-3-tutorial.readthedocs.io/en/latest/introduction.html
            4h.1) https://stackoverflow.com/questions/17698138/python-gui-programming-using-drag-and-drop-also-incorporating-stdout-redirect
            4h.2) https://zetcode.com/python/gtk/
        4i) Glade: https://glade.gnome.org/
    12) an option to save settings. For example, the default 'settings' should be the user-defined source and output directory, as well as the standard set of columns and sample/slice sizes. But it would be nice to load other settings.
 
 
DONE (success)    -----    -----    -----    -----    -----    -----    -----    -----    -----    -----
    1) Try bulk importing without regard to memory usage. Try to overload the computer.
        1a) Loop to iterate through voter files and concatenate them.
        1b) Even consider concatenating whole-files, and not just rows with emails, just to stress-test. Performance in this regard sets the stage for 538-type voter-history analysis.    
    2) Build random-sample extractor
    3) Currently, the email-sensor only eliminates blank and '*' emails. Can we check for and eliminate improperly formatted emails?
        3a) searchfor : #EMAIL SENSOR
        RESPONSE: As it is, it's probably optimized. I do not capture anything but email addresses (even if improperly formatted) but cleaning improperly formatted email addresses is WAY out of scope.
    5) Is garbage collection necessary? https://stackoverflow.com/a/39101287 NOT NECESSARY IN THIS PROGRAM
    X) when compliling to an EXE, consider storing in an end-user editable file, like a JSON to future proof against column name
        DONE! See 'data' folder
    X) in final product, would it be possible to auto-detect column headers?
        No, because the voter file has no headers.
    6) check read_into_dictionary for \r\n compatability
    7) slice & sample output loop needs refactoring and cleaning
        7a) (    for start in range(0, allvoterfiles_sample.shape[0], slice_size):    )
    X) all_voter_files progress tracker: see the python file: 'progress tracker example [see lines 128 to 150].py'
    8) import_all_voter_files needs refactoring
        8a) consider cleaning / or better defining how import_all_voter_files() uses globals or 'allvoterfiles' variable
        8b) the goal is to load all voter files into a single dataframe for editing and for setting up the donow() algorithm
    9) clean_voter_file - does this need to to be factored out?
    10) date_voterfile & path_voterfile & output dir should be treated the same as the lookup dictionaries, so that the final GUI app can ask the user to define them on first launch, then treat the user-defined variables as defaults from then on.
    11) Clean up use of single- vs double-quotes in strings - DONE
    14) fix import_all_voter_files() log outputs so that it works when executing the donow() function
    15) add output feedback on each county being imported
'''
 
 
 
''' This program is intended to extract a random-sample of emails from the Florida voter file for use in email surveys. '''
 
#%% Importing packages, declaring variables and functions
import datetime
import os
import pandas as pd
import glob
import numpy as np
import math
pd.set_option('display.max_columns', None) #allows better error-checking
 
# initializing lists and dictionaries. Please see these files in the 'data' folder in case the column names change.
running_as_console = False
columns_keep = []
header_voterfile = []
lookup_sachsregions = {}
lookup_county_code_to_name = {}
lookup_race = {}
#is this 'settings' variable necessary with the settings file?
settings = {'path_voterfile':'C:/Users/JohnEvans/Documents/20220412_VoterDetail/',
            'outputdir':'C:/Users/JohnEvans/Documents/Voterfile_Samples/',
            'sample_size':'20000',
            'slice_size':'5000'} #see settings file for sample_size, slice_size, and directory paths
sample_size = 20000
slice_size =  5000
application_log = []
allvoterfiles = pd.DataFrame({})
time_start = datetime.datetime.now()
 
 
def timeNow():
    return str(datetime.datetime.now().strftime('%Y-%m-%d T%H;%M;%S'))
 
def timeElapsed():
    global time_start
    time_since_start = datetime.datetime.now() - time_start
    application_log_builder('Time end: ' + timeNow() + '\n' + 'Time elapsed: ' + str(time_since_start))
 
 
def application_log_builder(string):
    global application_log
    application_log.append(string)
    if running_as_console:
        print(string)
    else:
        print(string) # update tkinter widget when this program moves to UI-based app
 
def save_application_log():
    reportFilename = 'Sample Extractor Report '+timeNow()+'.txt'
    with open(os.path.join(settings['outputdir'],reportFilename),'a') as text:
        text.write('\n'.join(application_log))
 
 
def read_into_dictionary(path): #this is to allow custom data to be stored externally
    temp_dict = {}
    with open(path,'r') as text:
            for line in text:
                delimiter = line.index(':')
                key = line.rstrip('\r\n')[:delimiter]
                value = line.rstrip('\r\n')[delimiter+1:]
                temp_dict[key] = value
    return temp_dict
 
 
def load_variables(): #loading external data
    global columns_keep
    global header_voterfile
    global lookup_sachsregions
    global lookup_county_code_to_name
    global lookup_race
    global settings
    with open(os.path.join('data', 'columns_keep.txt')) as f:
        columns_keep = [word.rstrip() for word in f]
 
    with open(os.path.join('data', 'header_voterfile.txt')) as f:
        header_voterfile = [word.rstrip() for word in f]
 
    lookup_sachsregions = read_into_dictionary(os.path.join('data', 'lookup_sachsregions.txt'))
    lookup_county_code_to_name = read_into_dictionary(os.path.join('data', 'lookup_county_code_to_name.txt'))
    lookup_race = read_into_dictionary(os.path.join('data', 'lookup_race.txt'))
    settings = read_into_dictionary(os.path.join('data', 'settings.txt'))
    sample_size = int(settings['sample_size'])
    slice_size = int(settings['slice_size'])
 
def show_vf_layout(): #opening a text copy of the voter file extract layout (data dictionary) in the installed text editor.
    os.startfile(os.path.join('data', 'voter_file_extract_layout.txt'))
 
def filter(df, column, filter_by): # this is for filtering a sample by a set of counties, races, etc.
    # for continuous intervals, see: https://stackoverflow.com/questions/30624599/pandas-isin-function-for-continuous-intervals
    if type(filter_by) is not list:
        filter_by = [filter_by]
    filteredDF = df.loc[df[column].isin(filter_by)]
    return filteredDF
 
def import_all_voter_files():             #  https://stackoverflow.com/a/36416258
    #Note, to get counts by category, use: allvoterfiles['Party_Affiliation'].value_counts()
    global allvoterfiles
 
    load_variables()
    application_log_builder('Reading all voter files from directory: \n' + settings['path_voterfile'] + '\n' + 'Time start: ' + timeNow())
 
    path = settings['path_voterfile']
    all_files = glob.glob(os.path.join(path, '*.txt'))  # all_files is a list of all files
 
    # looping through all_files to add each county's voterfile as an element in a list - a list of dataframes
    #on second thought - I think I want an message output signalling each state's file being worked through. It should just be a "print(f)" put in the right place.
    #df_from_each_file = (pd.read_csv(f,
    #                                 sep='\t',
    #                                 header=None,
    #                                 names=header_voterfile,
    #                                 dtype=str) for f in all_files) # Is there any advantage to making Voter_ID the index? 
 
 
    df_from_each_file = []
    file_sizes = []
    progressbar_value = 0
 
 
    for s in all_files:
        file_sizes.append(os.path.getsize(s))
 
    print(timeNow() + ":  " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*100))+"%  beginning collection") #sanitycheck
 
    # collecting all voter files
    for f in all_files:
        df_from_each_file.append(pd.read_csv(f,
                                     sep='\t',
                                     header=None,
                                     names=header_voterfile,
                                     dtype=str))
 
        progressbar_value = sum(file_sizes[0:all_files.index(f)+1])
        print(timeNow() + ":  " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*10000)/100)+"%  " + "{: >8.2f}".format(file_sizes[all_files.index(f)]/1048576) + "MB " + f) #sanitycheck
 
 
 
 
    allvoterfiles = pd.concat(df_from_each_file, ignore_index=True)
 
    allvoterfiles.info() #SANITYCHECK
 
    timeElapsed()
 
 
#QUESTION: Why does the calculate_age() function throw an error PRIOR to dropping columns using df[columns_keep]? Why doesn't it work for the start?
def calculate_age(df):
    application_log_builder('Calculating ages...')
    #df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='ignore')
    df["Birth_Date"] = pd.to_datetime(df["Birth_Date"]) #This doesn't seem to work when used in this context. When used on its own it seems to work. Why does it not work in this context? Why the inconsistency? How to make work in this context?
    #NEW VERSION? df["Birth_Date"] = pd.to_datetime(df["Birth_Date"], errors='coerce') # errors = 'coerce' is meant to push past any entries that do not resemble dates.
    today = datetime.datetime.now()
    df['Age'] = today - df['Birth_Date']
    df['Age'] = df['Age'].dt.days
    df['Age'] = np.floor(df['Age']/365.2425).astype('int')
    timeElapsed()
    return df
 
#def calculate_age():
#    try:
#        return dt.datetime.strptime(x, '%m/%d/%Y')
#    except:
#        return pd.NaT
##then use: df['Birth_Date'].apply(calculate_age)
 
#def calculate_age(df):
#    df.loc[df['Birth_Date'] !=float("NaN"), 'Birth_Date'] = pd.to_datetime(df['Birth_Date'])
 
def clean_voter_file(df):
    application_log_builder('Cleaning voter file . . .')
    application_log_builder('Cleaning up NaNs . . .')
    df = df.replace(np.nan, '', regex=True)
    application_log_builder('Cleaning up *s . . .')
    df = df.replace('*', '', regex=False)
    application_log_builder('Trimming blank spaces from email addresses . . .')
    df['Email_Address'] = df['Email_Address'].str.strip()
    timeElapsed()
    return df
 
 
def delete_entries_without_emails(df):
    application_log_builder('Deleting entries that do not have email addresses . . .')
    df = df[df['Email_Address'].str.len()>2]
    timeElapsed()
    return df
 
 
#%% MAIN sample-extraction function
def donow():
    global time_start
 
    global allvoterfiles
 
    import_all_voter_files()
 
    allvoterfiles = delete_entries_without_emails(allvoterfiles)
    #allvoterfiles = allvoterfiles[allvoterfiles['Email_Address'].str.len()>2] #replace with delete_entries_without_emails()?
    allvoterfiles = allvoterfiles[columns_keep] # eliminating extraneous columns
    allvoterfiles = allvoterfiles.replace({'County_Code':lookup_county_code_to_name}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
    #allvoterfiles = allvoterfiles.replace({'County_Code':lookup_region}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
    allvoterfiles = allvoterfiles.replace({'Race':lookup_race}) # re-coding Race codes as verbose racial labels using a lookup-dictionary
    allvoterfiles = clean_voter_file(allvoterfiles) # trimming whitespace and NaNs
 
    allvoterfiles = calculate_age(allvoterfiles)
    allvoterfiles = allvoterfiles.drop(columns=['Birth_Date'])
 
    application_log_builder('Cutting a sample of: ' + settings['sample_size'] +' entries')
 
    allvoterfiles_sample = allvoterfiles.sample(n=sample_size)
    timeSampleExtracted = timeNow()
    timeElapsed()
 
    application_log_builder('Exporting sample slices...')
    # I think I should turn the following for-loop into a function - ADD TO TO-DO?
    # I also think I should generalize this donow() function so that ALL counties or just one county can be extracted - ADD TO TO-DO?
    for start in range(0, allvoterfiles_sample.shape[0], slice_size): #can we do allvoterfiles_sample[columns_keep] instead? So that we can export BOTH a SurveyMonkey-ready sample with 8 columns, and another version of the sample sample with ALL columns (for rematching if necessary)?
 
        subset = allvoterfiles_sample.iloc[start:start + slice_size]
 
        #creating numerical labels for each slice
        section = str(int(start/slice_size)+1)
        sectionTotal = str(int(math.ceil(sample_size/slice_size)))
 
        #building filename string
        startingEntry = str(start+1).zfill(5)
        endingEntry = str(int(min(sample_size, start+slice_size))).zfill(5)
        outputFilename = timeSampleExtracted + ' ' + startingEntry + '-' + endingEntry + ' sample of ' + settings['sample_size'] + '.csv'
 
        #output message
        application_log_builder('...exporting section ' + section + ' of ' + sectionTotal)
 
        subset.to_csv(os.path.join(settings['outputdir'],outputFilename), header=False, index=False)
 
    application_log_builder('DONE!\n\nFor your files, look in directory: \n\t'+str(os.path.join(settings['outputdir'])))
    timeElapsed()
 
 
 
 
 
if __name__ == "__main__":
    running_as_console = True
    application_log_builder('Welcome to the SMG Florida Voter-file handler!'+'\n'+'The time is: ' + timeNow())
    load_variables()
 
 
##ASK FOR INPUT
## DO MAGIC STUFF
## Set variable path_voterfile to whatever the user input
##FINALLY CALL DONOW()