#%% Introduction
# -*- coding: utf-8 -*-
'''
Put this .py file in its own folder (maybe on a separate drive; try a USB drive).
'''
#%% TO-DO LIST
'''
TO DO:
13) build executable/GUI using 'gooey' library (Or using 'Python GTK', maybe also with Glade)
16) HIGHER PRIORITY: Along with the set of CSV files fit for SurveyMonkey, ALSO output a companion file using the SAME entries but including ALL columns. This lets even basic users do a DIY rematch in an elegant way.
16a) This will require significant changes to the donow() function, where we SAMPLE BEFORE cutting out extraneous columns.
16b) The process will go something like (after importing):
16b.01) Delete entries without emails
16b.02) Recode county codes to county names
16b.03) Recode race codes to race names
16b.04) Trim whitespace and NaNs
16b.05) Calculate age
16b.051) (Delete the line to drop Birth_Date. Should be unnecessary if we drop extraneous columns with columns_keep last.)
16b.06) Cut sample
16b.07) Output all-column version
16b.08) Cull extraneous columns
16b.09) THEN output slices using the current loop programming
16b.10) End the process
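A minimal sketch of steps 16b.06 through 16b.09 on a hypothetical mini-frame (column names here are illustrative; the real ones come from header_voterfile.txt and columns_keep.txt):

```python
import pandas as pd

# hypothetical mini voter file with one extraneous column
df = pd.DataFrame({'Voter_ID': ['1', '2', '3', '4'],
                   'Email_Address': ['a@x.com', 'b@x.com', 'c@x.com', 'd@x.com'],
                   'Precinct': ['P1', 'P2', 'P3', 'P4']})
columns_keep = ['Voter_ID', 'Email_Address']  # assumed trimmed-down column set

# 16b.06) cut the sample BEFORE culling columns
sample = df.sample(n=2, random_state=0)
# 16b.07) all-column companion version (to_csv with no path returns a string)
companion_csv = sample.to_csv(index=False)
# 16b.08) cull extraneous columns
sample = sample[columns_keep]
# 16b.09) SurveyMonkey-ready version, to be sliced as in the current loop
surveymonkey_csv = sample.to_csv(header=False, index=False)
```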
17) Low priority: Adjust the use of times in progress feedback. Maybe only use time elapsed ONCE at the end?
18) For end-product application, I want to add more lines to the settings file. Like:
18.1) The settings file should include the following categories:
18.1.1) Settings, original - the original "factory-default" settings for the application. This is so that if the settings ever become broken, this remains a failsafe.
18.1.2) Settings, user defaults - this is to allow ease of use on a regular, day-to-day basis
18.1.3) Settings, for current project - this is for special cases
18.2) The settings base file should be easy for the GUI package to interpret, making it easy for a user to know what settings they are changing
18.3) The settings base file probably needs, for each 'setting':
18.3.1) The associated variable name
18.3.2) the associated variable type
18.3.3) The GUI text describing the setting
18.3.4) A long description (for the help file) explaining each setting and how to use it
18.3.5) The actual value, string, or list that the setting variable contains
18.4) columns_keep should be an option in the settings
18.5) counties_filter: A list of strings to allow the user to select which counties they want to have in their sample
18.6) races_filter: A list of strings to allow the user to select which races they want to have in their sample
18.7) political_party_filter: A list of strings to allow the user to select which political parties they want to have in their sample
18.8) optional_filter_1_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
18.9) optional_filter_1_list: A list of strings to allow the user to select which values in optional filter #1 they want to have in their sample
18.10) optional_filter_2_column_name: A string to allow the user to choose a column for optional filter #2. Needs an associated list of strings
18.11) optional_filter_2_list: A list of strings to allow the user to select which values in optional filter #2 they want to have in their sample
18.12) optional_filter_3_column_name: A string to allow the user to choose a column for optional filter #3. Needs an associated list of strings
18.13) optional_filter_3_list: A list of strings to allow the user to select which values in optional filter #3 they want to have in their sample
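One possible shape for a single entry in the settings base file, covering fields 18.3.1 through 18.3.5 (field names and values here are illustrative, not final):

```python
# illustrative only - field names are not final
example_setting = {
    'variable_name': 'counties_filter',   # 18.3.1
    'variable_type': 'list_of_strings',   # 18.3.2
    'gui_text': 'Counties to include',    # 18.3.3
    'help_text': 'Limit the sample to these counties; leave empty for all.',  # 18.3.4
    'value': ['ALA', 'BAK', 'BAY'],       # 18.3.5 (example county codes)
}
```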
ZZ) LOWEST PRIORITY: Is there a way to EXTREMELY reduce the memory requirements?
ZZa) Maybe count the number of lines per file,
building an index to ALL rows available in all files
(or maybe all rows with valid emails),
then drawing a sample from that index,
then drawing / cleaning / appending line by line to new files.
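A rough sketch of that low-memory approach, run against throwaway temp files rather than the real voter files:

```python
import os
import random
import tempfile

# two tiny stand-in "voter files"
tmpdir = tempfile.mkdtemp()
paths = []
for i, rows in enumerate([['a@x.com', 'b@x.com'], ['c@x.com', 'd@x.com', 'e@x.com']]):
    path = os.path.join(tmpdir, 'county%d.txt' % i)
    with open(path, 'w') as f:
        f.write('\n'.join(rows) + '\n')
    paths.append(path)

# 1) index every (file, line-number) pair without holding file contents in memory
index = []
for path in paths:
    with open(path) as f:
        for lineno, _ in enumerate(f):
            index.append((path, lineno))

# 2) draw the sample from the lightweight index
picks = set(random.sample(index, 2))

# 3) stream each file again, appending only the sampled rows to the output
out_path = os.path.join(tmpdir, 'sample.txt')
with open(out_path, 'w') as out:
    for path in paths:
        with open(path) as f:
            for lineno, line in enumerate(f):
                if (path, lineno) in picks:
                    out.write(line)
```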
# in the final product, it would be cool to auto-detect counties, races, ages, zip codes, and maybe some other things
4) Find a way to turn the code into a standalone program that works in a VERY user-friendly way, such that even an intern can extract a voter file sample.
4-) SEE THIS: https://docs.python-guide.org/shipping/freezing/
4a) http://www.py2exe.org/index.cgi/Tutorial
4b) Detecting the presence of the voter files, and building a path to them?
4c) Customized sample/slice sizing?
4d) GUI? https://www.tutorialspoint.com/python/python_gui_programming.htm
4e) Folder selection dialog: https://stackoverflow.com/questions/11295917/how-to-select-a-directory-and-store-the-location-using-tkinter-in-python
4f) Help file: using the app, troubleshooting, using the voter file
4g) os.startfile(os.path.join('data', 'voter_file_extract_layout.txt')) opens up a text file copy of the voter file extract layout document
4h) OTHER GUI OPTION: 'Python GTK': https://python-gtk-3-tutorial.readthedocs.io/en/latest/introduction.html
4h.1) https://stackoverflow.com/questions/17698138/python-gui-programming-using-drag-and-drop-also-incorporating-stdout-redirect
4h.2) https://zetcode.com/python/gtk/
4i) Glade: https://glade.gnome.org/
12) an option to save settings. For example, the default 'settings' should be the user-defined source and output directory, as well as the standard set of columns and sample/slice sizes. But it would be nice to load other settings.
DONE (success) ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1) Try bulk importing without regard to memory usage. Try to overload the computer.
1a) Loop to iterate through voter files and concatenate them.
1b) Even consider concatenating whole-files, and not just rows with emails, just to stress-test. Performance in this regard sets the stage for 538-type voter-history analysis.
2) Build random-sample extractor
3) Currently, the email-sensor only eliminates blank and '*' emails. Can we check for and eliminate improperly formatted emails?
3a) searchfor : #EMAIL SENSOR
RESPONSE: As it is, it's probably optimized. I do not capture anything but email addresses (even if improperly formatted) but cleaning improperly formatted email addresses is WAY out of scope.
5) Is garbage collection necessary? https://stackoverflow.com/a/39101287 NOT NECESSARY IN THIS PROGRAM
X) When compiling to an EXE, consider storing the column names in an end-user-editable file, like a JSON, to future-proof against column-name changes
DONE! See 'data' folder
X) in final product, would it be possible to auto-detect column headers?
No, because the voter file has no headers.
6) check read_into_dictionary for \r\n compatibility
7) slice & sample output loop needs refactoring and cleaning
7a) ( for start in range(0, allvoterfiles_sample.shape[0], slice_size): )
X) all_voter_files progress tracker: see the python file: 'progress tracker example [see lines 128 to 150].py'
8) import_all_voter_files needs refactoring
8a) consider cleaning / or better defining how import_all_voter_files() uses globals or 'allvoterfiles' variable
8b) the goal is to load all voter files into a single dataframe for editing and for setting up the donow() algorithm
9) clean_voter_file - does this need to be factored out?
10) date_voterfile & path_voterfile & output dir should be treated the same as the lookup dictionaries, so that the final GUI app can ask the user to define them on first launch, then treat the user-defined variables as defaults from then on.
11) Clean up use of single- vs double-quotes in strings - DONE
14) fix import_all_voter_files() log outputs so that it works when executing the donow() function
15) add output feedback on each county being imported
'''
''' This program is intended to extract a random-sample of emails from the Florida voter file for use in email surveys. '''
#%% Importing packages, declaring variables and functions
import datetime
import os
import pandas as pd
import glob
import numpy as np
import math
pd.set_option('display.max_columns', None) #allows better error-checking
# initializing lists and dictionaries. Please see these files in the 'data' folder in case the column names change.
running_as_console = False
columns_keep = []
header_voterfile = []
lookup_sachsregions = {}
lookup_county_code_to_name = {}
lookup_race = {}
#is this 'settings' variable necessary with the settings file?
settings = {'path_voterfile':'C:/Users/JohnEvans/Documents/20220412_VoterDetail/',
'outputdir':'C:/Users/JohnEvans/Documents/Voterfile_Samples/',
'sample_size':'20000',
'slice_size':'5000'} #see settings file for sample_size, slice_size, and directory paths
sample_size = 20000
slice_size = 5000
application_log = []
allvoterfiles = pd.DataFrame({})
time_start = datetime.datetime.now()
def timeNow():
return str(datetime.datetime.now().strftime('%Y-%m-%d T%H;%M;%S'))
def timeElapsed():
global time_start
time_since_start = datetime.datetime.now() - time_start
application_log_builder('Time end: ' + timeNow() + '\n' + 'Time elapsed: ' + str(time_since_start))
def application_log_builder(string):
global application_log
application_log.append(string)
if running_as_console:
print(string)
else:
print(string) # update tkinter widget when this program moves to UI-based app
def save_application_log():
reportFilename = 'Sample Extractor Report '+timeNow()+'.txt'
with open(os.path.join(settings['outputdir'],reportFilename),'a') as text:
text.write('\n'.join(application_log))
def read_into_dictionary(path): #this is to allow custom data to be stored externally
temp_dict = {}
with open(path,'r') as text:
for line in text:
delimiter = line.index(':')
key = line.rstrip('\r\n')[:delimiter]
value = line.rstrip('\r\n')[delimiter+1:]
temp_dict[key] = value
return temp_dict
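# For reference: a settings line splits on the FIRST colon only, so a Windows-style
# path value keeps its own ':'. The same parsing logic, shown in isolation:

```python
# same split-on-first-colon logic as read_into_dictionary(), standalone
line = 'path_voterfile:C:/Users/JohnEvans/Documents/20220412_VoterDetail/\n'
delimiter = line.index(':')  # index() finds the FIRST ':' only
key = line.rstrip('\r\n')[:delimiter]
value = line.rstrip('\r\n')[delimiter + 1:]
# key is 'path_voterfile'; value keeps the 'C:' drive prefix intact
```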
def load_variables(): #loading external data
global columns_keep
global header_voterfile
global lookup_sachsregions
global lookup_county_code_to_name
global lookup_race
    global settings
    global sample_size # without these two globals, the assignments below would only create locals
    global slice_size
with open(os.path.join('data', 'columns_keep.txt')) as f:
columns_keep = [word.rstrip() for word in f]
with open(os.path.join('data', 'header_voterfile.txt')) as f:
header_voterfile = [word.rstrip() for word in f]
lookup_sachsregions = read_into_dictionary(os.path.join('data', 'lookup_sachsregions.txt'))
lookup_county_code_to_name = read_into_dictionary(os.path.join('data', 'lookup_county_code_to_name.txt'))
lookup_race = read_into_dictionary(os.path.join('data', 'lookup_race.txt'))
settings = read_into_dictionary(os.path.join('data', 'settings.txt'))
sample_size = int(settings['sample_size'])
slice_size = int(settings['slice_size'])
def show_vf_layout(): #opening a text copy of the voter file extract layout (data dictionary) in the installed text editor.
os.startfile(os.path.join('data', 'voter_file_extract_layout.txt'))
def filter(df, column, filter_by): # this is for filtering a sample by a set of counties, races, etc. (note: this shadows Python's built-in filter())
# for continuous intervals, see: https://stackoverflow.com/questions/30624599/pandas-isin-function-for-continuous-intervals
if type(filter_by) is not list:
filter_by = [filter_by]
filteredDF = df.loc[df[column].isin(filter_by)]
return filteredDF
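# Usage sketch for the filter logic on a toy frame (illustrative values only;
# this repeats the same isin call the function makes, so it runs standalone):

```python
import pandas as pd

# toy stand-in for the voter file
df = pd.DataFrame({'County_Code': ['DAD', 'BRO', 'ORA'],
                   'Voter_ID': ['1', '2', '3']})
filter_by = 'DAD'              # a bare string...
if type(filter_by) is not list:
    filter_by = [filter_by]    # ...is wrapped into a one-item list, as in filter()
filtered = df.loc[df['County_Code'].isin(filter_by)]
# filtered holds only the 'DAD' row
```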
def import_all_voter_files(): # https://stackoverflow.com/a/36416258
#Note, to get counts by category, use: allvoterfiles['Party_Affiliation'].value_counts()
global allvoterfiles
load_variables()
application_log_builder('Reading all voter files from directory: \n' + settings['path_voterfile'] + '\n' + 'Time start: ' + timeNow())
path = settings['path_voterfile']
all_files = glob.glob(os.path.join(path, '*.txt')) # all_files is a list of all files
# looping through all_files to add each county's voterfile as an element in a list - a list of dataframes
    #on second thought - I think I want a message output signalling each county's file being worked through. It should just be a "print(f)" put in the right place.
#df_from_each_file = (pd.read_csv(f,
# sep='\t',
# header=None,
# names=header_voterfile,
# dtype=str) for f in all_files) # Is there any advantage to making Voter_ID the index?
df_from_each_file = []
file_sizes = []
progressbar_value = 0
for s in all_files:
file_sizes.append(os.path.getsize(s))
print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*100))+"% beginning collection") #sanitycheck
# collecting all voter files
for f in all_files:
df_from_each_file.append(pd.read_csv(f,
sep='\t',
header=None,
names=header_voterfile,
dtype=str))
progressbar_value = sum(file_sizes[0:all_files.index(f)+1])
print(timeNow() + ": " + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*10000)/100)+"% " + "{: >8.2f}".format(file_sizes[all_files.index(f)]/1048576) + "MB " + f) #sanitycheck
allvoterfiles = pd.concat(df_from_each_file, ignore_index=True)
allvoterfiles.info() #SANITYCHECK
timeElapsed()
#QUESTION: Why does the calculate_age() function throw an error PRIOR to dropping columns using df[columns_keep]? Why doesn't it work for the start?
def calculate_age(df):
application_log_builder('Calculating ages...')
    df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='coerce') # errors='coerce' turns any entry that does not resemble a date into NaT instead of raising - the likely fix for the inconsistency noted above
    today = datetime.datetime.now()
    df['Age'] = (today - df['Birth_Date']).dt.days
    df['Age'] = np.floor(df['Age']/365.2425).astype('Int64') # nullable Int64, so coerced NaT birth dates become <NA> instead of crashing a plain astype('int')
timeElapsed()
return df
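# The errors='coerce' variant suggested above behaves like this on a malformed
# date (fixed reference date so the numbers are reproducible; 'Int64' is pandas'
# nullable integer type):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Birth_Date': ['1990-01-15', 'not a date']})
df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='coerce')  # bad entry becomes NaT
today = pd.Timestamp('2024-06-01')  # fixed date, purely for a reproducible example
df['Age'] = np.floor((today - df['Birth_Date']).dt.days / 365.2425).astype('Int64')
# df['Age'] holds 34 for the valid row and <NA> for the coerced one
```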
#def calculate_age():
# try:
# return dt.datetime.strptime(x, '%m/%d/%Y')
# except:
# return pd.NaT
##then use: df['Birth_Date'].apply(calculate_age)
#def calculate_age(df):
# df.loc[df['Birth_Date'] !=float("NaN"), 'Birth_Date'] = pd.to_datetime(df['Birth_Date'])
def clean_voter_file(df):
application_log_builder('Cleaning voter file . . .')
application_log_builder('Cleaning up NaNs . . .')
df = df.replace(np.nan, '', regex=True)
application_log_builder('Cleaning up *s . . .')
df = df.replace('*', '', regex=False)
application_log_builder('Trimming blank spaces from email addresses . . .')
df['Email_Address'] = df['Email_Address'].str.strip()
timeElapsed()
return df
def delete_entries_without_emails(df):
application_log_builder('Deleting entries that do not have email addresses . . .')
df = df[df['Email_Address'].str.len()>2]
timeElapsed()
return df
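# Why 'length > 2': after cleaning, bogus entries collapse to '' or a stray
# character or two, while any real address is longer. Toy illustration:

```python
import pandas as pd

df = pd.DataFrame({'Email_Address': ['a@x.com', '', ' ', 'ab']})
df['Email_Address'] = df['Email_Address'].str.strip()  # as in clean_voter_file()
kept = df[df['Email_Address'].str.len() > 2]
# only 'a@x.com' survives the length check
```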
#%% MAIN sample-extraction function
def donow():
global time_start
global allvoterfiles
import_all_voter_files()
allvoterfiles = delete_entries_without_emails(allvoterfiles)
allvoterfiles = allvoterfiles[columns_keep] # eliminating extraneous columns
    allvoterfiles = allvoterfiles.replace({'County_Code':lookup_county_code_to_name}) # re-coding County_Code values as county names using a lookup-dictionary
#allvoterfiles = allvoterfiles.replace({'County_Code':lookup_region}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
allvoterfiles = allvoterfiles.replace({'Race':lookup_race}) # re-coding Race codes as verbose racial labels using a lookup-dictionary
allvoterfiles = clean_voter_file(allvoterfiles) # trimming whitespace and NaNs
allvoterfiles = calculate_age(allvoterfiles)
allvoterfiles = allvoterfiles.drop(columns=['Birth_Date'])
application_log_builder('Cutting a sample of: ' + settings['sample_size'] +' entries')
allvoterfiles_sample = allvoterfiles.sample(n=sample_size)
timeSampleExtracted = timeNow()
timeElapsed()
application_log_builder('Exporting sample slices...')
# I think I should turn the following for-loop into a function - ADD TO TO-DO?
# I also think I should generalize this donow() function so that ALL counties or just one county can be extracted - ADD TO TO-DO?
for start in range(0, allvoterfiles_sample.shape[0], slice_size): #can we do allvoterfiles_sample[columns_keep] instead? So that we can export BOTH a SurveyMonkey-ready sample with 8 columns, and another version of the sample sample with ALL columns (for rematching if necessary)?
subset = allvoterfiles_sample.iloc[start:start + slice_size]
#creating numerical labels for each slice
section = str(int(start/slice_size)+1)
sectionTotal = str(int(math.ceil(sample_size/slice_size)))
#building filename string
startingEntry = str(start+1).zfill(5)
endingEntry = str(int(min(sample_size, start+slice_size))).zfill(5)
outputFilename = timeSampleExtracted + ' ' + startingEntry + '-' + endingEntry + ' sample of ' + settings['sample_size'] + '.csv'
#output message
application_log_builder('...exporting section ' + section + ' of ' + sectionTotal)
subset.to_csv(os.path.join(settings['outputdir'],outputFilename), header=False, index=False)
application_log_builder('DONE!\n\nFor your files, look in directory: \n\t'+str(os.path.join(settings['outputdir'])))
timeElapsed()
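# The slicing arithmetic used in the export loop above, isolated on toy numbers
# (10 rows, slices of 4) so the filename labels are easy to check:

```python
import math
import pandas as pd

sample = pd.DataFrame({'Voter_ID': [str(i) for i in range(10)]})
slice_size = 4
sample_size = len(sample)
section_total = math.ceil(sample_size / slice_size)  # 3 slices: 4 + 4 + 2 rows
labels = []
for start in range(0, sample_size, slice_size):
    subset = sample.iloc[start:start + slice_size]
    starting_entry = str(start + 1).zfill(5)
    ending_entry = str(min(sample_size, start + slice_size)).zfill(5)
    labels.append(starting_entry + '-' + ending_entry)
# labels: ['00001-00004', '00005-00008', '00009-00010']
```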
if __name__ == "__main__":
running_as_console = True
application_log_builder('Welcome to the SMG Florida Voter-file handler!'+'\n'+'The time is: ' + timeNow())
load_variables()
##ASK FOR INPUT
## DO MAGIC STUFF
## Set variable path_voterfile to whatever the user input
##FINALLY CALL DONOW()