#%% Introduction
# -*- coding: utf-8 -*-
'''
put .py file in its own folder (maybe on its separate drive, try usb drive)
'''

#%% TO-DO LIST
'''
TO DO:
13) build executable/GUI using 'gooey' library (Or using 'Python GTK', maybe also with Glade)
16) HIGHER PRIORITY: Along with a set of csv files fit for SurveyMonkey, ALSO output a companion file, using the SAME entries, but including ALL columns. This allows for a DIY rematching in an elegant way for even basic users.
16a) This will require significant changes to the donow() function, where we SAMPLE BEFORE cutting out extraneous columns.
16b) The process will go something like (after importing):
16b.01) Delete entries without emails
16b.02) Recode county codes to county names
16b.03) Recode race codes to race names
16b.04) Trim whitespace and NaNs
16b.05) Calculate age
16b.051) (Delete the line to drop Birth_Date. Should be unnecessary if we drop extraneous columns with columns_keep last.)
16b.06) Cut sample
16b.07) Output all-column version
16b.08) Cull extraneous columns
16b.09) THEN output slices using the current loop programming
16b.10) End the process
17) Low priority: Adjust the use of times in progress feedback. Maybe only use time elapsed ONCE at the end?
18) For end-product application, I want to add more lines to the settings file. Like:
18.1) The settings file should include the following categories:
18.1.1) Settings, original - the original "factory-default" settings for the application. This is so that if the settings ever become broken, this remains a failsafe.
18.1.2) Settings, user defaults - this is to allow ease-of-use on a regular day-to-day basis
18.1.3) Settings, for current project - this is for special cases
18.2) The settings base file should be easy for the GUI package to interpret, making it easy for a user to know what settings they are changing
18.3) The settings base file probably needs, for each 'setting':
18.3.1) The associated variable name
18.3.2) The associated variable type
18.3.3) The GUI text describing the setting
18.3.4) A long description (for the help file) explaining each setting and how to use it
18.3.5) The actual value, string, or list that the setting variable contains
18.4) columns_keep should be an option in the settings
18.5) counties_filter: A list of strings to allow the user to select which counties they want to have in their sample
18.6) races_filter: A list of strings to allow the user to select which races they want to have in their sample
18.7) political_party_filter: A list of strings to allow the user to select which political parties they want to have in their sample
18.8) optional_filter_1_column_name: A string to allow the user to choose a column for optional filter #1. Needs an associated list of strings
18.9) optional_filter_1_list: A list of strings to allow the user to select which values in optional filter #1 they want to have in their sample
18.10) optional_filter_2_column_name: A string to allow the user to choose a column for optional filter #2. Needs an associated list of strings
18.11) optional_filter_2_list: A list of strings to allow the user to select which values in optional filter #2 they want to have in their sample
18.12) optional_filter_3_column_name: A string to allow the user to choose a column for optional filter #3. Needs an associated list of strings
18.13) optional_filter_3_list: A list of strings to allow the user to select which values in optional filter #3 they want to have in their sample
ZZ) LOWEST PRIORITY: Is there a way to EXTREMELY reduce the memory requirements?
ZZa) Maybe count the number of lines per file, building an index to ALL rows available in all files (or maybe all rows with valid emails), then drawing a sample from that index, then drawing / cleaning / appending line by line to new files.
# in the final product, it would be cool to auto-detect counties, races, ages, zip codes, and maybe some other things
4) Find a way to turn the code into a standalone program that works in a VERY user-friendly way, such that even an intern can extract a voter file sample.
4-) SEE THIS: https://docs.python-guide.org/shipping/freezing/
4a) http://www.py2exe.org/index.cgi/Tutorial
4b) Detecting the presence of the voter files, and building a path to them?
4c) Customized sample/slice sizing?
4d) GUI? https://www.tutorialspoint.com/python/python_gui_programming.htm
4e) Folder selection dialog: https://stackoverflow.com/questions/11295917/how-to-select-a-directory-and-store-the-location-using-tkinter-in-python
4f) Help file: using the app, troubleshooting, using the voter file
4g) os.startfile(os.path.join('data', 'voter_file_extract_layout.txt')) opens up a text file copy of the voter file extract layout document
4h) OTHER GUI OPTION: 'Python GTK': https://python-gtk-3-tutorial.readthedocs.io/en/latest/introduction.html
4h.1) https://stackoverflow.com/questions/17698138/python-gui-programming-using-drag-and-drop-also-incorporating-stdout-redirect
4h.2) https://zetcode.com/python/gtk/
4i) Glade: https://glade.gnome.org/
12) an option to save settings. For example, the default 'settings' should be the user-defined source and output directory, as well as the standard set of columns and sample/slice sizes. But it would be nice to load other settings.
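
The reordered pipeline in item 16b (sample BEFORE culling columns, then emit both an all-column companion and a trimmed version) can be sketched as below. This is only an illustration of the intended ordering, not the final donow() code; sample_with_companion, the toy frame, and its column names are made up for the example.

```python
import pandas as pd

def sample_with_companion(df, columns_keep, n):
    # 16b.06) cut the sample first, while every column is still present
    sample_all = df.sample(n=n, random_state=0)
    # 16b.07) the all-column companion (the real app would write this to CSV)
    companion = sample_all.copy()
    # 16b.08) cull extraneous columns for the SurveyMonkey-ready version;
    # both frames share the same index, so DIY rematching stays trivial
    survey_ready = sample_all[columns_keep]
    return companion, survey_ready
```

Because both outputs come from the same sampled frame, the companion file and the SurveyMonkey slices line up row-for-row, which is the whole point of item 16.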
DONE (success) ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1) Try bulk importing without regard to memory usage. Try to overload the computer.
1a) Loop to iterate through voter files and concatenate them.
1b) Even consider concatenating whole files, and not just rows with emails, just to stress-test. Performance in this regard sets the stage for 538-type voter-history analysis.
2) Build random-sample extractor
3) Currently, the email-sensor only eliminates blank and '*' emails. Can we check for and eliminate improperly formatted emails?
3a) searchfor :
#EMAIL SENSOR RESPONSE: As it is, it's probably optimized. I do not capture anything but email addresses (even if improperly formatted), but cleaning improperly formatted email addresses is WAY out of scope.
5) Is garbage collection necessary? https://stackoverflow.com/a/39101287 NOT NECESSARY IN THIS PROGRAM
X) when compiling to an EXE, consider storing in an end-user editable file, like a JSON, to future-proof against column name changes - DONE! See 'data' folder
X) in final product, would it be possible to auto-detect column headers? No, because the voter file has no headers.
6) check read_into_dictionary for \r\n compatibility
7) slice & sample output loop needs refactoring and cleaning
7a) ( for start in range(0, allvoterfiles_sample.shape[0], slice_size): )
X) all_voter_files progress tracker: see the python file: 'progress tracker example [see lines 128 to 150].py'
8) import_all_voter_files needs refactoring
8a) consider cleaning / or better defining how import_all_voter_files() uses globals or the 'allvoterfiles' variable
8b) the goal is to load all voter files into a single dataframe for editing and for setting up the donow() algorithm
9) clean_voter_file - does this need to be factored out?
10) date_voterfile & path_voterfile & output dir should be treated the same as the lookup dictionaries, so that the final GUI app can ask the user to define them on first launch, then treat the user-defined variables as defaults from then on.
11) Clean up use of single- vs double-quotes in strings - DONE
14) fix import_all_voter_files() log outputs so that it works when executing the donow() function
15) add output feedback on each county being imported
'''
'''
This program is intended to extract a random sample of emails from the Florida voter file for use in email surveys.
'''

#%% Importing packages, declaring variables and functions
import datetime
import os
import pandas as pd
import glob
import numpy as np
import math

pd.set_option('display.max_columns', None) # allows better error-checking

# initializing lists and dictionaries. Please see these files in the 'data' folder in case the column names change.
running_as_console = False
columns_keep = []
header_voterfile = []
lookup_sachsregions = {}
lookup_county_code_to_name = {}
lookup_race = {}

# is this 'settings' variable necessary with the settings file?
settings = {'path_voterfile': 'C:/Users/JohnEvans/Documents/20220412_VoterDetail/',
            'outputdir': 'C:/Users/JohnEvans/Documents/Voterfile_Samples/',
            'sample_size': '20000',
            'slice_size': '5000'}
# see settings file for sample_size, slice_size, and directory paths
sample_size = 20000
slice_size = 5000
application_log = []
allvoterfiles = pd.DataFrame({})
time_start = datetime.datetime.now()

def timeNow():
    return str(datetime.datetime.now().strftime('%Y-%m-%d T%H;%M;%S'))

def timeElapsed():
    global time_start
    time_since_start = datetime.datetime.now() - time_start
    application_log_builder('Time end: ' + timeNow() + '\n' + 'Time elapsed: ' + str(time_since_start))

def application_log_builder(string):
    global application_log
    application_log.append(string)
    if running_as_console:
        print(string)
    else:
        print(string) # update tkinter widget when this program moves to UI-based app

def save_application_log():
    reportFilename = 'Sample Extractor Report ' + timeNow() + '.txt'
    with open(os.path.join(settings['outputdir'], reportFilename), 'a') as text:
        text.write('\n'.join(application_log))

def read_into_dictionary(path):
    # this is to allow custom data to be stored externally, one 'key:value' pair per line
    temp_dict = {}
    with open(path, 'r') as text:
        for line in text:
            delimiter = line.index(':')
            key = line.rstrip('\r\n')[:delimiter]
            value = line.rstrip('\r\n')[delimiter+1:]
            temp_dict[key] = value
    return temp_dict

def load_variables():
    # loading external data
    global columns_keep
    global header_voterfile
    global lookup_sachsregions
    global lookup_county_code_to_name
    global lookup_race
    global settings
    global sample_size # without these two globals, the int conversions below
    global slice_size  # would only rebind local names and never take effect
    with open(os.path.join('data', 'columns_keep.txt')) as f:
        columns_keep = [word.rstrip() for word in f]
    with open(os.path.join('data', 'header_voterfile.txt')) as f:
        header_voterfile = [word.rstrip() for word in f]
    lookup_sachsregions = read_into_dictionary(os.path.join('data', 'lookup_sachsregions.txt'))
    lookup_county_code_to_name = read_into_dictionary(os.path.join('data', 'lookup_county_code_to_name.txt'))
    lookup_race = read_into_dictionary(os.path.join('data', 'lookup_race.txt'))
    settings = read_into_dictionary(os.path.join('data', 'settings.txt'))
    sample_size = int(settings['sample_size'])
    slice_size = int(settings['slice_size'])

def show_vf_layout():
    # opening a text copy of the voter file extract layout (data dictionary) in the installed text editor.
    os.startfile(os.path.join('data', 'voter_file_extract_layout.txt'))

def filter_rows(df, column, filter_by): # renamed from filter() to avoid shadowing the built-in
    # this is for filtering a sample by a set of counties, races, etc.
    # for continuous intervals, see: https://stackoverflow.com/questions/30624599/pandas-isin-function-for-continuous-intervals
    if not isinstance(filter_by, list):
        filter_by = [filter_by]
    filteredDF = df.loc[df[column].isin(filter_by)]
    return filteredDF

def import_all_voter_files():
    # https://stackoverflow.com/a/36416258
    # Note, to get counts by category, use: allvoterfiles['Party_Affiliation'].value_counts()
    global allvoterfiles
    load_variables()
    application_log_builder('Reading all voter files from directory: \n' + settings['path_voterfile'] + '\n' + 'Time start: ' + timeNow())
    path = settings['path_voterfile']
    all_files = glob.glob(os.path.join(path, '*.txt')) # all_files is a list of all files
    # looping through all_files to add each county's voter file as an element in a list - a list of dataframes
    # on second thought - I think I want a message output signalling each county's file being worked through. It should just be a "print(f)" put in the right place.
    #df_from_each_file = (pd.read_csv(f,
    #                                 sep='\t',
    #                                 header=None,
    #                                 names=header_voterfile,
    #                                 dtype=str) for f in all_files)
    # Is there any advantage to making Voter_ID the index?
    df_from_each_file = []
    file_sizes = []
    progressbar_value = 0
    for s in all_files:
        file_sizes.append(os.path.getsize(s))
    print(timeNow() + ": " + "{: >8.2f}".format(progressbar_value/sum(file_sizes)*100) + "% beginning collection") #sanitycheck
    # collecting all voter files
    for i, f in enumerate(all_files): # enumerate avoids repeated O(n) all_files.index(f) lookups
        df_from_each_file.append(pd.read_csv(f,
                                             sep='\t',
                                             header=None,
                                             names=header_voterfile,
                                             dtype=str))
        progressbar_value = sum(file_sizes[0:i+1])
        print(timeNow() + ": "
              + "{: >8.2f}".format(int(progressbar_value/sum(file_sizes)*10000)/100) + "% "
              + "{: >8.2f}".format(file_sizes[i]/1048576) + "MB "
              + f) #sanitycheck
    allvoterfiles = pd.concat(df_from_each_file, ignore_index=True)
    allvoterfiles.info() #SANITYCHECK
    timeElapsed()

# QUESTION: Why did calculate_age() throw an error PRIOR to dropping columns using df[columns_keep]?
# ANSWER (likely): blank Birth_Date strings do not parse as dates; errors='coerce' turns them into NaT instead of raising.
def calculate_age(df):
    application_log_builder('Calculating ages...')
    df['Birth_Date'] = pd.to_datetime(df['Birth_Date'], errors='coerce') # coerce pushes past any entries that do not resemble dates
    today = datetime.datetime.now()
    df['Age'] = today - df['Birth_Date']
    df['Age'] = df['Age'].dt.days
    df['Age'] = np.floor(df['Age']/365.2425).astype('Int64') # nullable Int64: plain 'int' cannot hold the NaNs left by coerced dates
    timeElapsed()
    return df

def clean_voter_file(df):
    application_log_builder('Cleaning voter file . . .')
    application_log_builder('Cleaning up NaNs . . .')
    df = df.replace(np.nan, '', regex=True)
    application_log_builder('Cleaning up *s . . .')
    df = df.replace('*', '', regex=False)
    application_log_builder('Trimming blank spaces from email addresses . . .')
    df['Email_Address'] = df['Email_Address'].str.strip()
    timeElapsed()
    return df

def delete_entries_without_emails(df):
    application_log_builder('Deleting entries that do not have email addresses . . .')
    df = df[df['Email_Address'].str.len() > 2]
    timeElapsed()
    return df

#%% MAIN sample-extraction function
def donow():
    global time_start
    global allvoterfiles
    import_all_voter_files()
    allvoterfiles = delete_entries_without_emails(allvoterfiles)
    allvoterfiles = allvoterfiles[columns_keep] # eliminating extraneous columns
    allvoterfiles = allvoterfiles.replace({'County_Code': lookup_county_code_to_name}) # re-coding County_Code as county names using a lookup-dictionary
    #allvoterfiles = allvoterfiles.replace({'County_Code': lookup_sachsregions}) # re-coding County_Code as Sachs Regions using a lookup-dictionary
    allvoterfiles = allvoterfiles.replace({'Race': lookup_race}) # re-coding Race codes as verbose racial labels using a lookup-dictionary
    allvoterfiles = clean_voter_file(allvoterfiles) # trimming whitespace and NaNs
    allvoterfiles = calculate_age(allvoterfiles)
    allvoterfiles = allvoterfiles.drop(columns=['Birth_Date'])
    application_log_builder('Cutting a sample of: ' + settings['sample_size'] + ' entries')
    allvoterfiles_sample = allvoterfiles.sample(n=sample_size)
    timeSampleExtracted = timeNow()
    timeElapsed()
    application_log_builder('Exporting sample slices...')
    # I think I should turn the following for-loop into a function - ADD TO TO-DO?
    # I also think I should generalize this donow() function so that ALL counties or just one county can be extracted - ADD TO TO-DO?
    # can we do allvoterfiles_sample[columns_keep] inside this loop instead? So that we can export BOTH a SurveyMonkey-ready sample with 8 columns, and another version of the sample with ALL columns (for rematching if necessary)?
    for start in range(0, allvoterfiles_sample.shape[0], slice_size):
        subset = allvoterfiles_sample.iloc[start:start + slice_size]
        # creating numerical labels for each slice
        section = str(int(start/slice_size) + 1)
        sectionTotal = str(int(math.ceil(sample_size/slice_size)))
        # building filename string
        startingEntry = str(start + 1).zfill(5)
        endingEntry = str(int(min(sample_size, start + slice_size))).zfill(5)
        outputFilename = timeSampleExtracted + ' ' + startingEntry + '-' + endingEntry + ' sample of ' + settings['sample_size'] + '.csv'
        # output message
        application_log_builder('...exporting section ' + section + ' of ' + sectionTotal)
        subset.to_csv(os.path.join(settings['outputdir'], outputFilename), header=False, index=False)
    application_log_builder('DONE!\n\nFor your files, look in directory: \n\t' + str(os.path.join(settings['outputdir'])))
    timeElapsed()

if __name__ == "__main__":
    running_as_console = True
    application_log_builder('Welcome to the SMG Florida Voter-file handler!' + '\n' + 'The time is: ' + timeNow())
    load_variables()
    ## ASK FOR INPUT
    ## DO MAGIC STUFF
    ## Set variable path_voterfile to whatever the user input
    ## FINALLY CALL donow()
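
#%% Slice-export helper sketch
'''
A hedged sketch of factoring the slice-export for-loop out of donow() (TO-DO
item 7). slice_frame is an illustrative name, not part of this program; it
yields each slice together with its section labels, so the caller only has to
build the filename and call to_csv.

```python
import math
import pandas as pd

def slice_frame(df, slice_size):
    # yield (section_number, total_sections, subset) for each slice of df
    total = math.ceil(df.shape[0] / slice_size)
    for start in range(0, df.shape[0], slice_size):
        yield start // slice_size + 1, total, df.iloc[start:start + slice_size]
```

With this helper, the export loop in donow() would collapse to a few lines,
and the same generator could serve both the trimmed and the all-column export
passes planned in item 16.
'''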