#!/usr/bin/env python """ BeginDate:20050824 CurrentRevisionDate:20050824 Development Version : xayastats 001 Release Version: pre-release Filename: xayastats.py Filedescription: xayastats.py contains basic statistics and exploratory data analysis primitives. Author(s): Mishtu Banerjee Contact: mishtu@harmeny.com Copyright: The Authors License: Distributed under MIT License [http://opensource.org/licenses/mit-license.html] Environment: Programmed and tested under Python 2.3.3 on Windows 2000. Dependencies: Python Interpreter and base libraries. xayacore ============================================== XAYAstats_001 A Toolkit for Model Driven Data Analysis ============================================== XAYAstats adds basic statistics capabilities to the data processing capabilities of XAYAcore. XAYAstats adds control functions that call on the lower level genStats functions that operate on lists, so they can work with XAYA format data sets. This module does not use Numeric -- to limit dependencies It simply uses the Python built in fns. It uses a functional style, starting with simple stat functions, and building more complex stats from them like in the language R or in MatLab. Send bugs, fixes, suggestions to mishtu@harmeny.com (Thanks) Description of Functionality of XAYAstats The basic unit in XAYAstats is a 'Dataset'.. While primitives in XAYAstats operate on lists, the "control" functions in XAYAstats operate on Datasets (from which they extract and compose lists as needed. genStats assumes it's input is a "dataset". A dataset has 3 core components: 'VARIABLES', 'DATA', 'VARTYPES' There are several optional components: GROUPS, SORTS, VARDICTIONARY. Each of these components is represented as a XAYAcore format graph. A Dataset is a collection of these graphs, held in a Python dictionary. a dataset is a dictionary that contains several XAYA format graphs as its values:('DATA' and 'VARIABLES' are required, others are optional) 'DATA': keys are "autonumbers" representing a line of data, and values are the tuple of data 'VARIABLES': keys are the variable names, and values are the index position of that variable in the tuples in 'DATA'. 'VARTYPES'. keys are the variable names and values are the data type. There are five data types: 'INTEGER' -- a whole number 'NUMERICAL' -- any number with a decimal 'CNUMBER' -- a complex number -- not currently supported 'CATEGORICAL' -- a series of labels 'TEXT' -- free text. 'VARTYPES' in this case are a set of data-analysis centric categories. Additional types would exist in a database centric, or programming language centric characterization of data types. 'DATA' and 'VARIABLES' are mandatory keys, 'VARTYPES' is optional. In the abscence of explicit type information, data is cast based on a series of tests (see the function testTypes) Example of the structure of a XAYAstats dataset dataset = {'VARIABLES' : {'Var1' : [0], 'Var2': [1], 'Var3': [ 2] } 'DATA' : {1 : [1,2,3], 2 : [4, 5, 6]} 'TYPES' : {'Var1' : 'CATEGORICAL', 'Var2' : 'NUMERICAL', 'Var3' : 'INTEGER'} } Description of Functionality of XAYAstats data analysis primitives: These 'primitives' for data analysis generate basic statistics. They can be called individually. They can be combined. If you have a module to access dbs and files and represent the data as lists, then this package can calculate all basic stats on those lists. Probably 80 % of the stats people calculate are these very simple stats. UNIVARIATE STATISTICAL FUNCTIONS: getStats -- Calculates univariate statistics for a dataset based on a user-supplied list of selected statistics. Returns a XAYAstats format dataset. getSummaryStats -- Calculates a pre-defined list of summary statistics for a dataset. Returns results in a dictionary format that is easy to display with the pretty-print function. callStats -- Calls the individual statistics functions and returns a XAYAstats format dataset. Is itself called by getStats. Can call any of the stats primitives in the list below from minstat to cvstat. Can not call summarystat function. getHistograph -- returns a histograph from a dataset. See the histograph function for more details. minstat -- minimum maxstat -- maximum rangestat -- absolute range sumstat -- sum of values sumsqstat -- sum of squares of values meanstat -- mean medianstat -- median varstat -- variance of a sample sdstat -- standard deviation of a sample cvstat -- coefficient of variation -- biologists like it, for reasons that were never clear to me summarystat -- provides dictionary of basic summary statistics for a sample (not in XAYAstats format, but easy to view with pretty-print function) STATS VARIABLE TRANSFORMATIONS: dlist -- list of deviations from mean value zlist -- list of z scores -- deviation from mean value divided by standard deviation rlist -- dev from min divided by range -- list between 0 and 1 DISTRIBUTIONS: histograph -- Counts occurances of each unique item in a list and represents the results as a XAYAcore format graph. Description of functionality of XAYAstats dataset handling functions: Functions that operate on XAYAstats format Datasets getDataset : Creates a XAYAstats format dataset (calling readFile and getData) viewData : Allows one to view the Data components of a Dataset viewVariables: Allows one to view the Variables component of a Dataset saveData: Saves the Data component of a Dataset to a XAYAcore style text file. saveVAriable: Saves the Variable component of a Dataset to a XAYAcore style text file. getDatalist: Extracts a Variable from a Dataset -- does not cast to a particular variable type. (can be treated as either CATEGORICAL or TEXT) getNumberslist: Extracts a Variable as a list of numbers from a Dataset via castDatalistToNumerical getIntegerslist: Extracts a Variable as a list of integers. Functions that operate on a File readFile : Reads csv file into dictionary keyed to file line numbers. Headings at line 0. getData: Cleans up dictionary, strips off extra spaces, turns the dictionary value into a list of numbers. getVariables: From csv file header, creates a graph where the column name is the key, and the column index is the value. This dictionary extracts lists of values by column name. Functions that operate on Lists castDatalistToNumerical: Converts a Datalist to floating point numbers. castDatalistToInteger: Converts a Datalist to integers. getNumberslist: Extracts a Variable as a list of numbers from a Dataset via castDatalistToNumerical Functions that operate on individual data items testTypes -- Tests whether an item is INTEGER, NUMERICAL, CATEGORICAL, or TEXT Description of functionality of Utility Functions: These functions are designed to "round out" XAYAstats by providing it with additional data manipulation ('data munging' aka 'data crunching') abilities to further organize, clean-up, and transform data. Additionally there is support for basic sampling procedures, and basic probability calculations will be added soon. Sampling and audit Functions: auditDataset -- Produces an audit sample for a dataset, using a probability for inclusion based on the number of audit samples requested. The function will return approximately the number of samples requested. auditData -- Called by auditDataset. Handles the sampling process. Can be used to sample any XAYAcore format graph. sampleDataset -- Samples a dataset with or without replacement. This function can be used for various monte-carlo like simulations based on repeated sampling from a dataset. samplePopulationR -- Called by sampleDataset if replacement = 'Y'. samplePopulationNoR -- Called by sampleDataset if replacement = 'N' Probability Calculators diceroll -- approximates rolling a number of dice, each of which can have an aribtrary number of sides. For example, diceroll(5, 6) would be 5 rolls of a 6 sided dice Data munging Functions (aka data crunching ....) sortlist -- a Python 2.3 version of the "sorted" feature in Python 2.4. Sorts a copy of a list, and leaves the original list unchanged. SOURCES: Sources for specific statistics are cited in the docstring for the statistic. The rest of the statistics, just follow from their standard definitions. USAGE EXAMPLEs: (see examples under individual functions -- not implemented currently) REFERENCES: # KNOWN BUGS # None Known at this point # UNITTESTS # Unittests are in the file test_xayastats # DESIGN CONTRACTS # Design Contract for each fn/obj are currently described in comments/docstring #DOCTESTS # Used sparingly where they illustrate code above and beyond unittest """ import types from math import sqrt import xayacore import pprint import random # To allow the code to use built-in set fns in Python 2.4 or the sets module in Python 2.3 try : set except NameError: from sets import Set as set def xayastatsVersion(): return "Current version is xayastats_001 (development branch), updated September 6, 2005" # ---------- Utility Functions ---------- # These functions handle basic 'data munging' tasks, calculate probablities, # allow for sampling and audit of data # Basic math tricks for sums and series def listDatasets(objects = dir()): """ Utility function to identify currently loaded datasets. Function should be called with default parameters, ie as 'listDatasets()' """ datasetlist = [] for item in objects: try: if eval(item + '.' + 'has_key("DATA")') == True: datasetlist.append(item) except AttributeError: pass return datasetlist def auditDataset (auditsamples=1, dataset ={}, randseed=12345213): """ Wraps auditData function to operate on a XAYAstats format dataset. Type help(auditdata) for more information on the underlying function """ data = dataset['DATA'] auditsample = {} auditsample['DATA'] = auditData(auditsamples, data, randseed) auditsample['VARIABLES'] = dataset['VARIABLES'] return auditsample def auditData(auditsamples = 1, data = {}, randseed = 12345213 ): """ Samples a DATA graph based on a probability of inclusion which is the ratio auditsamples/len(data). So, one recieves approximately the number of samples selected. From Data Crunching pg 170. This function uses a fixed seed, so it's results for a particular input are repeatable. The default seed is taken from the example in Data Crunching. """ auditdata = {} random.seed(randseed) if (not data) or (len(data) <= auditsamples): auditdata = data return auditdata else: prob = float(auditsamples)/len(data) keys = data.keys() keys.remove(0) for key in keys: if random.random() < prob: auditdata[key] = data[key] auditdata[0] = data[0] return auditdata def samplePopulationR(samples, population): """ Samples a population with replacement The same population member can be sampled more than one time In this function we are imagining s rolls of a single dice whose number of sides is equal to the population size. As population approaches infinity, the dice becomes round ... but you'll be dead before it happens. """ samplesList = [ ] for sample in range(samples): randomsample = diceroll(1,population) samplesList.append(randomsample) return samplesList def samplePopulationsNoR(samples, population): """ Samples a population without replacement. If a population member is drawn once, it can not be drawn again. """ maxsamples = samples initialsamples = 0 samplesList = [ ] while (len(samplesList) < maxsamples) : if len(samplesList) == population: break thissample = diceroll(1,population) if thissample not in samplesList: samplesList.append(thissample) return samplesList def diceroll (dice, sides): """ Simulate rolling d dice of s sides each From Python Cookbook1st Edn pg 531 See the Python Cookbook entry on this function for the non-standard use of 'reduce' for example diceroll(5, 6) would be 5 rolls of a 6 sided dice """ def accumulate (x, y, s=sides): return x + random.randrange(s) return reduce(accumulate, range(dice+1) )+ dice def sortlist(iterable = [ ]): """ A utility function that returns a sorted list, but recieves the original list unchanged: from The Python Cookbook 2nd Edn pg 192 """ alist = list(iterable) alist.sort() return alist def sumIntegerSeries(start = 1, stop = 100, difference = 1): """ This formula is taken from pg 78/79 of "The Art of the Infinite" by Robert and Ellen Kaplan. It's based on an ancient trick for summing a series of numbers without having to go through all the middle-work For example if start = 1, stop = 5, and difference = 3 sumint = 1 + 4 + 7 + 13 = 35 There's also the famous "schoolboy in class exercise": Sum 1 to 100 start = 1, stop = 100, difference = 1 sumint = 1 + 2 + 3 + 4 + .... + 100 = 5050 """ sumint = (stop * (2 * start + (stop - 1) * difference)/2) return sumint # ---------- readDataSetsdsv Functions ---------- # These functions are read a csv file, and put it in a xayastats dataset form # May need further modification to put in XAYA format. def getDataset(filepath = ""): """ Reads a comma separated variables (csv) file and creates a dataset. A dataset has the following components: Variables, Data, Types, Groups, Sorts To call each componenet of a dataset, do the following. Say you have a dataset named: somedata somedata['VARIABLES'] accesses the variables in this dataset somedata['DATA'] accesses the rows of data in this dataet. """ data = {} data['VARIABLES'] = getVariables(filepath) data['DATA'] = getData(filepath) return data # Notes for version with type support added # def getDataset(filepath = "", varGraph = {}, typesGraph = {}): # """ Given a filepath, create a data table and a variables table """ # # data = {} # if varGraph == {}: # data['VARIABLES'] = getVariables(filepath) # else: # data['VARIABLES'] = varGraph # if typesGraph == {}: # data['VARTYPES'] = typesGraph # else: data['VARTYPES'] = guessTypes # data['DATA'] = getData(filepath) # return data def saveDataset(dataset = {}, filepath = ""): """ Reads a XAYA format dataset into a csv file where the first row contains the file headings and all other rows contain the data. Inverse of getDataset. Algorithm: 1. 'DATA' component of dataset translated into a list of strings -- transList 2. Write transList to a file object """ # 'DATA' component of dataset translated into a list of strings -- transList (see xayacore transGraphToList) transList = [] for key in dataset['DATA']: valueString = " ".join([str(x) + ',' for x in dataset['DATA'][key]]).rstrip(',') newline = '\n' finalString = valueString + newline transList.append(finalString) # Write transList to a file object (see xayacore writeGraph) fileObject = open(filepath, 'a+') fileObject.writelines(transList) fileObject.flush() return fileObject def htmlDataset(dataset = {}, title=""): """ Utility function to generate HTML Table from a Dataset""" content = "" for row in dataset['DATA']: content += "" for col in dataset['DATA'][row]: if row==0: content += "" content += "" content += "
" + title + "
" else: content += "" content += col if row==0: content += "" else: content += "
" return content def saveHtmlDataset(dataset = {},title="", filepath = ""): """ Saves a dataset in HTML format to a file, for easy viewing in a web browser """ html = htmlDataset(dataset, title) fileObject = open(filepath, 'a+') fileObject.write(html) fileObject.flush() return fileObject def viewData(dataset): """ Given a dataset, print it out in reader friendly format """ return pprint.pprint(dataset['DATA']) def viewVariables(dataset): """ Given a dataset, print out the heading fields in reader friendly format """ for entry in dataset['DATA'][0]: pprint.pprint(str(entry) + " : " + str(dataset['VARIABLES'][str(entry)])) def saveData(dataset={}, filepath= 'defaultFilePath'): """ Saves the 'DATA' component of a dataset as an ascii file in XAYAcore graph format. """ return xayacore.writeGraph(dataset['DATA'], filepath) def saveVariables(dataset={}, filepath= 'defaultFilePath'): """ Saves the 'VARIABLES' component of a dataset in an ascci file in XAYAcore graph format """ return xayacore.writeGraph(dataset['VARIABLES'], filepath) def readFile(filepath =""): """ Reads file object into a dictionary keyed to line number """ try: fileobj = open(filepath, 'r') except IOError: print "File not opened, could not find", filepath else: # print "-------------------------------" # print # print "Opening file", filepath -- Disabled as adds to much "gibberish" to interpreter session when opening lost of files at once datatable = {} autonumberkey = -1 for line in fileobj: autonumberkey = autonumberkey + 1 datatable[autonumberkey] = line fileobj.close() return datatable # def writeFile(filepath=" "): def getData(filepath =" "): """ Cleans up readFile dictionary so it can function as a relation tbl like dataset """ data = readFile(filepath) dataGraph = {} for key in data.keys(): dataGraph[key] = [column.strip() for column in data[key][:-1].split(",")] # strip newline at end return dataGraph def getVariables(filepath =""): """ Extracts a dictionary of VariableName: Index pairs from getData """ data = getData(filepath) variables = {} for index in range (len(data[0])): variables[data[0][index]] = [index] return variables # ---------- Sorting and Grouping Functions ---------- def getSorts(dataset = {}, sortVars = [], sortOrder = 'ASC'): """ Adds a SORTS component to a XAYA format dataset. Given a XAYA format dataset and a list of variables to sort on, returns the same dataset, with the addition of the SORTS component. The SORTS component is a list of keys in dataset['DATA'] in sort order. The default sort order is ascending ('ASC'). For descending sort order, enter 'DESC' Algorithm: # Extract the 'DATA' component of dataset to sort # Extract the index positions of variablse we wish to sort DATA by # Create a dictionary whose values include only the data we wish to sort by # Create a list of tuples where: # The first tuple element (0) is a list of the data to sort by # The second tuple element (1) is the dictionary key. # Data headings from this list of tuples (tuple 0) have been removed # Sort the tuples based on the selected variables: # If sort order is 'ASC' -- sort ascending # if sort order is 'DESC' -- sort descending # Create a new dataset # Copy the original dataset to the new dataset # Add a new 'SORTS' component to the new dataset # Return the new dataset with 'SORTS' added. Note: One can later reverse the order in which data is displayed by reversing the 'SORTS' component. So consider using the default sort-order 'ASC', unless there is a compelling reason why the natural sort for this data would be in descending order, 'DESC' """ sortdata = {} # Set up a dictionary for data that will be the basis of the sort newdataset = {} # set up a new dataset that will include the sort order data # Extract the 'DATA' component of dataset to sort data = dataset['DATA'] variables = dataset['VARIABLES'] # Extract the index positions of variables we wish to sort DATA by varindexlist = [] for vars in sortVars: varindexlist.append(variables[vars][0]) keylist = [] # list that will hold keys in the final sort order #Create a dictionary whose values include only the data we wish to sort by for row in data: rowlist = [] for var in varindexlist: rowlist.append(data[row][var]) sortdata[row] = rowlist # Create a list of tuples where: # The first tuple element (0) is a list of the data to sort by # The second tuple element (1) is the dictionary key. # Data headings from this list of tuples (tuple 0) have been removed sortdatalist = [(values, row) for row, values in sortdata.items()] sortdatalist.pop(0) # Sort the tuples based on the selected variables: sortedlist = sortlist(sortdatalist) for tuple in sortedlist: keylist.append(tuple[1]) # Extract keys in ascending sort order if not sortOrder == 'ASC': keylist.reverse() # Extract keys in descending sort order # Create a new dataset that contains the original data and the new SORTS component. for key in dataset.keys(): newdataset[key] = dataset[key] newdataset['SORTS'] = keylist return newdataset def sortedDataset(dataset={}, sortOrder='ASC'): """Sorts a Dataset in Variable Order """ # varitems = dataset['VARIABLES'].items()# Tuples with variable key, and variable index no varlist = [ (x, y) for y, x in dataset['VARIABLES'].items()] varlist.sort() variables = [] for tuple in varlist: variables.append(tuple[1]) sortlist = getSorts(dataset, variables,sortOrder)['SORTS'] sorteddataset = {} sorteddataset['DATA'] ={} sorteddataset['DATA'][0] = dataset['DATA'][0] autonumberkey = 0 for item in sortlist: autonumberkey = autonumberkey + 1 sorteddataset['DATA'][autonumberkey] = dataset['DATA'][item] for key in dataset.keys(): if key != 'DATA': sorteddataset[key] = dataset[key] return sorteddataset def getGroups( dataset = {}, groupVars = []): """ Adds a GROUPS component to a XAYA format dataset. Given a XAYA format dataset and a list of variables to group on, returns the same dataset, with the addition of the GROUPS component. The GROUPS component contains a dictionary where keysrepresen unique groups, and values are the list of rows (keys) in dataset['DATA'] that are members of that group Algorithm: # Extract the 'DATA' component of dataset to group # Extract the index positions of variables we wish to group DATA by # Create a dictionary whose values include only the data we wish to group by # Get a list of all values to aggregate into groups # Create the set of unique groups # Assign rows of data to a particular group """ groupdata = {} # Set up a dictionary for data that will be the basis of the grouping newdataset = {} # set up a new dataset that will include the sort order data # Extract the 'DATA' component of dataset to group data = dataset['DATA'] copydata = data.copy() # Make a shallow copy data -- so 'del' in next statement leaves original data untouched del copydata[0] # Get rid of the headings row in copydata -- leaves data untouched variables = dataset['VARIABLES'] # Extract the index positions of variables we wish to group DATA by varindexlist = [] for vars in groupVars: varindexlist.append(variables[vars][0]) # Create a dictionary whose values include only the data we wish to group by for row in copydata: rowlist = [] for var in varindexlist: rowlist.append(copydata[row][var]) groupdata[row] = rowlist # Get a list of all values to aggregate into groups groupslist = groupdata.values() # Result is a list of lists -- we want a list of tuples, so can use the groups as keys in a dictionary # Convert groups from lists to tuples grouptuples = [tuple(group) for group in groupslist] # Create the set of unique groups groupset = set(grouptuples) # Initialize a dictionary to hold the groups groupsdict = {} for group in groupset: groupsdict[group] = [ ] # Assign rows of data to a particular group for row in groupdata: for member in groupset: if tuple(groupdata[row]) == member: groupsdict[member].append(row) # Create a new dataset that contains the original data and the new GROUPS component. for key in dataset.keys(): newdataset[key] = dataset[key] newdataset['GROUPS'] = groupsdict return newdataset def getTypes(dataset = {}, typesGraph = {}): """ Adds a TYPES component to a dataset, and casts the data based on the typesGraph (a XAYA format graph) whose nodes are the variable names and whose edges are the data type for that variable. Default data type is string. Data of types 'CATEGORICAL' and 'TEXT', remains as string. Data of type 'INTEGER' is cast to via the python fn int(). Data of type 'NUMERICAL' is cast via the python fn float() Algorithm: # Check that variables in dataset match variables in typesGraph # Check that there are no unknown types in typesGraph # Extract a sorted list of the index positions and names of each variable # Create a dictionary with data values cast to appropriate types # 'TEXT' and 'CATEGORICAL' are left as is # 'INTEGER' is cast via int() fn # 'NUMERICAL' is cast via float() fn # Create a new dataset that contains the original data and the new TYPES component. """ # Set up of variables needed later in function data = dataset['DATA'] copydata = data.copy() # Make a shallow copy data -- so 'del' in next statement leaves original data untouched del copydata[0] # Get rid of the headings row in copydata -- leaves data untouched variables = dataset['VARIABLES'] vars = variables.keys() typevars = typesGraph.keys() typedata = {} # Set up a dictionary for data that is cast to type newdataset = {}# set up a new dataset that will include the Types component tocast = ['INTEGER', 'NUMERICAL'] tonotcast = ['TEXT', 'CATEGORICAL'] alltypes = ['INTEGER', 'NUMERICAL','TEXT', 'CATEGORICAL'] # Check that variables in dataset match variables in typesGraph for variable in vars: if variable not in typevars: print 'There is a mismatch between Dataset Variables and the typesGraph' return newdataset # Check thatthere are no unknown types in the typesGraph typevalues = [value[0] for value in typesGraph.values()] for type in typevalues: if type not in alltypes: print 'There is an unknown data type in typesGraph' return newdataset # Extract a sorted list of the index positions and names of each variable varitemslist = [(value[0], key) for key, value in variables.items()] varitemslist.sort() # Create a dictionary with data values cast to appropriate types typedata[0] = data[0] for row in copydata: rowlist = [] for varitem in varitemslist: castitem = data[row][varitem[0]] typeitem = typesGraph[varitem[1]][0] if typeitem in tonotcast: # Leave 'TEXT' and 'CATEGORICAL' as is rowlist.append(castitem) elif typeitem == 'INTEGER': rowlist.append(int(float(castitem))) # Need to cast to float first for things like '1839.00' else: rowlist.append(float(castitem)) typedata[row] = rowlist # Create a new dataset that contains the original data and the new TYPES component. for key in dataset.keys(): if key != 'DATA': newdataset[key] = dataset[key] newdataset['TYPES'] = typesGraph newdataset['DATA'] = typedata return newdataset def filterDataset(dataset = {}, filterVar = '', filterValues = []): """ Filters a XAYA format dataset on a single variable for a list of filter values. Given a XAYA format dataset, a filterVar and a list of filterValues, this function returns a dataset containing only those rows that contain the filter values. If you wish to filter on multiple variables -- you can run this function several times. Examples: >>> filteredbooks1 = xayastats.filterDataset(books, 'PositiveReviews', [250]) # returns only those rows from books dataset which have 250 positive reviews >>> filteredbooks2 = xayastats.filterDataset(books, 'Author', ['Mishtu', 'Ian']) # returns only those rows from books dataset whose authors are either Ian or Mishtu # Note -- be if the dataset is cast to datatypes, you must have the correct datatype in your list. For example, 250 is not the same as '250' -- one is an integer and the other is a string Algorithm: # Extract the 'DATA' component of dataset to filter # Extract the index positions of variable we wish to filter by # Build a filtered dataset" By looping through data rows and checking the variable (at index position) And adding only those rows whose value is in the filterValues list """ # Set up of variables needed later in function data = dataset['DATA'] datarows = data.keys()[1:] variables = dataset['VARIABLES'] varIndex = variables[filterVar][0] vars = variables.keys() filterdataset = {}# set up a filter dataset filterdataset['DATA'] = {} filterdataset['DATA'][0] = data[0] for row in datarows: if data[row][varIndex] in filterValues: filterdataset['DATA'][row] = data[row] # Create a new dataset that contains the original data and the new TYPES component. for key in dataset.keys(): if key != 'DATA': filterdataset[key] = dataset[key] return filterdataset # ---------- Data Extraction and Type-Casting Functions ---------- def getDatalist (dataGraph = {}, varGraph = {}, variable = "" ): """ Produces a data list for the via dataset and variables dictionary for a named variable""" # List needs to be converted if data is to be treated as Integer or Numerical (floats) # Default is to treat data as Categorical data = dataGraph vars = varGraph var = variable datalist = [] """ Extracts a Variable Column as a datalist, using dataset and column dictionaries""" keys = data.keys() keys.remove(0) for key in keys: datalist.append(data[key][varGraph[var][0]]) return datalist def castDatalistToNumerical(list = []): """ Converts a list to float number values """ datalist = list numberlist = [] for item in datalist: numberlist.append(float(item)) return numberlist def getNumberslist(dataGraph = {}, varGraph = {}, variable = ""): return castDatalistToNumerical(getDatalist(dataGraph, varGraph, variable)) def castDatalistToInteger(list = []): """ Converts a list to float number values """ datalist = list numberlist = [] for item in datalist: numberlist.append(int(item)) return numberlist def getIntegerslist(dataGraph = {}, varGraph = {}, variable = ""): return castDatalistToInteger(getDatalist(dataGraph, varGraph, variable)) def testTypes(item = None): """ A set of tests to distinguish between the various XAYAstats data types: TEXT, CATEGORICAL, INTEGER, NUMERICAL Definition: Item is None or empty string, there is no data Items must be either strings or numbers according to Python letters are not numbers text has white space integers do not straggle over the decimal line """ # Default case -- item is None, or empty string returns no data if item == None or item == "": return 'NODATA' else: # Items must be either strings or numbers according to Python try: float(item) except TypeError: print "Item is not one of the following: TEXT, CATEGORICAL, INTEGER, NUMERICAL" return 'UNDEFINED' # Letters are not numbers except ValueError: # text has white space if list(item).count(' ') == 0: return 'CATEGORICAL' else: return 'TEXT' # integers do not straggle over the decimal line if int(float(item)) == float(item): # int(float(item) prevents ValueError for items like '24.0' return 'INTEGER' else: return 'NUMERICAL' # ---------- genStats Functions ---------- def getSummarystats(dataset = {}, variable = ""): """ Calculates a set of basic statistics on variable in a dataset and returns results as a dictionary of summary statistics. Note results are for viewing ease, but are not in XAYAstats dataset format. """ data = getNumberslist(dataGraph = dataset['DATA'], varGraph = dataset['VARIABLES'], variable = variable) return summarystat(data) def summarizedDataset(dataset={}): """Summarizes a dataset. Stage 1: Summarizes numerical variables Stage 2: Summarizes numerical variables, grouped by categorical variables. """ # Get Variables into a list, sorted by variable position varlist = [ (x, y) for y, x in dataset['VARIABLES'].items()] varlist.sort() variables = [] for tuple in varlist: variables.append(tuple[1]) # Loop through variables summarydata = {} summarydata['DATA'] = {} summarydata['DATA'][0] = ["Variable", "Min", "Max", "Mean", "Median", "Range", "SD", "CV"] summarydata['VARIABLES'] = {"Variable" : [0], "Min": [1], "Max": [2], "Mean": [3], "Median": [4], "Range": [5], "SD": [6], "CV": [7]} autonumberkey = 0 for var in variables: # Test that variables are numerical datalist = getDatalist(dataset['DATA'],dataset['VARIABLES'], var) datalist.sort() try: float(datalist[0]) except ValueError: continue try: float(datalist[-1]) except ValueError: continue # If numerical, calculate summary stats autonumberkey = autonumberkey + 1 summarygraph = getSummarystats(dataset,var) # Move data from individual graphs to dataset format summarydatalist = [] summarydatalist.append(var) summarydatalist.append(summarygraph['min'][0]) summarydatalist.append(summarygraph['max'][0]) summarydatalist.append(summarygraph['mean'][0]) summarydatalist.append(summarygraph['median'][0]) summarydatalist.append(summarygraph['range'][0]) summarydatalist.append(summarygraph['sd'][0]) summarydatalist.append(summarygraph['cv'][0]) summarydata['DATA'][autonumberkey] = summarydatalist return summarydata def getHistograph(dataset = {}, variable = ""): """ Calculates a histogram-like summary on a variable in a dataset and returns a dictionary. The keys in the dictionary are unique items for the selected variable. The values of each dictionary key, is the number of times the unique item occured in the data set """ data = getDatalist(dataGraph = dataset['DATA'], varGraph = dataset['VARIABLES'], variable = variable) return histograph(data) def sortHistograph(histograph = {}, sortOrder = 'DESC'): """ Sorts an histograph to make it easier to view. The default is descending (DESC) order (most to least frequent). Enter 'ASC' to produce an histogram in ascending order (least to most frequent). """ sortedHist = [] sortedHist = sortlist([(value,key)for key,value in histograph.items()]) if not sortedHist == 'ASC': sortedHist.reverse() return sortedHist def minstat(list = [ ]): """ Returns minimum value in a list or tuple""" return min(list) def maxstat(list = [ ]): """ Returns minimum value in a list or tuple""" return max(list) def rangestat(list = [ ]): """ Returns the absolute range of values in a list or tuple""" return abs(maxstat(list) - minstat(list)) def sumstat(*L): """ Sums a list or a tuple L Modified from pg 80 of Web Programming in Python """ if len(L) == 1 and \ ( isinstance(L[0],types.ListType) or \ isinstance (L[0], types.TupleType) ) : L = L[0] s = 0.0 for k in L: s = s + k return s # Consider changing so uses the much simpler built-in sum() fn. def sumsqstat(*L): """ Sum of squares for a list or a tuple L""" if len(L) == 1 and \ ( isinstance(L[0],types.ListType) or \ isinstance (L[0], types.TupleType) ) : L = L[0] s = 0.0 for k in L: sq = k * k # Calculates the Square s = s + sq # Sums the Square return s # Consider changing so uses the much simpler built-in sum() fn def meanstat(n): """ Calculates mean for a list or tuple """ mean = sumstat(n)/float(len(n)) return mean def medianstat(n): """ Calculates median for a list or tuple Modified from pg 347 Zelle's CS intro book, Python Programming, an Introduction to Computer Science """ s = n[:] s.sort() size = len (n) midPos = size/2 if size % 2 == 0: median = (s[midPos] + s[midPos - 1]) /2.0 else: median = s[midPos] return median def dlist(list = [ ]): """ Calculates a list of Deviations from Mean""" diffs = [ ] mean = meanstat(list) for k in list: diff = k - mean diffs.append(diff) return diffs def varstat(list = [ ]): """ Calculates the Variance for a list or tuple L""" devs = dlist(list) sumdevs = sumsqstat(devs) var = sumdevs/(len(list)-1.0) return var def sdstat(list = [ ]): """ Calculates the Variance for a list or tuple """ sd = sqrt(varstat(list)) return sd def cvstat(list = [ ]): """ Calculates the Coefficient of Variation""" cv = sdstat(list)/meanstat(list) * 100 return cv def summarystat(list = [ ]): """Summary Statistics for a List or Tuple""" statsGraph = {} min = minstat(list) max = maxstat(list) range = rangestat(list) mean = meanstat(list) median = medianstat(list) sd = sdstat(list) cv = cvstat(list) statsGraph['min'] = [min] statsGraph['max'] = [max] statsGraph['range'] = [range] statsGraph['mean'] = [mean] statsGraph['median'] = [median] statsGraph['sd'] = [sd] statsGraph['cv'] = [cv] return statsGraph #Alternate summarystat implementation (as Xaya format graph) # valuelist = [min,max,mean,median,sd,cv] # keytuple = ('min','max','mean','median','sd','cv') # for key in range(len(keytuple)): # statsGraph[key] = valuelist[key] # return statsGraph # screwing up on indices and tuple list dict diffs def zlist(list = [ ]): """Calculates z scores for a list or tuple""" zeds = [] mean = meanstat(list) sd = sdstat(list) for k in list: zed = (k - mean)/sd zeds.append(zed) return zeds def rlist( list = [ ]): """Calculates Range scores with values between 0 and 1""" rrs = [] min = minstat(list) range = rangestat(list) for k in list: r = (k - min)/float(range) rrs.append(r) return rrs def histograph(list = [ ]): """ A histograph is a histogram where the unique items in a list are keys, and the counts of each item is the value for that key. The format is in XAYA graph format (values are lists) Definition: Get the set of distinct items in a list Count the number of times each item appears """ histoGraph = {} # Get the set of distinct items in a list distinctItems = set(list) # Count the number of times each item appears for item in distinctItems: histoGraph[item] = [list.count(item)] return histoGraph # Add assertions # assert meanstat(zscores) == 0 # assert sdstat(zscores) == 1.0 # assert minstat(rscores) == 0.0 # assert maxstat(rscores) == 1.0