Alan Calvitti PhD, EECS
Statigrafix aims to provide researchers and investigators in a variety of application domains (domain experts) with
exploratory data analysis services. These combine rapid-prototype development of data visualization tools with close
collaboration, to aid hypothesis formation (knowledge discovery) and the communication of results through the production of
figures and posters.
Strategy for Knowledge Discovery @ Statigrafix
The nascent analytics industry, for which San Diego is a hub, tends to focus on large-scale software development of exploratory data analysis (EDA)
and confirmatory data analysis (CDA) applications, typically for large to very large electronic data sets.
Because of the large footprint of these applications, they are ill-suited to help individual investigators or small teams of researchers with
their EDA needs. In this context, it's worth pointing out that the winner of the $1,000,000 Netflix Prize, a competition to improve the accuracy of
Netflix's movie recommendation system via machine learning methods, only narrowly exceeded the 10% improvement benchmark originally stipulated
(10.06%, after nearly three years of competition).
In contrast, Statigrafix focuses on the needs of individual investigators and small teams to gain insight from data via a complementary
approach, summarized as follows.
Exploratory Data Analysis. First distinguished from Confirmatory Data Analysis (CDA) as a separate
statistical discipline by John Tukey (1915-2000), the father of 21st century
graphical displays, EDA focuses on hypothesis formation and knowledge discovery in datasets and is complementary to CDA.
Data Visualization. Statigrafix's core service is rapid-prototype development of
Data Visualization (VIZ) tools enabled by the flexibility of
Mathematica (wolfram.com). VIZ tools include relatively simple templates such as
scatterplot matrices but also custom structured graphics such as sophisticated timeline or calendar-based plots for
time-series data. A key precept of VIZ is to give domain experts first an overview of the data, followed by additional
filtering and detail on demand. VIZ allows the human visual system to effectively identify informative events,
trends and patterns in datasets. As Howard Wainer points out in his text "Graphic Discovery," two centuries have passed since Playfair's
pioneering efforts, which followed the idea that a graph can tell us things easily that might not have been seen otherwise.
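The "overview first, then filtering and detail on demand" precept can be illustrated with a minimal sketch. It is written in plain Python rather than Mathematica, and the records, field layout and function names (`overview`, `filter_rows`, `detail`, `bar_chart`) are hypothetical, chosen only for illustration; this is not the tooling Statigrafix uses.

```python
from statistics import mean

# Hypothetical records: (day, clinic, wait_minutes). All values are made up.
records = [
    ("Mon", "A", 12), ("Mon", "B", 30), ("Tue", "A", 15),
    ("Tue", "B", 45), ("Wed", "A", 11), ("Wed", "B", 50),
]

def overview(rows):
    """Overview: reduce the whole dataset to one mean wait per clinic."""
    groups = {}
    for _day, clinic, wait in rows:
        groups.setdefault(clinic, []).append(wait)
    return {clinic: mean(waits) for clinic, waits in sorted(groups.items())}

def bar_chart(summary, minutes_per_mark=5):
    """Render the overview as text bars so differences are visible at a glance."""
    return [f"{name} {'#' * round(value / minutes_per_mark)}"
            for name, value in summary.items()]

def filter_rows(rows, clinic):
    """Filter: keep only the clinic that stood out in the overview."""
    return [row for row in rows if row[1] == clinic]

def detail(rows, day):
    """Detail on demand: the individual records behind one day."""
    return [row for row in rows if row[0] == day]

summary = overview(records)          # step 1: overview
chart = bar_chart(summary)
subset = filter_rows(records, "B")   # step 2: filter
details = detail(subset, "Wed")      # step 3: detail on demand

if __name__ == "__main__":
    print("\n".join(chart))
    print(details)
```

Each step narrows the view: the overview shows which group deserves attention, the filter isolates it, and the detail step returns the raw records for a single day.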
Robustness to missing values. Real-world datasets contain varying degrees of missing values. Although missing values
are conceptually distinct from existing data, a typical step in CDA is to impute them. In contrast, VIZ approaches don't
require imputation: it is often possible to structure graphical elements in such a way that patterns in existing data
remain visible despite the gaps.
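As a rough illustration of displaying data without imputing it, the following Python sketch draws each series as a row of glyphs and leaves a blank wherever a value is missing. The series names, values and the three-level glyph scale are assumptions made up for this example, and the code expects at least one observed value.

```python
# Hypothetical daily readings; None marks a missing value. Data are made up.
series = {
    "sensor_1": [3, 4, None, None, 5, 6, 7],
    "sensor_2": [None, 2, 2, 3, None, 3, 4],
}

LEVELS = "-=#"  # glyphs for low, medium and high observed values

def glyph(value, lo, hi):
    """Map one value to a glyph; a missing value stays blank (no imputation)."""
    if value is None:
        return " "
    if hi == lo:
        return LEVELS[-1]
    # Scale to 0..len(LEVELS)-1 and round to the nearest level.
    index = int((value - lo) / (hi - lo) * (len(LEVELS) - 1) + 0.5)
    return LEVELS[index]

def render_row(values, lo, hi):
    """Render one series as a string with one character per time step."""
    return "".join(glyph(v, lo, hi) for v in values)

# A shared scale, computed from observed values only.
observed = [v for values in series.values() for v in values if v is not None]
lo, hi = min(observed), max(observed)
rows = {name: render_row(values, lo, hi) for name, values in series.items()}

if __name__ == "__main__":
    for name, cells in rows.items():
        print(f"{name:>8} |{cells}|")
```

The gaps appear as blanks in the display, so the reader sees both the rising trend in the observed values and where data are absent, and no value is invented to fill a gap.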
Iterative Exploration. EDA seems most effective when structured as an iterated dialogue between
domain expert and analyst. Each iteration can be broken down into stages. A typical sequence comprises: 1. Data cleaning and aggregation.
2. Design and programming of graphics, possibly annotated. 3. Joint exploration of patterns with domain experts. This typically
results in insight into the data, suggestions for subsequent exploration and requests for additional details.
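The first stage of such an iteration, data cleaning and aggregation, might look like the following Python sketch. The record layout, the rule of dropping undated records and the per-date summary are assumptions for illustration only; a real project would agree on these rules with the domain expert.

```python
from collections import defaultdict

# Hypothetical raw export: every field arrives as text. Data are made up.
raw = [
    {"date": "2010-03-01", "value": " 4.0 "},
    {"date": "2010-03-01", "value": "6"},
    {"date": "2010-03-02", "value": "n/a"},
    {"date": "2010-03-02", "value": "5.5"},
    {"date": "", "value": "7"},
]

def clean(rows):
    """Drop records without a date; parse values, keeping failures as None."""
    cleaned = []
    for row in rows:
        date = row["date"].strip()
        if not date:
            continue  # assumption: an undated record cannot be placed on a timeline
        try:
            value = float(row["value"])
        except ValueError:
            value = None  # recorded as missing rather than imputed
        cleaned.append((date, value))
    return cleaned

def aggregate(rows):
    """Summarize per date: mean of observed values and count of missing ones."""
    groups = defaultdict(list)
    for date, value in rows:
        groups[date].append(value)
    summary = {}
    for date, values in groups.items():
        present = [v for v in values if v is not None]
        summary[date] = {
            "mean": sum(present) / len(present) if present else None,
            "n_missing": len(values) - len(present),
        }
    return summary

daily = aggregate(clean(raw))

if __name__ == "__main__":
    for date, stats in sorted(daily.items()):
        print(date, stats)
```

Keeping the count of missing values next to each summary lets the later graphics stage show how much data each point rests on.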
The idea for Statigrafix emerged from my experience as a Postdoctoral Fellow working on a portfolio of data analysis
projects at the UCSD School of Medicine, VA San Diego Healthcare System and the California Institute for Telecommunications and Information
Technology. The application domains ranged from healthcare processes and medical informatics to biomedicine and telecom - all examples of