[Gambas-user] Another Best Approach Question: Statistics over an array of objects

Tobias Boege taboege at ...626...
Thu Aug 8 12:51:28 CEST 2013


On Thu, 08 Aug 2013, Bruce wrote:
> On Wed, 2013-08-07 at 23:56 +0930, Bruce wrote:
> > On Wed, 2013-08-07 at 15:58 +0200, Tobias Boege wrote:
> > > On Wed, 07 Aug 2013, Bruce wrote:
> > > > I'm looking for good ideas again I'm afraid.
> > > > 
> > > > 
> > > > I have an array of objects that can be best described as a set of
> > > > categories with an associated value.  Something along the lines of 
> > > > [Cat1:String, Cat2:String, Cat3:String, Value:Float].
> > > > 
> > > > The general idea is that the user would select one of the categories and
> > > > then the project would calculate a set of statistics across that
> > > > category.  For example the categories could be "Age", "Sex", "Height"
> > > > and the Value may be, say, weight.
> > > > 
> > > > What I am trying to do is develop a generic module/class accepting the
> > > > input array that will return another array of objects being the
> > > > statistical analysis of the input array across the specified category.
> > > > The statistics are fairly basic (at this stage) being the average for each
> > > > category, the sample standard deviation and the sample standard error.
> > > > 
> > > > Generally the input array length is reasonably short, ~30 to ~300 items.
> > > > Also the category domains are demonstrably short, between 3 and ~10
> > > > identities.
> > > > 
> > > > I could (and have done) use the database (postgresql) statistics
> > > > functions and re-query the entire dataset given the user category
> > > > selection.  However, the time to execute this is unacceptably slow (the
> > > > full dataset is over 3,000,000 rows). Furthermore, I would have to
> > > > devise a generic way to build the specific query required each time.
> > > > 
> > > > Another way could be to develop an interface to r or something but I am
> > > > hesitant to embark on that path given my knowledge of stats libraries
> > > > like that.
> > > > 
> > > > So, just looking for a "good idea".
> > > > 
> > > 
> > > Sorry, I don't understand... You want to give a Variant[] to a class, like:
> > > 
> > > Public Sub btnGiveStats_Click()
> > >   Dim hStats As Stats
> > > 
> > >   hStats = Stats.Give(["Age", "Sex", "Height", fWeight])
> > > End
> > > 
> > > Right?
> > > 
> > > What is this array supposed to signify? On what data shall it operate? I
> > > mean: is there a table of persons (with fields Age, Sex, Height, ...) and
> > > the function shall count ... something? Is the "weight" used to weigh the
> > > average figure?
> > > 
> > > Thinking about it further, I admit that I don't understand anything... at
> > > all. :-)
> > > 
> > > Regards,
> > > Tobi
> > 
> > OK, rephrased.
> > 
> > I'm looking for good ideas to create a generic statistics module with a
> > function:
> >  Analyse(category as Integer, data_array as <someclass>[]) as Analysis[]
> > 
> > <someclass>[] is an array of objects, these objects consist of an
> > unknown number of category properties and a value property.  Analysis is
> > a class that exhibits some basic statistics of "value" across the specified
> > "category".
> > 
> > In short, Analysis(category,data_array) is returning a kind of a
> > crosstab of the value against the selected category.
> > So we could get a user directive to analyse "Weight" (the value)
> > across "Sex" (the category) and the returned array would be
> > [{"2years",12.3432, 1.123, 0.34}, {"3years",
> > 14.1643,1.112,0.01},{"4years",16.954,2.001,0.13}, etc]
> > where the {} contents are the properties of the Analysis class, viz
> > Category,Average,StdDev,StdErr.
> > 
> > The question is whether it would be best to write statistical analysis routines from
> > scratch or is there a better (or easier) way using either 
> > a) "known" libraries, or

I don't know any... I'd write this as a Gambas module out of pure naivety.

> > b) developing a set of generic methods to use the underlying database stats functions
> > c) a published gambas component?
> > 
> > regards
> > Bruce
> > 
> Oops, I meant
> > In short, Analysis(category,data_array) is returning a kind of a 
> > crosstab of the value against the selected category.
> > So we could get a user directive to analyse "Weight" (the value)
> > across "AGE" (the category) and the returned array would be
> > [{"2years",12.3432, 1.123, 0.34}, {"3years",
> > 14.1643,1.112,0.01},{"4years",16.954,2.001,0.13}, etc]
> > where the {} contents are the properties of the Analysis class, viz
> > Category,Average,StdDev,StdErr.
> B

To my ears, this sounds more like an introspection problem than a
mathematical one, right?

For a given category (age), there are multiple properties in each object,
each of which contains a value (I assume that the property names are the
same across all objects, though).

The algorithm would be, IIUC:

1. Ask one of the objects to give you the names of all properties belonging
   to the category.
2. Enumerate all of these property name strings (sProp)
2.1. For Each object In data_array, get the value of the current property
2.2. Do the math

As a function it would look like:

Public Struct Analysis
  Name As String
  Average As Float
  StdDev As Float
  StdErr As Float
End Struct

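' Named Analyse (as in your signature) so it doesn't clash with the struct
Public Function Analyse(iCat As Integer, aObjs As Object[]) As Analysis[]
  Dim aResult As New Analysis[]
  Dim hAnalysis As Analysis
  Dim sProp As String
  Dim hObject As Object
  Dim fValue As Float

  ' Get the names of all properties in the objects which are associated with
  ' the given category (as a String[])
  For Each sProp In aObjs[0].AssociatedProperties(iCat)
    hAnalysis = New Analysis
    hAnalysis.Name = sProp

    For Each hObject In aObjs
      ' Read the property by name; the Variant result is converted to Float
      fValue = Object.GetProperty(hObject, sProp)
      ' Do the math: accumulate fValue here, then fill in
      ' hAnalysis.Average, hAnalysis.StdDev and hAnalysis.StdErr
    Next
    aResult.Add(hAnalysis)
  Next
  ' Return the whole result array, not just the last element
  Return aResult
End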
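
For the "Do the math" step, a minimal sketch might look like the following.
BasicStats is a hypothetical helper name of mine (not something from an
existing component), and it assumes you first collect the values of one
property into a Float[]. It computes the mean, the sample standard deviation
(dividing by n - 1) and the standard error (standard deviation divided by
the square root of n):

' Hypothetical helper, untested sketch.
' Returns [mean, sample standard deviation, standard error].
Public Function BasicStats(aValues As Float[]) As Float[]
  Dim fValue As Float
  Dim fSum As Float
  Dim fSqDev As Float
  Dim fMean As Float
  Dim fStdDev As Float
  Dim fStdErr As Float
  Dim iCount As Integer

  iCount = aValues.Count
  ' The sample standard deviation is undefined for fewer than two values
  If iCount < 2 Then Error.Raise("Need at least two values")

  ' First pass: the mean
  For Each fValue In aValues
    fSum += fValue
  Next
  fMean = fSum / iCount

  ' Second pass: the sum of squared deviations from the mean
  For Each fValue In aValues
    fSqDev += (fValue - fMean) ^ 2
  Next

  fStdDev = Sqr(fSqDev / (iCount - 1))
  fStdErr = fStdDev / Sqr(iCount)

  Return [fMean, fStdDev, fStdErr]
End

The three returned figures could then be copied into hAnalysis.Average,
hAnalysis.StdDev and hAnalysis.StdErr. With only ~30 to ~300 items per array,
the two passes should not matter performance-wise.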

The difficult part, and the one with the biggest maintenance burden, is the
AssociatedProperties() function: each class which you want to analyse has to
implement it. I think of it like:

' This is SomeClass.class

Property 2years As Integer
Property 3years As Integer
Property 4years As Integer

Public Function AssociatedProperties(iCat As Integer) As String[]
  Select Case iCat
    Case CategoryAge
      Return ["2years", "3years", "4years"]
    Case CategorySex
      Return ...
    Case Category...
      Return ...
  End Select
End

Of course, it would be much easier if you followed a specific pattern, e.g.
if you named all "age" properties like "Xyears", you could iterate over all
symbols in a class and dynamically find the property names:

Property 2years As Integer
Property 3years As Integer
Property 4years As Integer

Public Function AssociatedProperties(iCat As Integer) As String[]
  Dim aProps As New String[]
  Dim sSym As String

  For Each sSym In Object.Class(Me).Symbols
    If Object.Class(Me)[sSym].Kind = Class.Property Then
      Select Case iCat
        Case CategoryAge
          If sSym Ends "years" Then aProps.Add(sSym)
        Case CategorySex
          ...
        Case ...
          ...
      End Select
    Endif
  Next
  Return aProps
End

I hope this mail isn't overkill and helps you a bit further :-)

Regards,
Tobi



