So back when this thread was active, I mentioned that I was concerned that Ball & Newman's attempt to measure status was overfitting the data. Eric replied:

The only way to really know whether or not you're doing it is to have
some data held out for testing/cross-validation. For example, you
could erase a random 10% of the edges, do their procedure on the
remaining 90%, and then see how many rank violations you get in the
invisible 10% of edges.

Say cross-validation shows that they are overfitting.  How hard is it to take the next step and change the algorithm so it isn't overfitting any more?  
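
Putting Eric's suggestion together with the sketches in the quoted messages
below (split_graph, min_violation_ranking, and count_violations are
hypothetical names defined down-thread), the check is schematically just:

    G_train, test_edges = split_graph(G, test_frac=0.1)
    rank = min_violation_ranking(G_train)
    print(count_violations(rank, test_edges), "violations on held-out edges")

If the held-out violation count gets worse while the fit to the training
edges gets better, that's the overfitting signature.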

On Thu, Oct 11, 2012 at 5:45 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
Updated to add more details.

On Thu, Oct 11, 2012 at 12:29 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
> Is anyone interested in working on the sampling algorithm I have been
> writing about? If not, I may finish it myself, but I don't think I'll
> have time for a week or so. If people are interested but have
> questions, you can email me off-list if you want?
>
> On Wed, Oct 10, 2012 at 10:28 PM, Michael Bishop <michaelbish@gmail.com> wrote:
>> Thanks for the link to the Kaggle competition, Kevin.  I'll have some fun
>> looking at that.  Of course, the approach Ball and Newman are taking is
>> only one of many. Check out http://statnet.org/ for an R package that can
>> do some fancy things... I think the heavy lifting is implemented in C.
>>
>>
>>
>> On Wed, Oct 10, 2012 at 10:12 PM, Kevin Wilson <khwilson@gmail.com> wrote:
>>>
>>> Just to clarify: the phenomenon I said is very well-known is the
>>> directedness of many kinds of social graphs. This particular example of
>>> friendships among school children, and its social science implications,
>>> is rather nice.
>>>
>>> Kevin
>>>
>>>
>>> On Wed, Oct 10, 2012 at 10:36 PM, Kevin Wilson <khwilson@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm Kevin. I met Eric when he came to visit Trevor at Knewton. Trevor
>>>> invited me along to this list.
>>>>
>>>> Anyway, the discussion is interesting, and I would point out that the
>>>> phenomenon observed by this paper is very well-known. See, for instance,
>>>> this Kaggle competition and the ensuing discussion:
>>>> http://www.kaggle.com/c/FacebookRecruiting
>>>>
>>>> One interesting question beyond this paper, and in the direction of where
>>>> the discussion has been going, would be to ask the same questions in
>>>> larger and larger social networks. Some example questions:
>>>>
>>>> 1. Most edges run in the direction of status. But how many triangles are
>>>> there? That is, suppose VLS has very low status, LS has low status, and
>>>> HS has high status. A triangle would consist of VLS claiming friendship
>>>> with LS, LS claiming friendship with HS, and VLS claiming friendship
>>>> with HS. One may expand from triangles to longer chains (where you ask
>>>> about n-length chains from VLS to Lady Gaga next to a direct edge from
>>>> VLS to Lady Gaga).
>>>> 2. Can you distinguish types of social networks by the length of these
>>>> chains?
>>>> 3. What happens when you merge two friendship networks? Do relative
>>>> statuses remain the same?
>>>>
>>>> Each of these would be a paper, but you'd probably have to vastly improve
>>>> the code. I think data sets exist to answer all three.
>>>>
>>>> Kevin
>>>>
>>>> On Wed, Oct 10, 2012 at 1:02 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
>>>>>
>>>>> Here is a writeup of some notes related to the tricky algorithms
>>>>> question I was trying to pose. Hopefully it will be helpful to anyone
>>>>> who is thinking about reimplementing the code from the paper.
>>>>>
>>>>> On Sun, Oct 7, 2012 at 2:27 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
>>>>> > GAH, left out a minus sign:
>>>>> >
>>>>> > beta = e^[ - (j-i)^2 ]
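>>>>> >
>>>>> > In code, this initialization might look like the following (a sketch;
>>>>> > these are unnormalized scores rather than probabilities, so in
>>>>> > practice you would want to squash or normalize them into [0, 1]):
>>>>> >
>>>>> > import math
>>>>> >
>>>>> > def alpha(i, j):
>>>>> >     # score for an unreciprocated claim from the node ranked #i to #j
>>>>> >     return math.exp(j - i)
>>>>> >
>>>>> > def beta(i, j):
>>>>> >     # score for a reciprocated friendship between nodes ranked #i, #j
>>>>> >     return math.exp(-(j - i) ** 2)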
>>>>> >
>>>>> > On Sun, Oct 7, 2012 at 2:26 PM, Eric Purdy <epurdy@uchicago.edu>
>>>>> > wrote:
>>>>> >> Or you can also skip step 2, if you initialize alpha and beta to be
>>>>> >> something simple and reasonable, like
>>>>> >>
>>>>> >> (probability of an unreciprocated edge from #i to #j)
>>>>> >> alpha = e^[ j-i ]
>>>>> >>
>>>>> >> (probability of a reciprocated edge between #i and #j)
>>>>> >> beta = e^[ (j-i)^2 ]
>>>>> >>
>>>>> >> (I might have alpha and beta backwards from how they are in the
>>>>> >> paper?)
>>>>> >>
>>>>> >> Before I was saying that alpha and beta should be modeled as
>>>>> >> polynomials, but it probably makes more sense to model (log alpha) and
>>>>> >> (log beta) as polynomials, as the equations above do. That makes it
>>>>> >> more like logistic regression, which I think is supposed to work well
>>>>> >> in the social sciences?
>>>>> >>
>>>>> >> On Sun, Oct 7, 2012 at 12:35 PM, Eric Purdy <epurdy@uchicago.edu>
>>>>> >> wrote:
>>>>> >>> So, I'm still a bit too busy to just reimplement their code, but
>>>>> >>> here's a sketch of roughly how I would do it. Maybe someone else will
>>>>> >>> feel like writing this code? It sounds like we could probably use it
>>>>> >>> to write a publishable paper, if that's a motivation.
>>>>> >>>
>>>>> >>> The easiest would be to do this in Python with networkx.
>>>>> >>>
>>>>> >>> 1. Write code to separate the edges into training and test sets.
>>>>> >>> Basically, independently label each edge "test" with probability 10%
>>>>> >>> and "training" with probability 90%. Then make a new graph that
>>>>> >>> doesn't have any of the test edges. This is the graph that you will
>>>>> >>> run EM on.
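>>>>> >>>
>>>>> >>> A minimal sketch of this step (split_graph is my name for it; G is
>>>>> >>> assumed to be a networkx DiGraph):
>>>>> >>>
>>>>> >>> import random
>>>>> >>>
>>>>> >>> def split_graph(G, test_frac=0.1, seed=0):
>>>>> >>>     # Independently label each directed edge "test" with probability
>>>>> >>>     # test_frac; return the training graph and the held-out edges.
>>>>> >>>     rng = random.Random(seed)
>>>>> >>>     test_edges = [e for e in G.edges() if rng.random() < test_frac]
>>>>> >>>     G_train = G.copy()
>>>>> >>>     G_train.remove_edges_from(test_edges)
>>>>> >>>     return G_train, test_edges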
>>>>> >>>
>>>>> >>> 2. Get a ranking that has the minimum number of rank violations.
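>>>>> >>>
>>>>> >>> Exactly minimizing violations is related to minimum feedback arc
>>>>> >>> set, which is NP-hard in general, so here is a cheap local-search
>>>>> >>> heuristic (my own sketch, not the paper's procedure):
>>>>> >>>
>>>>> >>> def min_violation_ranking(G, sweeps=20):
>>>>> >>>     # Convention: position 0 = highest status, and an edge (u, v),
>>>>> >>>     # "u claims v as a friend", is a violation when u sits above v;
>>>>> >>>     # flip the test if your edges point the other way. Initialize by
>>>>> >>>     # net incoming claims, then swap adjacent pairs while it helps.
>>>>> >>>     order = sorted(G.nodes(),
>>>>> >>>                    key=lambda v: G.out_degree(v) - G.in_degree(v))
>>>>> >>>     for _ in range(sweeps):
>>>>> >>>         improved = False
>>>>> >>>         for i in range(len(order) - 1):
>>>>> >>>             x, y = order[i], order[i + 1]
>>>>> >>>             if G.has_edge(x, y) and not G.has_edge(y, x):
>>>>> >>>                 order[i], order[i + 1] = y, x  # removes a violation
>>>>> >>>                 improved = True
>>>>> >>>         if not improved:
>>>>> >>>             break
>>>>> >>>     return dict((v, i) for i, v in enumerate(order))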
>>>>> >>>
>>>>> >>> 3. Write code to infer alpha and beta from a set of rankings. This
>>>>> >>> is basically just a least squares problem, if you take my advice and
>>>>> >>> use polynomials to represent (log alpha) and (log beta). You will
>>>>> >>> probably want to have some sort of penalty for having large
>>>>> >>> coefficients in your polynomial a_0 + a_1 x + ... + a_k x^k. A
>>>>> >>> sensible penalty would be a_1^2 + ... + a_k^2, i.e., a Gaussian
>>>>> >>> prior on the higher-order coefficients. Note that you probably want
>>>>> >>> to let a_0 be whatever it wants.
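>>>>> >>>
>>>>> >>> A sketch of the fit (fit_log_poly and its interface are my
>>>>> >>> invention; it assumes you have already reduced the sampled rankings
>>>>> >>> and observed edges to pairs of rank difference x and an empirical
>>>>> >>> estimate of log alpha(x) or log beta(x)):
>>>>> >>>
>>>>> >>> import numpy as np
>>>>> >>>
>>>>> >>> def fit_log_poly(rank_diffs, log_values, degree=2, penalty=1.0):
>>>>> >>>     # Ridge-penalized least squares for log alpha (or log beta) as
>>>>> >>>     # a polynomial a_0 + a_1 x + ... + a_k x^k in the rank
>>>>> >>>     # difference, leaving a_0 unpenalized, as suggested above.
>>>>> >>>     x = np.asarray(rank_diffs, dtype=float)
>>>>> >>>     y = np.asarray(log_values, dtype=float)
>>>>> >>>     X = np.column_stack([x ** k for k in range(degree + 1)])
>>>>> >>>     R = penalty * np.eye(degree + 1)
>>>>> >>>     R[0, 0] = 0.0
>>>>> >>>     return np.linalg.solve(X.T.dot(X) + R, X.T.dot(y))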
>>>>> >>>
>>>>> >>> 4. Write code to sample from the set of all rankings, assuming that
>>>>> >>> alpha and beta are the correct distributions. This is the thing that
>>>>> >>> can probably be done quickly if you do some clever algorithms.
>>>>> >>> Basically, you generate a ranking of the first k nodes, then you
>>>>> >>> extend it to a ranking of the first (k+1) nodes by inserting node
>>>>> >>> #(k+1) somewhere in between two other nodes in the ranking.
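>>>>> >>>
>>>>> >>> The insertion idea in code (a naive sketch: insertion_logweight is
>>>>> >>> a hypothetical hook that scores placing v at position p given the
>>>>> >>> partial ranking, and the clever-algorithms part would be computing
>>>>> >>> those scores incrementally instead of from scratch):
>>>>> >>>
>>>>> >>> import math
>>>>> >>> import random
>>>>> >>>
>>>>> >>> def sample_ranking(nodes, insertion_logweight, rng=random):
>>>>> >>>     # Build the ranking one node at a time; each new node's position
>>>>> >>>     # is drawn with probability proportional to exp(logweight).
>>>>> >>>     order = []
>>>>> >>>     for v in nodes:
>>>>> >>>         positions = range(len(order) + 1)
>>>>> >>>         logw = [insertion_logweight(order, v, p) for p in positions]
>>>>> >>>         m = max(logw)
>>>>> >>>         weights = [math.exp(lw - m) for lw in logw]  # stabilized
>>>>> >>>         r = rng.random() * sum(weights)
>>>>> >>>         pos, acc = len(order), 0.0
>>>>> >>>         for p, w in zip(positions, weights):
>>>>> >>>             acc += w
>>>>> >>>             if r <= acc:
>>>>> >>>                 pos = p
>>>>> >>>                 break
>>>>> >>>         order.insert(pos, v)
>>>>> >>>     return order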
>>>>> >>>
>>>>> >>> 5. Write code that does EM iterations. This is just going back and
>>>>> >>> forth between steps 3 and 4. In step 4, generate a bunch of random
>>>>> >>> rankings, and then use all of them when you go back to step 3. Save
>>>>> >>> all of the rankings, like Mike says.
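>>>>> >>>
>>>>> >>> Schematic glue for the loop (init_params, make_logweight, and refit
>>>>> >>> are placeholders: starting alpha/beta, building the insertion scorer
>>>>> >>> from the current parameters, and the least-squares step from step 3):
>>>>> >>>
>>>>> >>> def run_em(G_train, nodes, iters=8, samples_per_iter=50):
>>>>> >>>     params = init_params()
>>>>> >>>     history = []  # keep every sampled ranking, like Mike says
>>>>> >>>     for _ in range(iters):
>>>>> >>>         scorer = make_logweight(G_train, params)
>>>>> >>>         rankings = [sample_ranking(nodes, scorer)
>>>>> >>>                     for _ in range(samples_per_iter)]
>>>>> >>>         history.append(rankings)
>>>>> >>>         params = refit(G_train, rankings)
>>>>> >>>     return params, history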
>>>>> >>>
>>>>> >>> 6. Write code that evaluates your proposed ranking by looking at the
>>>>> >>> invisible test edges. See how the quality of the ranking changes over
>>>>> >>> each EM iteration.
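>>>>> >>>
>>>>> >>> And the evaluation (count_violations and evaluate are my names;
>>>>> >>> history is the per-iteration list of samples from the EM sketch
>>>>> >>> above, with the same rank convention as in step 2):
>>>>> >>>
>>>>> >>> def count_violations(rank, test_edges):
>>>>> >>>     # rank: node -> position, 0 = highest status; an edge (u, v) is
>>>>> >>>     # a violation when the claimer u sits above v.
>>>>> >>>     return sum(1 for u, v in test_edges if rank[u] < rank[v])
>>>>> >>>
>>>>> >>> def evaluate(history, test_edges):
>>>>> >>>     # Average held-out violations for each EM iteration's samples.
>>>>> >>>     for it, rankings in enumerate(history):
>>>>> >>>         ranks = [dict((v, i) for i, v in enumerate(r))
>>>>> >>>                  for r in rankings]
>>>>> >>>         avg = sum(count_violations(rk, test_edges)
>>>>> >>>                   for rk in ranks) / float(len(ranks))
>>>>> >>>         print("iteration %d: %.1f held-out violations" % (it, avg))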
>>>>> >>>
>>>>> >>> On Sun, Oct 7, 2012 at 12:16 PM, Michael Bishop
>>>>> >>> <michaelbish@gmail.com> wrote:
>>>>> >>>> So with all due respect to the author, it's annoying that the only
>>>>> >>>> ranking output the program saves is the final ranking. What I've
>>>>> >>>> started doing is telling it to run with 0 iterations, which outputs
>>>>> >>>> the minimum violations ranking. Then I change the output filename
>>>>> >>>> and run it again with a flag that allows you to specify a seed.
>>>>> >>>>
>>>>> >>>> This way I'll be able to compare the minimum violations rankings to
>>>>> >>>> the expectation maximization algorithm, which typically takes 4-8
>>>>> >>>> iterations to converge. But given the computational requirements,
>>>>> >>>> it would seem to make sense to default to saving every single
>>>>> >>>> ranking along the way, in case you ever want to look at it later.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Sun, Oct 7, 2012 at 10:33 AM, Eric Purdy <epurdy@uchicago.edu>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> > Let me also raise a substantive issue. Within groups there is
>>>>> >>>>> > more agreement about what is status-worthy than between groups.
>>>>> >>>>> > While I think the idea that there exists a status ranking across
>>>>> >>>>> > all the students in a school has enough truth to it to merit the
>>>>> >>>>> > work being done on it, this approach is also obscuring the large
>>>>> >>>>> > differences in what earns you status in different contexts. A
>>>>> >>>>> > focus on the causes and consequences of status differences
>>>>> >>>>> > (status as defined in this paper) inevitably ignores the
>>>>> >>>>> > importance of groups. Of course, there are ways of trying to
>>>>> >>>>> > address the existence of groups that complement the work done
>>>>> >>>>> > here, and maybe it's not even as hard as I'm making it out to
>>>>> >>>>> > be, but you always have to draw the line somewhere, and at that
>>>>> >>>>> > point you start waving your hands. I guess I'm just saying we
>>>>> >>>>> > have to do our best to be honest with ourselves and our audience
>>>>> >>>>> > about the many things that might be important but we don't know
>>>>> >>>>> > for sure because we can't fit them in our model.
>>>>> >>>>>
>>>>> >>>>> You can model people playing different status games at the same
>>>>> >>>>> time. Like, say for simplicity that there are two games in high
>>>>> >>>>> school: being smart and being cool. Then we could try to figure
>>>>> >>>>> out which game each player is playing (or how much they're playing
>>>>> >>>>> both) based on that. It would make the overfitting problems much
>>>>> >>>>> worse, but it's not super-hard to formulate such a model.
>>>>> >>>>>
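>>>>> >>>>> For concreteness, one way to write the two-game version down
>>>>> >>>>> (entirely my formulation, not from the paper): give each student u
>>>>> >>>>> a weight w[u] in [0, 1] for how much they play the "smart" game
>>>>> >>>>> versus the "cool" game, keep a separate ranking per game, and
>>>>> >>>>> score each claimed friendship as a mixture:
>>>>> >>>>>
>>>>> >>>>> import math
>>>>> >>>>>
>>>>> >>>>> def edge_logprob(u, v, w, ranks, alpha):
>>>>> >>>>>     # ranks['smart'], ranks['cool']: node -> position per game;
>>>>> >>>>>     # alpha(i, j): probability of a claim from rank i to rank j.
>>>>> >>>>>     p_smart = alpha(ranks['smart'][u], ranks['smart'][v])
>>>>> >>>>>     p_cool = alpha(ranks['cool'][u], ranks['cool'][v])
>>>>> >>>>>     return math.log(w[u] * p_smart + (1 - w[u]) * p_cool)
>>>>> >>>>>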
>>>>> >>>>> Also, you could probably get some sort of independent check of
>>>>> >>>>> such models by giving people surveys that try to determine what
>>>>> >>>>> strategy people think they are following. Like, "how cool do you
>>>>> >>>>> think you are, relative to your friends?", "how do you decide
>>>>> >>>>> whether you want to be friends with a new acquaintance?", etc.
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> -Eric
>>>>>
>>>>> _______________________________________________
>>>>> Math mailing list
>>>>> Math@moomers.org
>>>>> http://mailman.moomers.org/mailman/listinfo/math



--
-Eric