So back when this thread was active, I mentioned that I was concerned that Ball & Newman's attempt to measure status was overfitting the data. Eric replied:
The only way to really know whether or not you're doing it is to have
some data held out for testing/cross-validation. For example, you
could erase a random 10% of the edges, do their procedure on the
remaining 90%, and then see how many rank violations you get in the
invisible 10% of edges.
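A minimal sketch of that holdout check in Python. The edge-direction convention is an assumption on my part: the ranking lists nodes from lowest to highest status, and a directed edge (u, v) means u claims v as a friend, so an edge pointing down the ranking is a violation.

```python
import random

def split_edges(edges, test_frac=0.1, seed=0):
    """Independently label each directed edge 'test' with probability
    test_frac; everything else stays in the training set."""
    rng = random.Random(seed)
    train, test = [], []
    for edge in edges:
        (test if rng.random() < test_frac else train).append(edge)
    return train, test

def rank_violations(ranking, edges):
    """Count edges that point the 'wrong way' down a ranking: from a
    higher-ranked node u to a lower-ranked node v."""
    pos = {node: i for i, node in enumerate(ranking)}
    return sum(1 for u, v in edges if pos[u] > pos[v])
```

The same violation count can then be applied to the invisible 10% of edges after fitting on the visible 90%.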
Updated to add more details.
--
On Thu, Oct 11, 2012 at 12:29 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
> Is anyone interested in working on the sampling algorithm I have been
> writing about? If not, I may finish it myself, but I don't think I'll
> have time for a week or so. If people are interested but have
> questions, you can email me off-list if you want?
>
> On Wed, Oct 10, 2012 at 10:28 PM, Michael Bishop <michaelbish@gmail.com> wrote:
>> Thanks for the link to the Kaggle competition, Kevin. I'll have some fun
>> looking at that. Of course, the approach Ball and Newman are taking is
>> only one of many. Check out http://statnet.org/ for an R package which
>> can do some fancy things... I think the heavy lifting is implemented in
>> C.
>>
>>
>>
>> On Wed, Oct 10, 2012 at 10:12 PM, Kevin Wilson <khwilson@gmail.com> wrote:
>>>
>>> Just to clarify, the phenomenon I meant is very well known: the
>>> directedness of many kinds of social graphs. This particular example of
>>> friendships among school children, and the social science implications,
>>> are rather nice.
>>>
>>> Kevin
>>>
>>>
>>> On Wed, Oct 10, 2012 at 10:36 PM, Kevin Wilson <khwilson@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm Kevin. I met Eric when he came to visit Trevor at Knewton. Trevor
>>>> invited me along to this list.
>>>>
>>>> Anyway, the discussion is interesting, and I would point out that the
>>>> phenomenon observed by this paper is very well-known. See, for instance,
>>>> this Kaggle competition and the ensuing discussion:
>>>> http://www.kaggle.com/c/FacebookRecruiting
>>>>
>>>> One interesting question beyond this paper and in the direction of where
>>>> the discussion has been going would be to ask the same questions in larger
>>>> and larger social networks. Some example questions:
>>>>
>>>> Most edges run in the direction of status. But how many triangles are
>>>> there? I.e., suppose VLS has very low status, LS has low status, and
>>>> HS has high status. A triangle would consist of VLS claiming
>>>> friendship with LS, LS claiming friendship with HS, and VLS also
>>>> claiming friendship with HS directly. One may expand from triangles
>>>> to longer chains (where you ask about n-length chains from VLS to
>>>> Lady Gaga next to a direct edge from VLS to Lady Gaga).
>>>> Can you distinguish types of social networks by the length of these
>>>> chains?
>>>> What happens when you merge two friendship networks? Do relative
>>>> statuses remain the same?
>>>>
>>>> Each of these would be a paper, but you'd probably have to vastly improve
>>>> the code. I think data sets exist to answer all three.
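The first question can be made concrete with a small sketch. It again assumes edges point from the lower-status claimer to the higher-status target, and counts Kevin's triangles: u claims v, v claims w, and u also claims w directly.

```python
def transitive_triangles(edges):
    """Count 'status triangles': u -> v, v -> w, plus the shortcut
    u -> w, over a list of directed (u, v) edges."""
    edge_set = set(edges)
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for u, v in edges:
        for w in succ.get(v, ()):
            if (u, w) in edge_set:
                count += 1
    return count
```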
>>>>
>>>> Kevin
>>>>
>>>> On Wed, Oct 10, 2012 at 1:02 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
>>>>>
>>>>> Here is a writeup of some notes related to the tricky algorithms
>>>>> question I was trying to pose. Hopefully it will be helpful to anyone
>>>>> who is thinking about reimplementing the code from the paper.
>>>>>
>>>>> On Sun, Oct 7, 2012 at 2:27 PM, Eric Purdy <epurdy@uchicago.edu> wrote:
>>>>> > GAH, left out a minus sign:
>>>>> >
>>>>> > beta = e^[ - (j-i)^2 ]
>>>>> >
>>>>> > On Sun, Oct 7, 2012 at 2:26 PM, Eric Purdy <epurdy@uchicago.edu>
>>>>> > wrote:
>>>>> >> Or you can also skip step 2, if you initialize alpha and beta to be
>>>>> >> something simple and reasonable, like
>>>>> >>
>>>>> >> (unreciprocated probability of edge from #i to #j)
>>>>> >> alpha = e^[ j-i ]
>>>>> >>
>>>>> >> (reciprocated probability of edge between #i and #j)
>>>>> >> beta = e^[ (j-i)^2 ]
>>>>> >>
>>>>> >> (I might have alpha and beta backwards from how they are in the
>>>>> >> paper?)
>>>>> >>
>>>>> >> Before I was saying that alpha and beta should be modeled as
>>>>> >> polynomials, but it probably makes more sense to model (log alpha)
>>>>> >> and (log beta) as polynomials, as the equations above do. That
>>>>> >> makes it more like logistic regression, which I think is supposed
>>>>> >> to work well in the social sciences?
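A sketch of that initialization in numpy, using the sign-corrected beta from the follow-up message. Treating these as unnormalized weights indexed by rank positions i and j is my assumption, not notation from the paper.

```python
import numpy as np

def init_alpha_beta(n):
    """Initial interaction weights between rank positions i and j,
    following log alpha = j - i and log beta = -(j - i)**2."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    alpha = np.exp(j - i)          # unreciprocated edge from #i to #j
    beta = np.exp(-(j - i) ** 2)   # reciprocated edge between #i and #j
    return alpha, beta
```

Note that beta comes out symmetric in i and j, as a reciprocated-edge weight should be.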
>>>>> >>
>>>>> >> On Sun, Oct 7, 2012 at 12:35 PM, Eric Purdy <epurdy@uchicago.edu>
>>>>> >> wrote:
>>>>> >>> So, I'm still a bit too busy to just reimplement their code, but
>>>>> >>> here's a sketch of roughly how I would do it. Maybe someone else
>>>>> >>> will feel like writing this code? It sounds like we could probably
>>>>> >>> use it to write a publishable paper, if that's a motivation.
>>>>> >>>
>>>>> >>> The easiest would be to do this in python with networkx.
>>>>> >>>
>>>>> >>> 1. Write code to separate the edges into training and test sets.
>>>>> >>> Basically, independently label each edge "test" with probability 10%
>>>>> >>> and "training" with probability 90%. Then make a new graph that
>>>>> >>> doesn't have any of the test edges. This is the graph that you will
>>>>> >>> run EM on.
>>>>> >>>
>>>>> >>> 2. Get a ranking that has the minimum number of rank violations.
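Exactly minimizing rank violations is the minimum feedback arc set problem, which is NP-hard, so step 2 presumably needs a heuristic in practice. A cheap degree-based sketch (my stand-in, not the paper's method):

```python
def approx_min_violation_ranking(edges):
    """Heuristic low-violation ranking: order nodes by in-degree minus
    out-degree, so nodes that receive many friendship claims rank high.
    Only a starting point; the exact problem is NP-hard."""
    indeg, outdeg, nodes = {}, {}, set()
    for u, v in edges:
        nodes.update((u, v))
        outdeg[u] = outdeg.get(u, 0) + 1
        indeg[v] = indeg.get(v, 0) + 1
    return sorted(nodes, key=lambda x: indeg.get(x, 0) - outdeg.get(x, 0))
```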
>>>>> >>>
>>>>> >>> 3. Write code to infer alpha and beta from a set of rankings.
>>>>> >>> This is basically just a least squares problem, if you take my
>>>>> >>> advice and use polynomials to represent (log alpha) and (log
>>>>> >>> beta). You will probably want to have some sort of penalty for
>>>>> >>> having large coefficients in your polynomial. A common way is to
>>>>> >>> penalize the polynomial a_0 + a_1 x + ... + a_k x^k with something
>>>>> >>> proportional to a_1^2 + ... + a_k^2. Note that you probably want
>>>>> >>> to let a_0 be whatever it wants (i.e., leave the intercept
>>>>> >>> unpenalized).
>>>>> >>>
>>>>> >>> 4. Write code to sample from the set of all rankings, assuming that
>>>>> >>> alpha and beta are the correct distributions. This is the thing that
>>>>> >>> can probably be done quickly if you do some clever algorithms.
>>>>> >>> Basically, you generate a ranking of the first k nodes, then you
>>>>> >>> extend it to a ranking of the first (k+1) nodes by inserting node
>>>>> >>> #(k+1) somewhere in between two other nodes in the ranking.
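The insertion idea in step 4 might be sketched like this, with a toy scoring rule (exp of the number of edges pointing up the ranking) standing in for the real alpha/beta likelihood. Everything here is a placeholder, not the paper's model.

```python
import math
import random

def sample_ranking(nodes, edges, seed=0):
    """Grow a ranking one node at a time: each new node is inserted at
    a slot chosen with probability proportional to exp(#edges
    satisfied), where an edge (u, v) is 'satisfied' when u sits below
    v.  A toy stand-in for sampling rankings from the model."""
    rng = random.Random(seed)
    edge_set = set(edges)
    ranking = []
    for node in nodes:
        weights = []
        for slot in range(len(ranking) + 1):
            trial = ranking[:slot] + [node] + ranking[slot:]
            pos = {x: i for i, x in enumerate(trial)}
            satisfied = sum(1 for u, v in edge_set
                            if u in pos and v in pos and pos[u] < pos[v])
            weights.append(math.exp(satisfied))
        slot = rng.choices(range(len(ranking) + 1), weights=weights)[0]
        ranking.insert(slot, node)
    return ranking
```

A real implementation would replace the score with the model likelihood of the new node's edges given its tentative position.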
>>>>> >>>
>>>>> >>> 5. Write code that does EM iterations. This is just going back and
>>>>> >>> forth between steps 3 and 4. In step 4, generate a bunch of random
>>>>> >>> rankings, and then use all of them when you go back to step 3. Save
>>>>> >>> all of the rankings, like Mike says.
>>>>> >>>
>>>>> >>> 6. Write code that evaluates your proposed ranking by looking at
>>>>> >>> the invisible test edges. See how the quality of the ranking
>>>>> >>> changes over each EM iteration.
>>>>> >>>
>>>>> >>> On Sun, Oct 7, 2012 at 12:16 PM, Michael Bishop
>>>>> >>> <michaelbish@gmail.com> wrote:
>>>>> >>>> So with all due respect to the author, it's annoying that the
>>>>> >>>> only ranking output the program saves is the final ranking. What
>>>>> >>>> I've started doing is telling it to run with 0 iterations, which
>>>>> >>>> outputs the minimum violations ranking. Then I change the output
>>>>> >>>> filename and run it again with a flag which allows you to specify
>>>>> >>>> a seed.
>>>>> >>>>
>>>>> >>>> This way I'll be able to compare the minimum violations rankings
>>>>> >>>> to the expectation maximization algorithm, which typically takes
>>>>> >>>> 4-8 iterations to converge. But given the computational
>>>>> >>>> requirements, it would seem to make sense to save every single
>>>>> >>>> ranking along the way by default, in case you ever want to look
>>>>> >>>> at it later.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Sun, Oct 7, 2012 at 10:33 AM, Eric Purdy <epurdy@uchicago.edu>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> > Let me also raise a substantive issue. Within groups there is
>>>>> >>>>> > more agreement about what is status-worthy than between
>>>>> >>>>> > groups. While I think the idea that there exists a status
>>>>> >>>>> > ranking across all the students in a school has enough truth
>>>>> >>>>> > to it to merit the work being done on it, this approach also
>>>>> >>>>> > obscures the large differences in what earns you status in
>>>>> >>>>> > different contexts. A focus on the causes and consequences of
>>>>> >>>>> > status differences (status as defined in this paper)
>>>>> >>>>> > inevitably ignores the importance of groups. Of course, there
>>>>> >>>>> > are ways of trying to address the existence of groups that
>>>>> >>>>> > complement the work done here, and maybe it's not even as hard
>>>>> >>>>> > as I'm making it out to be, but you always have to draw the
>>>>> >>>>> > line somewhere, and at that point you start waving your hands.
>>>>> >>>>> > I guess I'm just saying we have to do our best to be honest
>>>>> >>>>> > with ourselves and our audience about the many things that
>>>>> >>>>> > might be important but we don't know for sure because we can't
>>>>> >>>>> > fit them in our model.
>>>>> >>>>>
>>>>> >>>>> You can model people playing different status games at the same
>>>>> >>>>> time. Like, say for simplicity that there are two games in high
>>>>> >>>>> school: being smart and being cool. Then we could try to figure
>>>>> >>>>> out which game each player is playing (or how much they're
>>>>> >>>>> playing both) based on that. It would make the overfitting
>>>>> >>>>> problems much worse, but it's not super-hard to formulate such a
>>>>> >>>>> model.
>>>>> >>>>>
>>>>> >>>>> Also, you could probably get some sort of independent check of
>>>>> >>>>> such models by giving people surveys that try to determine what
>>>>> >>>>> strategy people think they are following. Like, "how cool do you
>>>>> >>>>> think you are, relative to your friends?", "how do you decide
>>>>> >>>>> whether you want to be friends with a new acquaintance?", etc.
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> -Eric
>>>>> >>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> -Eric
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> -Eric
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > -Eric
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -Eric
>>>>>
>>>>> _______________________________________________
>>>>> Math mailing list
>>>>> Math@moomers.org
>>>>> http://mailman.moomers.org/mailman/listinfo/math
>>>>>
>>>>
>>>
>>
>
>
>
> --
> -Eric
-Eric