So back when this thread was active, I mentioned that I was concerned that Ball & Newman's attempt to measure status was overfitting the data. Eric replied:
The only way to really know whether or not you're doing it is to have some data held out for testing/cross-validation. For example, you could erase a random 10% of the edges, do their procedure on the remaining 90%, and then see how many rank violations you get in the invisible 10% of edges.
Say cross-validation shows that they are overfitting. How hard is it to take the next step and change the algorithm so it isn't overfitting any more?
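For concreteness, here is a minimal sketch of that holdout check in Python with networkx (the library Eric suggests further down the thread). Everything here, including the convention that a lower rank index means higher status, is an illustrative assumption, not code from the paper:

    import random
    import networkx as nx

    def split_edges(G, test_frac=0.1, seed=0):
        # Independently label each edge "test" with probability
        # test_frac; return the training graph and held-out test edges.
        rng = random.Random(seed)
        test = {e for e in G.edges() if rng.random() < test_frac}
        train = nx.DiGraph()
        train.add_nodes_from(G.nodes())
        train.add_edges_from(e for e in G.edges() if e not in test)
        return train, test

    def holdout_violations(ranking, test_edges):
        # ranking[u] is u's position, 0 = highest status.  Treat a
        # held-out edge u -> v as a violation if u outranks v, i.e. the
        # friendship claim points down the status ordering (one
        # possible convention).
        return sum(1 for u, v in test_edges if ranking[u] < ranking[v])

Running the procedure on the training graph and counting violations among the held-out edges, before and after EM, would give a direct answer to the question above.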
On Thu, Oct 11, 2012 at 5:45 PM, Eric Purdy epurdy@uchicago.edu wrote:
Updated to add more details.
On Thu, Oct 11, 2012 at 12:29 PM, Eric Purdy epurdy@uchicago.edu wrote:
Is anyone interested in working on the sampling algorithm I have been writing about? If not, I may finish it myself, but I don't think I'll have time for a week or so. If people are interested but have questions, you can email me off-list if you want.
On Wed, Oct 10, 2012 at 10:28 PM, Michael Bishop michaelbish@gmail.com wrote:
Thanks for the link to the Kaggle competition, Kevin. I'll have some fun looking at that. Of course the approach Ball and Newman are taking is only one of many. Check out http://statnet.org/ for an R package which can do some fancy things... I think the heavy lifting is implemented in C.
On Wed, Oct 10, 2012 at 10:12 PM, Kevin Wilson khwilson@gmail.com wrote:
Just to clarify, the phenomenon I meant is very well known: the directedness of many kinds of social graphs. This particular example of friendships among school children and its social science implications are rather nice.
Kevin
On Wed, Oct 10, 2012 at 10:36 PM, Kevin Wilson khwilson@gmail.com wrote:
Hi all,
I'm Kevin. I met Eric when he came to visit Trevor at Knewton. Trevor invited me along to this list.
Anyway, the discussion is interesting, and I would point out that the phenomenon observed by this paper is very well-known. See, for instance, this Kaggle competition and the ensuing discussion: http://www.kaggle.com/c/FacebookRecruiting
One interesting question beyond this paper, and in the direction of where the discussion has been going, would be to ask the same questions in larger and larger social networks. Some example questions:
- Most edges run in the direction of status, but how many triangles are there? I.e., suppose VLS has very low status, LS has low status, and HS has high status. A triangle would consist of VLS claiming friendship with LS, LS claiming friendship with HS, and VLS claiming friendship with HS.
- One may expand from triangles to longer chains (where you ask about n-length chains from VLS to Lady Gaga next to a direct edge from VLS to Lady Gaga). Can you distinguish types of social networks by the length of these chains?
- What happens when you merge two friendship networks? Do relative statuses remain the same?

Each of these would be a paper, but you'd probably have to vastly improve the code. I think data sets exist to answer all three.
Kevin
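For the first question, a rough networkx sketch, assuming a status ranking is already in hand (larger value = higher status); the "status triangle" definition below is just one reading of the VLS/LS/HS example:

    import networkx as nx

    def status_triangles(G, status):
        # Count directed triangles u -> v, v -> w, u -> w in which
        # every friendship claim points up the status ordering,
        # matching the VLS -> LS -> HS pattern, with the direct
        # VLS -> HS edge closing the triangle.
        count = 0
        for u, v in G.edges():
            if status[u] >= status[v]:
                continue
            for w in G.successors(v):
                if status[v] < status[w] and G.has_edge(u, w):
                    count += 1
        return count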
On Wed, Oct 10, 2012 at 1:02 PM, Eric Purdy epurdy@uchicago.edu wrote:
Here is a writeup of some notes related to the tricky algorithms question I was trying to pose. Hopefully it will be helpful to anyone who is thinking about reimplementing the code from the paper.
On Sun, Oct 7, 2012 at 2:27 PM, Eric Purdy epurdy@uchicago.edu wrote:
> GAH, left out a minus sign:
>
> beta = e^[ - (j-i)^2 ]
>
> On Sun, Oct 7, 2012 at 2:26 PM, Eric Purdy epurdy@uchicago.edu wrote:
>> Or you can also skip step 2, if you initialize alpha and beta to be
>> something simple and reasonable, like
>>
>> (unreciprocated probability of edge from #i to #j)
>> alpha = e^[ j-i ]
>>
>> (reciprocated probability of edge between #i and #j)
>> beta = e^[ (j-i)^2 ]
>>
>> (I might have alpha and beta backwards from how they are in the
>> paper?)
>>
>> Before I was saying that alpha and beta should be modeled as
>> polynomials, but it probably makes more sense to model (log alpha)
>> and (log beta) as polynomials, as the equations above do. That
>> makes it more like logistic regression, which I think is supposed
>> to work well in the social sciences?
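To make that concrete, here is a small numpy sketch of those initializations, plus a ridge-penalized polynomial fit for log alpha and log beta in the spirit of step 3 of the sketch quoted below; the function names and the degree are illustrative assumptions:

    import numpy as np

    def init_log_alpha(i, j):
        # log alpha(i, j) = j - i: unreciprocated edge from #i to #j.
        return j - i

    def init_log_beta(i, j):
        # log beta(i, j) = -(j - i)^2, with the minus sign from the
        # correction above.
        return -(j - i) ** 2

    def fit_log_poly(x, y, degree=3, penalty=1.0):
        # Least-squares fit of a_0 + a_1 x + ... + a_k x^k to log-rate
        # values y, penalizing a_1^2 + ... + a_k^2 but leaving the
        # constant term a_0 free.
        X = np.vander(np.asarray(x, float), degree + 1, increasing=True)
        P = penalty * np.eye(degree + 1)
        P[0, 0] = 0.0  # let a_0 be whatever it wants
        return np.linalg.solve(X.T @ X + P, X.T @ np.asarray(y, float))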
>> On Sun, Oct 7, 2012 at 12:35 PM, Eric Purdy epurdy@uchicago.edu
>> wrote:
>>> So, I'm still a bit too busy to just reimplement their code, but
>>> here's a sketch of roughly how I would do it. Maybe someone else
>>> will feel like writing this code? It sounds like we could probably
>>> use it to write a publishable paper, if that's a motivation.
>>>
>>> The easiest would be to do this in python with networkx.
>>>
>>> 1. Write code to separate the edges into training and test sets.
>>> Basically, independently label each edge "test" with probability
>>> 10% and "training" with probability 90%. Then make a new graph
>>> that doesn't have any of the test edges. This is the graph that
>>> you will run EM on.
>>>
>>> 2. Get a ranking that has the minimum number of rank violations.
>>>
>>> 3. Write code to infer alpha and beta from a set of rankings. This
>>> is basically just a least squares problem, if you take my advice
>>> and use polynomials to represent alpha and beta. You will probably
>>> want to have some sort of penalty for having large coefficients in
>>> your polynomial. A common way is to put a prior on the polynomial
>>> a_0 + a_1 x + ... + a_k x^k. A sensible prior would be
>>> a_1^2 + ... + a_k^2. Note that you probably want to let a_0 be
>>> whatever it wants.
>>>
>>> 4. Write code to sample from the set of all rankings, assuming
>>> that alpha and beta are the correct distributions. This is the
>>> thing that can probably be done quickly if you do some clever
>>> algorithms. Basically, you generate a ranking of the first k
>>> nodes, then you extend it to a ranking of the first (k+1) nodes by
>>> inserting node #(k+1) somewhere in between two other nodes in the
>>> ranking.
>>>
>>> 5. Write code that does EM iterations. This is just going back and
>>> forth between steps 3 and 4. In step 4, generate a bunch of random
>>> rankings, and then use all of them when you go back to step 3.
>>> Save all of the rankings, like Mike says.
>>>
>>> 6. Write code that evaluates your proposed ranking by looking at
>>> the invisible test edges. See how the quality of the ranking
>>> changes over each EM iteration.
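A skeleton of steps 4 and 5 in Python, under loud assumptions: ranking_logscore, fit_alpha_beta, and make_logscore are placeholder callbacks standing in for the model pieces from steps 2-3, and the sampler naively rescores every candidate position from scratch, which is exactly the cost the "clever algorithms" remark is about avoiding:

    import math
    import random

    def sample_ranking(nodes, ranking_logscore, rng=random):
        # Step 4: grow a random ranking one node at a time.  Node k+1
        # is inserted at a position drawn with probability proportional
        # to exp(log-score of the resulting partial ranking).
        # ranking_logscore is an assumed callback: e.g. the
        # log-likelihood of the observed edges among the ranked nodes
        # under the current alpha/beta.
        ranking = []
        for node in nodes:
            logws = []
            for pos in range(len(ranking) + 1):
                trial = ranking[:pos] + [node] + ranking[pos:]
                logws.append(ranking_logscore(trial))
            m = max(logws)
            ws = [math.exp(w - m) for w in logws]
            r = rng.random() * sum(ws)
            for pos, w in enumerate(ws):
                r -= w
                if r <= 0:
                    break
            ranking = ranking[:pos] + [node] + ranking[pos:]
        return ranking

    def em(G, init_ranking, fit_alpha_beta, make_logscore,
           n_samples=20, n_iters=5):
        # Step 5: alternate between fitting alpha/beta to the current
        # pool of rankings (step 3) and sampling fresh rankings under
        # the fitted model (step 4).  Every ranking is saved, as Mike
        # suggests.
        rankings = [list(init_ranking)]
        history = [list(init_ranking)]
        for _ in range(n_iters):
            alpha, beta = fit_alpha_beta(G, rankings)       # step 3
            logscore = make_logscore(G, alpha, beta)
            rankings = [sample_ranking(list(G.nodes()), logscore)
                        for _ in range(n_samples)]          # step 4
            history.extend(rankings)
        return rankings, history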
>>> On Sun, Oct 7, 2012 at 12:16 PM, Michael Bishop
>>> michaelbish@gmail.com wrote:
>>>> So with all due respect to the author, it's annoying that the
>>>> only ranking output the program saves is the final ranking. What
>>>> I've started doing is telling it to run with 0 iterations, which
>>>> outputs the minimum violations ranking. Then I change the output
>>>> filename, and run it again with a flag which allows you to
>>>> specify a seed.
>>>>
>>>> This way I'll be able to compare the minimum violations rankings
>>>> to the expectation maximization algorithm, which typically takes
>>>> 4-8 iterations to converge. But given the computational
>>>> requirements, it would seem to make sense to default to saving
>>>> every single ranking along the way in case you ever want to look
>>>> at it later.
>>>>
>>>> On Sun, Oct 7, 2012 at 10:33 AM, Eric Purdy <epurdy@uchicago.edu>
>>>> wrote:
>>>>> > Let me also raise a substantive issue. Within groups there is
>>>>> > more agreement about what is status-worthy than between
>>>>> > groups. While I think the idea that there exists a status
>>>>> > ranking across all the students in a school has enough truth
>>>>> > to it to merit the work being done on it, this approach is
>>>>> > also obscuring the large differences in what earns you status
>>>>> > in different contexts. A focus on the causes and consequences
>>>>> > of status differences (status as defined in this paper)
>>>>> > inevitably ignores the importance of groups. Of course, there
>>>>> > are ways of trying to address the existence of groups that
>>>>> > complement the work done here, and maybe it's not even as
>>>>> > hard as I'm making it out to be, but you always have to draw
>>>>> > the line somewhere, and at that point you start waving your
>>>>> > hands. I guess I'm just saying we have to do our best to be
>>>>> > honest with ourselves and our audience about the many things
>>>>> > that might be important but we don't know for sure because we
>>>>> > can't fit them in our model.
>>>>>
>>>>> You can model people playing different status games at the same
>>>>> time. Like, say for simplicity that there are two games in high
>>>>> school, being smart and being cool. Then we could try to figure
>>>>> out which game each player is playing (or how much they're
>>>>> playing both) based on that. It would make the overfitting
>>>>> problems much worse, but it's not super-hard to formulate such
>>>>> a model.
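As a toy version of "figure out which game each player is playing": given two candidate rankings, score each student by which ranking better explains their outgoing friendship claims. A real version would use the alpha/beta edge likelihoods; the counting rule here is just a stand-in:

    def game_fractions(G, rank_smart, rank_cool):
        # For each node, count outgoing edges that point up the "smart"
        # ranking vs. up the "cool" ranking (rank value: higher =
        # higher status).  Returns node -> fraction attributable to
        # "smart"; 0.5 means no evidence either way.
        frac = {}
        for u in G.nodes():
            up_smart = sum(1 for v in G.successors(u)
                           if rank_smart[v] > rank_smart[u])
            up_cool = sum(1 for v in G.successors(u)
                          if rank_cool[v] > rank_cool[u])
            total = up_smart + up_cool
            frac[u] = up_smart / total if total else 0.5
        return frac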
>>>>> Also, you could probably get some sort of independent check of
>>>>> such models by giving people surveys that try to determine what
>>>>> strategy people think they are following. Like, "how cool do
>>>>> you think you are, relative to your friends", "how do you
>>>>> decide whether you want to be friends with a new acquaintance",
>>>>> etc.
>>>>>
>>>>> --
>>>>> -Eric
--
-Eric