I *think* you can use a weird dynamic programming-style process to draw independent random samples from the set of all rankings in time O(EV) (where E is the number of edges and V the number of nodes), which should only be a second or two if written well. Since the samples are independent, you would not need very many to get a good estimate of the things you need a good estimate for. It should be possible to get the running time down to something more reasonable, like five minutes or an hour.
I'm feeling too tired/lazy to work out all the details right now, but it would be a good challenge/exercise for people who want to work on their algorithms skillz. You have to use amortized analysis, I think.
The trick that I think will work is that, if you have a ranking of the first k nodes, you can generate a ranking of the first k+1 nodes by inserting the (k+1)st node between any two nodes of the previous ranking. (Or, it can also go at the beginning or the end, so there are k+1 places it can go.) So, if you can generate a legitimate random sample for a ranking of the first k nodes, you can extend it randomly, and get a legitimate ranking of the first (k+1) nodes. You have to be careful while doing it, though, because any edge that goes from a node before the insertion to a node after the insertion will now be between nodes that are one more rank apart than before.
So basically, you just want to make sure that you can do this updating of the sample in such a way that you satisfy equation (1) from the paper at every step.
If anyone would like to work on this problem, but needs more hints, let me know.
On Thu, Oct 4, 2012 at 3:07 PM, Michael Bishop michaelbish@gmail.com wrote:
Eric, I'm very grateful for the offer. I'm also a bit uncomfortable asking you to do that. I also don't know the benefits, and might not be able to use it without help.
On Thu, Oct 4, 2012 at 2:01 PM, Eric Purdy epurdy@uchicago.edu wrote:
You might be able to just skip the MCMC part and do exact marginalization. Do you want me to try to code that up? I have an out-of-date version of the code sitting around somewhere.
On Thu, Oct 4, 2012 at 1:50 PM, Michael Bishop michaelbish@gmail.com wrote:
Yeah, its the MCMC part... still running for 16 hours now on a year old computer with 16gb of ram. Once again this is just one, albeit the largest, of 84 networks in my dataset. I'm told the easiest thing that would help is to take advantage of my second core but I don't know how to do that.
Here's the output: $ ./highschool_rankings -i paj_files_r/comm93.scc.el -iter30 -v3 First step will be a linearly scored minimum violations ranking. Ranks initialized via simulated annealing. The score from the annealing process was -59692 Starting EM iteration 1 of a maximum of 30. Log-Likelihood at the end of the iteration = 20387.6 0.242066 1.46033e-05 0.000997235 0.0618695 5.06643e-05 -0.0331409 0.0261 432 0.0251279 -0.0177707 -0.0117149 Starting EM iteration 2 of a maximum of 30. Log-Likelihood at the end of the iteration = 19637.5 0.271515 1.45319e-05 0.000757182 0.0797456 4.72377e-05 0.0309019 -0.0246 118 -0.0258381 0.0155623 0.0114179 Starting EM iteration 3 of a maximum of 30. Log-Likelihood at the end of the iteration = 19357.8 0.281222 1.43437e-05 0.000693565 0.0887446 4.46714e-05 -0.0349585 0.0142 105 0.0235122 -0.0143165 -0.00574516 Starting EM iteration 4 of a maximum of 30. Log-Likelihood at the end of the iteration = 19278.5 0.281864 1.42948e-05 0.000692791 0.0931104 4.31624e-05 -0.0341695 0.0147 996 0.0244729 -0.0133326 -0.00491241 Starting EM iteration 5 of a maximum of 30. Log-Likelihood at the end of the iteration = 19219.5 0.288413 1.35299e-05 0.000712436 0.098817 4.00959e-05 -0.0339019 0.01445 34 0.0246769 -0.0129563 -0.00465449 Starting EM iteration 6 of a maximum of 30.
On Thu, Oct 4, 2012 at 1:38 PM, Eric Purdy epurdy@uchicago.edu wrote:
Do you know which part of the code is being slow? If it's the MCMC part (and I'm guessing it is?), I might be able to give advice on speeding that up.
On Thu, Oct 4, 2012 at 11:31 AM, Michael Bishop michaelbish@gmail.com wrote:
Thanks Eric. So, I do have the code the authors used, and it will be made public eventually, but they haven't given me permission to make it public, and I'm a reticent about sharing it over the list. If you're interested in it, let me know with an email off list, and we can discuss how you might get involved. Maybe at some point in the not-too-distant future I can just share it on the list. The code should run on any suitably cleaned up directed network data, but it has only been run on data from the Add Health study. Its very processor intensive... Last night I started running it on one of the largest of 84 networks (844 vertices and approx 4000 edges) and its still running.
On Thu, Oct 4, 2012 at 9:59 AM, Eric Purdy epurdy@uchicago.edu wrote:
First, some general advice about reading technical papers! Read the introduction, then the discussion/conclusion, then the experimental results, and THEN the technical details (in this case, the section entitled "Inference of rank from network structure"). This is out of order, obviously, but it's a good triage strategy: you read things in increasing level of difficulty, so that you always know whether you care enough to read the next, more difficult part. It's also good because you have more context to interpret the difficult parts if you've read all the easier parts of the paper.
I've put a list of wikipedia articles at the bottom of this email, under the heading "Study Guide". They are a hopefully comprehensible explanation of the math needed to understand the most technical parts of the paper. But, if you don't feel like wading through the most technical parts, you don't have to read the study guide articles in order to participate in the discussion. (Also, I'm omitting some background for technical parts that I don't know much about, like ANOVA and statistical tests of model fit. If someone feels like adding anything to the study guide, go for it.)
== Paper ==
PDF link: http://arxiv.org/pdf/1205.6822v1.pdf Arxiv page: http://arxiv.org/abs/1205.6822
== Abstract ==
In empirical studies of friendship networks participants are typically asked, in interviews or questionnaires, to identify some or all of their close friends, resulting in a directed network in which friendships can, and often do, run in only one direction between a pair of individuals. Here we analyze a large collection of such networks representing friendships among students at US high and junior-high schools and show that the pattern of unreciprocated friendships is far from random. In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one. We present a maximum-likelihood method for deducing such rankings from observed network data and conjecture that the rankings produced reflect a measure of social status. We note in particular that reciprocated and unreciprocated friendships obey different statistics, suggesting different formation processes, and that rankings are correlated with other characteristics of the participants that are traditionally associated with status, such as age and overall popularity as measured by total number of friends.
== Code ==
Mike Bishop has some C++ code that (I think?) the authors of the paper are releasing publicly. Mike, can you send that out? It would probably be a good base to start from if we want to run simulations.
== Study guide ==
These are just to be helpful, don't feel like you have to read any of them to participate. And feel free to suggest additions, or say that one of these articles wasn't very helpful.
Adjacency matrix: (http://en.wikipedia.org/wiki/Adjacency_matrix) This is useful because it gives you a dictionary between graph theory and linear algebra. It only comes up a little bit, but it's often used without any explanation.
Elementary Bayesian stuff: (http://en.wikipedia.org/wiki/Bayesian_statistics) This is a very high-level article that gives links to more in-depth articles.
Less elementary Bayesian stuff: (http://en.wikipedia.org/wiki/Bayesian_inference) This is the main piece of background for the technical section of this paper. If you really want to know this stuff, Bishop's "Pattern Recognition and Machine Learning" is the best textbook I know, although it's not accessible without a lot of mathematical background or a lot of free time.
Bayesian games: (http://en.wikipedia.org/wiki/Bayesian_game) I've never seen this before, but it seems like the appropriate math to use when thinking about social science stuff with Bayesian methods.
EM:
(http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) The workhorse of mathematically rigorous Bayesian statistics. It's a flaky piece of shit, but it's pretty much all we have.
MCMC: (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) The workhorse of mathematically rigorous frequentist statistics.
Gibbs sampling: (http://en.wikipedia.org/wiki/Gibbs_sampling) This is the most mature mathematics that specifies how emergent properties emerge from local interactions. It was first created to model thermodynamics, which is basically all about emergent properties of molecule interactions, like temperature, pressure, etc.
Ising model: (http://en.wikipedia.org/wiki/Ising_model) This is the most straightforward example of a Gibbs distribution, it's what you should think about when you try to understand them.
On Wed, Oct 3, 2012 at 10:37 AM, Eric Purdy epurdy@uchicago.edu wrote: > I am starting a new listhost, math@moomers.org. I am hoping that > it > will function much like readings, but that we can have long and > intricate technical discussions without annoying anyone. > > Some proposed ground rules: > > - There are no stupid questions, and we're interested in > everyone's > opinion. In particular, you are encouraged to subscribe to > math@moomers.org even if you do not have much mathematical > training. > If you're interested in watching math being done, that's all that > matters. > > - There will be no status games, and no arguments from authority. > Nobody should sign up to math@moomers.org because they want to > impress > anyone, and nobody can cite their professional status or > mathematical > training to make someone else shut up. > > - If any publishable research is done on math@moomers.org, we will > decide by consensus whose names should be on any papers that > result. > > The first proposed paper to read is "Friendship networks and > social > status", by Ball and Newman. It is available online at > http://arxiv.org/abs/1205.6822 . I am interested in it because it > gives a nice model of social stratification as an emergent > property > of > social networks. I think it would be cool to try to extend it, and > I > think it should be possible to give a scientific account of the > phenomenon of discursive privilege. > > Here is the abstract: > > In empirical studies of friendship networks participants are > typically > asked, in interviews or questionnaires, to identify some or all of > their close friends, resulting in a directed network in which > friendships can, and often do, run in only one direction between a > pair of individuals. Here we analyze a large collection of such > networks representing friendships among students at US high and > junior-high schools and show that the pattern of unreciprocated > friendships is far from random. In every network, without > exception, > we find that there exists a ranking of participants, from low to > high, > such that almost all unreciprocated friendships consist of a > lower-ranked individual claiming friendship with a higher-ranked > one. > We present a maximum-likelihood method for deducing such rankings > from > observed network data and conjecture that the rankings produced > reflect a measure of social status. We note in particular that > reciprocated and unreciprocated friendships obey different > statistics, > suggesting different formation processes, and that rankings are > correlated with other characteristics of the participants that are > traditionally associated with status, such as age and overall > popularity as measured by total number of friends. > > -- > -Eric
-- -Eric
Math mailing list Math@moomers.org http://mailman.moomers.org/mailman/listinfo/math
-- -Eric
-- -Eric