How I review papers

Pernille Bjørn is spearheading a mentoring program for new reviewers as part of CSCW 2015, which I think is awesome. I am mentoring a couple of students, and I figured as long as I was talking to them about how I approach reviews I would share it with others as well [0].

The first question is how close to the deadline to do the review [1]. A lot of people do them near the deadline, partly because academics are somewhat deadline-driven. Also, in processes where there is some kind of discussion among reviewers or a response/rebuttal/revision from the authors, the less time that’s passed between your review and the subsequent action, the more context you will have.

However, I tend to do them as early as practicable given my schedule. I don’t like having outstanding tasks, and although PC members know that many reviews are last minute, it is still nervous-making [2]. I also don’t mind taking a second look at the paper weeks or even months later, in case my take on the paper has changed in the context of new things I’ve learned. And, for folks who are getting review assistance from advisors or mentors, getting the reviews done earlier is better so those people have time to give feedback [3].

I still print papers and make handwritten notes in the margins whenever I can, because I find I pay a little more attention to a printed page than to a screen [4]. If I’m reading on a screen I’ll take notes in a text editor and save them. I usually read in some comfy place like a coffeeshop (pick your own “this is nice” place: your porch, the beach, a park, whatever) so that I start with good vibes about the paper and also reward myself a little bit for doing reviews [5]. Try to do your reading and reviewing when you’re in a neutral or better mood; it’s not so fair to the authors if you’re just trying to squeeze it in, or you’re miffed about something else.

What I typically do these days is read the paper and take lots of notes on it, wherever I see something that’s smart or interesting or confusing or questionable. Cool ideas, confusing definitions, (un)clear explanations of methods, strong justifications for both design and experimental choices, notation that’s useful or not, good-bad-missing related work, figures that are readable or not, helpful or not, clever turns of phrase, typos, strong and weak arguments, etc. Anything I notice, I note.

The notes are helpful for several reasons. First, actively writing notes helps me engage with the paper more deeply [6]. Second, those notes will be handy later on, when papers are discussed and authors submit responses, rebuttals, or revisions. Third, they can themselves be of benefit to authors (see below).

Fourth, taking notes allows me to let it sit for a couple of days before writing the review. Not too long, or else even with the notes I’ll start forgetting some of what was going on [7]. But taking a day or two lets initial reactions and impressions fade away — sometimes you have an immediate visceral reaction either good or bad, and that’s not so fair to the authors either.

Letting it sit also lets me start to sort out what the main contributions and problems are. I make _lots_ of notes and a review that’s just a brain dump of them is not very helpful for the program committee or other reviewers. So, after a couple of days, I look back over my notes and the paper, and write the review. People have a lot of different styles; my own style usually looks something like this [8]:

----

Summary:

2 sentences or so about the key points I’m thinking about when I’m making my recommendation. This helps the program committee, other reviewers, and authors get a feel for where things are going right away.

Main review:

1 paragraph description of paper’s goals and intended contributions. Here, I’m summarizing and not reviewing, so that other reviewers and authors feel comfortable that I’ve gotten the main points [9]. Sometimes you really will just not get it, and in those cases your review should be weighed appropriately.

1-2 paragraphs laying out the good things. This is important [10]. In a paper that’s rough, it’s still useful to talk about what’s done well: authors can use that info to know where they’re on the right track, plus it is good for morale to not just get a steady dose of criticism. In a medium or good paper, it’s important to say just what the good things are so that PC members can talk about them at the meeting and weigh them in decision-making. Sometimes you see reviews that have a high score but mostly list problems; these are confusing to both PC members and authors.

1 short paragraph listing out the important problems. Smaller problems go in the “Other thoughts” section below; the ones that weigh most heavily in my evaluation are the ones that go here.

Then, one paragraph for each problem to talk about it: what and where the issue is, and why I think it’s a problem. If I have suggestions on how to address it, I’ll also give those [12]. I try to be pretty sensitive about how I critique; I refer to “the paper” rather than “the authors”, and I watch for (and reword) anything that feels mean-spirited or could be taken the wrong way.

A concluding paragraph that expands on the summary: how I weighed the good and bad and what my recommendation is for the program committee. Sometimes I’ll suggest other venues that it might fit and/or audiences I think would appreciate it, if I don’t think it’ll get in [13]. I usually wish people luck going forward and try to be as positive as I can for both good and less good papers.

Other thoughts:

Here I go through my notes, page by page, and list anything that I think the authors would benefit from knowing about how a reader processed their paper. I don’t transcribe every note but I do a lot of them; I went to the effort and so I’d rather squeeze all the benefit out of it that I can.

Scores:

Different venues ask for different kinds of ratings; for CSCW, there are multiple scales. The expertise scale runs from 4 (expert) to 1 (no knowledge). I try to be honest about expertise; if I am strong with both domain and methods, I’m “expert”; if I’m okay-to-strong with both, I’m “knowledgeable”; I try not to review papers where I feel weak in either domain or methods, but I will put a “passing knowledge” if I have to, and I try hard to turn down reviews where I’d have to say “no knowledge” unless the editor/PC member/program officer is explicitly asking me to review as an outsider.

The evaluation scales change a bit from year to year. This year, the first round scale is a five-pointer about acceptability to move on to the revise and resubmit [14]: definitely, probably, maybe, probably not, not. The way I would think about it is: given that authors will have 3 weeks or so to revise the paper and respond to review comments, will that revision have a good chance of getting an “accept” rating from me in a month? And, I’d go from there.

----

Again, not everyone writes reviews this way, but I find that it works pretty well for me, and for the most part these kinds of reviews appear to be helpful to PC members and authors, based on the feedback I’ve gotten. Hopefully it’s useful to you, and I (and other new reviewers) would be happy to hear your own stories and opinions about the process.

Just for fun, below the footnotes are the notes I took on three papers for class last semester. These are on published final versions of papers, so there are fewer negative things than would probably show in an average review. Further, I was noting for class discussions, not reviews, so the level of detail is lower than I’d do if I were reviewing (this is more what would show up in the “other thoughts” section). I don’t want to share actual reviews of actual papers in a review state since that feels a little less clean, but hopefully these will give a good taste.

[0] Note that many other people have also written and thought about reviewing in general. Jennifer Raff has a nice set of thoughts and links.

[1] Well, the first question is whether to do the review at all. Will you have time (I guesstimate 4 hrs/review on average for all the bits)? If no, say no. It’s okay. Are you comfortable reviewing this paper in terms of topic, methods, expertise? If no, say no.

[2] I was papers chair for WikiSym 2012 and although almost everything came in on time, the emphasis was on “on”.

[3] Doing your read early will also help you figure out whether you really know enough about the related work to review the paper; when I was a student, I was pretty scared to review stuff outside my wheelhouse, and rightly so.

[4] Yes, I’m old. Plus, there’s some evidence that handwritten notes are better than typed.

[5] There’s a fair bit of literature about the value of positive affect. For example, Environmentally Induced Positive Affect: Its Impact on Self‐Efficacy, Task Performance, Negotiation, and Conflict.

[6] See the second half of [4].

[7] See the first half of [4].

[8] Yes, I realize this means that some people will learn that there’s a higher-than-normal chance that a given review is from me despite the shield of anonymity. I’m fine with that.

[9] Save things like “the authors wanted, but failed, to show X” for the critiquey bits (and probably say it nicer than that even there).

[10] Especially in CS/HCI, we’re known to “eat our own” in reviewing contexts [11]; program officers at NSF have told me that the average panel review in CISE is about a full grade lower than the average in other places like physics. My physicist friends would say that’s because they’re smarter, but…

[11] For instance, at CHI 2012, I was a PC member on a subcommittee. 800 reviews, total. 8 reviews gave a score of 5. That is, only 1 percent of reviewers would strongly argue that _any_ paper they read should be in the conference.

[12] Done heavy-handedly, this could come off as “I wish you’d written a different paper on a topic I like more or using a method I like more”. So I try to give suggestions that are in the context of the paper’s own goals and methods, unless I have strong reasons to believe the goals and methods are broken.

[13] There’s a version of this that’s “this isn’t really an [insert conference X] paper” that’s sometimes used to recommend rejecting a paper. I tend to be broader rather than narrower in what I’m willing to accept, but there are cases where the right audience won’t see the paper if it’s published in conference X. In those cases it’s not clear whether accepting the paper is actually good for the authors.

[14] I love revise and resubmit because it gives papers in the “flawed but interesting” category a chance to fix themselves; in a process without an R&R these are pretty hard to deal with.

Sharma, A., & Cosley, D. (2013, May). Do social explanations work?: studying and modeling the effects of social explanations in recommender systems. In Proceedings of the 22nd international conference on World Wide Web (pp. 1133-1144). International World Wide Web Conferences Steering Committee.
http://www.cs.cornell.edu/~danco/research/papers/sharma-explanations-www2013.pdf

I don’t know that we ever really pressed on the general framework, unfortunately.

It would have been nice to give explicit examples of social proof and interpersonal influence right up front; the “friends liked a restaurant” example is somewhere in between.

p. 2

This whole discussion of informative, likelihood, and consumption makes assumptions about the goals being served; in particular, it’s pretty user-focused. A retailer, especially for one-off customers (as in a tourism context), might be happy enough to make one sale and move on.

Should probably have made explicit the parallels between likelihood/consumption and Bilgic and Mooney’s promotion and satisfaction.

A reasonable job of setting up the question of measuring persuasiveness from the general work (though I wish we’d explicitly compared that to Bilgic and Mooney’s setup). Also unclear that laying out all the dimensions from Tintarev really helped the argument here.

Models based on _which_ theories?

p. 3

Okay, I like the attempt to generalize across different explanation structures/info sources and to connect them to theories related to influence and decision-making.

Wish it had said “and so systems might show friends with similar tastes as well as with high tie strength” as two separate categories (though, in the CHI 06/HP tech report study, ‘friends’ beat ‘similar’ from what I remember).

Okay, mentioning that there might be different goals here.

“Reduce”, not “minimize”. You could imagine a version where you chose completely random artists and lied about which friends liked them… though that has other side effects as an experimental design (suppose, for instance, you chose an artist that a friend actually hated).

p. 4

Yeah, they kind of goofed by missing “similar friend”.

_Very_ loosely inspired. The Gilbert and Karahalios paper is fun.

Seeing all those little empty bins for ‘5’ ratings that start showing up in later figures was a little sad — I wish we’d have caught that people would want to move the slider, and done something else.

We never actually use the surety ratings, I think.

Overall this felt like a pretty clean, competent description of what happened. I wish we’d had a better strategy for collecting more data from the good friend conditions, but…

The idea of identifying with the source of the explanation was interesting to see (and ties back in some ways to Herlocker; one of the most liked explanations was a generic “MovieLens accurately predicts for you %N percent of the time” — in some ways, getting them to identify with the system itself).

p. 5

We kind of got away with not explaining how we did the coding here… probably an artifact of submitting to WWW where the HCI track is relatively new and there aren’t as many qualitative/social science researchers in the reviewing pool compared to CHI.

It’s a little inconsistent that we say that a person might be differently influenced by different explanations, but then go on to cluster people across all explanation types.

p. 6

Should have reminded readers in the caption, with something like “the anomalous 5 (where the slider started)”.

Is 5 really a “neutral” rating on the scale we used? Did we have explicit labels for ratings?

I keep seeing a typo every page or so, and it makes me sad. “continous”

Constraining parameters in theoretically meaningful ways is a good thing to do. For instance, if a parameter shouldn’t change between conditions, the models should probably be constrained so it can’t change (it’s kind of cheating to let the models fit better by changing those kinds of params).
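(To make that concrete, here is a tiny, made-up Python sketch of the “constrain parameters that shouldn’t vary” idea. It is not the paper’s model, just a toy two-condition regression fit once with a shared slope and once with per-condition slopes; the freer model always fits at least as well, which is exactly why letting parameters float can feel like cheating.)

```python
# Toy illustration (not the paper's model): fit a two-condition linear model
# with a shared slope vs. per-condition slopes, and compare the fits.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(2, 50))          # two conditions, 50 points each
y = 1.5 * x + np.array([[0.0], [2.0]]) + rng.normal(0, 1, size=x.shape)

def sse_shared(params):
    # Theory says the slope shouldn't differ by condition, so constrain it.
    slope, b0, b1 = params
    pred = slope * x + np.array([[b0], [b1]])
    return np.sum((y - pred) ** 2)

def sse_free(params):
    # Letting the slope vary by condition fits at least as well,
    # but that extra fit is "cheating" if theory says it shouldn't vary.
    s0, s1, b0, b1 = params
    pred = np.array([[s0], [s1]]) * x + np.array([[b0], [b1]])
    return np.sum((y - pred) ** 2)

fit_shared = minimize(sse_shared, x0=[1.0, 0.0, 0.0])
fit_free = minimize(sse_free, x0=[1.0, 1.0, 0.0, 0.0])
print(fit_shared.fun, fit_free.fun)
```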

p. 7

We talk about parameters for “the user”, but then go on to study these in aggregate. Probably “okay” but a little sloppy.

We really should have removed 5 entirely and scaled down ratings above 5. It probably wouldn’t change things drastically, but it would be mathematically cleaner as well as closer to average behavior.

So, for instance, maybe we should have constrained the discernment parameter to be the same across all models.

Not sure I believe the bit about the receptiveness and variability scores together.

p. 8

There’s an alternate explanation for the clustering, which is that some people are just “ratings tightwads” who are uninterested in giving high ratings to something that they haven’t seen.

I’m only lukewarm about the idea of personalizing explanation type, mostly because I think it’ll take scads of data, more than most systems will get about most users.
The point that likelihood and consumption are different I do like (and that we ack Bilgic and Mooney in finding this as well); and I like the idea of trying to model them separately to support different goals even better (though that too has the “you need data” problem) — we come back to this in the discussion pretty effectively (I think).

p. 9

The discussion starts with a very pithy but useful high-level recap of the findings, which is usually a good thing; you’ve been going through details for a while and so it’s good to zoom back up to the bigger picture.

The flow isn’t quite right; the first and third section headers in the discussion are actually quite similar and so probably would be better moved together.

p. 10

Jamming all the stuff about privacy into the “acceptability of social explanation” part is up and down for me. It’s better than the gratuitous nod to privacy that a lot of papers have, but it’s not as good as having it woven throughout the discussion to give context (and, it’s not connected to theories around impression management, identity work, etc., in a way it probably should be). Some parallels to this year’s 6010 class, where we did a full week on implications of applying computation to personal data last week (and talked about it sometimes as we went along).

I really like that we clearly lay out limitations.

=====

Walter S. Lasecki, Jaime Teevan, and Ece Kamar. 2014. Information extraction and manipulation threats in crowd-powered systems. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW ’14). ACM, New York, NY, USA, 248-256. DOI=10.1145/2531602.2531733 http://doi.acm.org/10.1145/2531602.2531733

Unclear that you’d want Turkers doing surveillance camera monitoring, but okay.

I like that they test it in multiple contexts.

The intro kind of begs the question here, that the problem is labeling quickly.

The idea of algorithms that use confidence to make decisions (e.g., about classification, recommendation, when to get extra info) is a good general idea, assuming your algos generate reasonable confidence estimates. There was an AI paper a while ago about a crossword puzzle solving system that had a bunch of independent learners who reported confidence, and then the system that combined them used those confidences as weights, biasing them once it saw which learners tended to over- or under-estimate. Proverb: The Probabilistic Cruciverbalist. It was a fun paper.
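(For flavor, here’s a tiny made-up sketch of that “combine learners by confidence, then recalibrate” idea. It is nothing like Proverb’s actual machinery, just the general shape: each expert reports an answer plus a confidence, and the combiner discounts experts it has learned are overconfident. The expert names, answers, and numbers below are all invented.)

```python
# Toy sketch: combine independent experts' answers by calibrated confidence.
# Not Proverb's real algorithm; expert names, answers, and weights are made up.
from collections import defaultdict

# (expert_name, proposed_answer, self-reported confidence) for one clue
votes = [("clue_db", "ORCA", 0.5),
         ("wordlist", "ORCA", 0.3),
         ("anagram", "OKRA", 0.95)]

# Calibration weights learned from past performance (hypothetical numbers);
# values below 1.0 mean the expert tends to overestimate its confidence.
calibration = {"clue_db": 1.0, "wordlist": 0.8, "anagram": 0.4}

scores = defaultdict(float)
for expert, answer, conf in votes:
    scores[answer] += conf * calibration.get(expert, 1.0)

# With raw confidences OKRA would win (0.95 vs 0.8); discounting the
# overconfident anagram expert flips the decision to ORCA.
best = max(scores, key=scores.get)
print(best, dict(scores))
```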

Okay, some concrete privacy steps, which is good.

I’m less convinced by the categorical argument that fully automated activity recognition systems are “more private” than semi-automated ones. Scaling up surveillance and having labels attached to you without human judgment are potential risks on the automated side.

p. 2
Blurring faces and silhouettes is better than nothing, but there’s a lot of “side leakage” similar to the “what about other people in your pics” kind that Jean pointed out last week: details of the room, stuff lying about, etc., might all give away clues.

I hope this paper is careful about the kinds of activities it claims the system works for, and the accuracy level. I’m happy if they talk about a set of known activities in a restricted domain based on what I’ve seen so far, but the intro is not being very careful about these things.

I usually like clear contribution statements but it feels redundant with the earlier discussion this time.

p. 3

Overall a fairly readable story of the related work and the paper’s doing a better-than-CHI-average job of talking about how it fits in.

p. 4

It’s less clear to me what a system might do with a completely novel activity label — I guess forward it, along with the video, to someone who’s in charge of the care home, etc. (?)

p. 5

I wonder if a version tuned to recognize group activities that didn’t carve the streams up into individuals might be useful/interesting.

One thing that this paper exemplifies is the idea of computers and humans working together to solve tasks in a way that neither can do alone. This isn’t novel to the paper — David McDonald led out on an NSF grant program called SoCS (Social-Computational Systems) where that was the general goal — but this is a reasonable example of that kind of system.

Oh, okay, I was wondering just how the on-demand force was recruited, and apparently it’s by paying a little bit and giving a little fun while you wait. (Worth, maybe, relating to the Quinn and Bederson motivations bit.)

You could imagine using a system like this to get activities at varying levels of detail and having people label relationships between the parts to build kinds of “task models” for different tasks (cf. Iqbal and Bailey’s mid-2000s work on interruption) — I guess they talk a little bit about this on p. 6.

I was confused by the description of the combining inputs part of the algorithm.

p. 6

The references to public squares, etc., add just a touch of Big Brother.

For some reason I thought the system would also use some of the video features in its learning, but at least according to the “Training the learning model” section it’s about activity labels and sensed tags. I guess thinking about ‘sequences of objects’ as a way to identify tags is reasonable, and maybe you won’t need the RFID tags as computer vision improves, but it felt like useful info was being left on the table.

Okay, the paper’s explicitly aware of the side information leak problem and has concrete thoughts about it. This is better than a nominal nod to privacy that often shows up in papers.

p. 7

I’m not sure the evaluation of crowd versus single is that compelling to me. I guess it shows that redundancy is useful here, and that the crowd can compare with an expert, but it felt a little hollow.

p. 8

I’m not sure what it means to get 85% correct on average. Not really enough detail about some of these mini-experiments here.

Heh, the whole “5 is a magic number” for user testing thing shows up again here.

I’m guessing if the expert were allowed to watch the video multiple times they too could have more detailed labels. The expert thing feels kind of weak to me. And, on p. 9 they say the expert generated labels offline — maybe they did get to review it. Really not explained enough for confidence in interpreting this.

p. 9

The idea that showing suggestions helped teach people what the desirable kinds of answers were is interesting (parallels Sukumaran et al. 2011 CHI paper on doing this in discussion forums and Solomon and Wash 2012 CSCW paper on templates in wikis). In some ways the ESP game does this as well, but more implicitly.

The intent recognition thing is kind of mysterious.

p. 10

This paper needs a limitations section. No subtlety about broadening the results beyond these domains. Cool idea, but.

===========

Hecht, B., Hong, L., Suh, B., & Chi, E. H. (2011, May). Tweets from Justin Bieber’s heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 237-246). ACM. http://extweb-prod.parc.com/content/attachments/tweets-from-justin.pdf

Okay, a paper that looks at actual practice around what we might assume is a pretty boring field and finds that it’s not that boring after all. It’s a little sad that traditional tools get fooled by sarcasm, though.

It’s always fun to read about anachronistic services (Friendster, Buzz, etc.).
I wonder if Facebook behavior is considerably different than Twitter behavior either because of social norms or because the location field lets you (maybe) connect to others in the same location.

This does a nice job of motivating the problem and the assumption here that location data is frooty and tooty.

On the social norms front, it would be cool to see to what extent appropriation of the location field for non-locations follows social networks (i.e., if my friends are from JB’s heart, will I come from Arnold’s bicep?)

p. 2

(And, on the location question, I haven’t read Priedhorsky et al.’s 2014 paper on location inference, but I think they use signal from a user’s friends’ behavior to infer that user’s location — which I guess the Backstrom et al. paper cited here does as well.)

Okay, they also do a reasonable job of explicitly motivating the “can we predict location” question, connecting it to the usefulness/privacy question that often comes up around trace data.

Nice explicit contribution statement. They don’t really talk about point 3 in the intro (I guess this is the “fooling” part of the abstract), but I’m guessing we’ll get there.

Another mostly-empty advance organizer, but then an explicit discussion of what the paper _doesn’t_ do, which is kind of cool (though maybe a little out of place in the intro — feels more like a place to launch future work from).

The RW is not as exciting here, so far reading more like an annotated bibliography again. For instance, I wonder if the Barkhuus et al. paper would give fertile ground for speculating about the “whys” explicitly set aside earlier; saying that “our context is the Twittersphere” is not helpful.

At the end they try to talk about how it relates to the RW, but not very deeply.

p. 3

The fact that half the data can be classified as English is interesting — though I’m not sure I’m surprised it’s that little, or that much. (Which is part of why it feels interesting to me.)

Not sure I buy the sampling biases rationale for not studying geolocated info (after all, there are biases in who fills out profile info, too). This feels like one of those “reviewer chow” things where a reviewer was like “what about X” and the authors had to jam it in somewhere. (The “not studying reasons” para at the end of the intro had a little bit of that feel as well.)

10,000 entries… whoa.

p. 4

Hating on pie charts, although the info is interesting. It does make me want to say more like “of people who enter something, about 80% of it is good” — it’s a little more nuanced.
The “insert clever phrase here” bit suggests that social practice and social proof really do affect these appropriation behaviors — and it is also cool that some themes emerge in them.

It’s tempting to connect the idea of “self report” from the other paper to the idea of “disclosing location” here. The other paper had self-disclosure as part of an experiment, which probably increases compliance — so really, part of the answer about whether data is biased is thinking about how it was collected. Not a surprising thing to say, I guess, but a really nice, clear example coming out here.

So, no info about coding agreement and resolution for the table 1 story. I’m also assuming the percents are of the 16% non-geographic population. Most of the mass isn’t here: I wonder what the “long tail” of this looks like, or if there are other categories (in-jokes, gibberish), etc. that would come out with more analysis.

The identity story is a cool one to come out, and intuitively believable. It would be cool, as with the other paper, to connect to literature that talks about the ways people think about place.

p. 5

So, the description of lat long tags as profiles is a little weird — it means we’re not really looking at 10K users. We’re looking at 9K. That’s still a lot of users, but this is something I’d probably have divulged earlier in the game.

I wonder if one of the big implications on geocoding is that geocoders should report some confidence info along the way so that programs (or people) can decide whether to believe them — oh, okay, they go on to talk about geoparsers as a way of filtering out the junk early.

p. 6

I’m not sure how to think about the implication to replace lat/long with a semantically meaningful location for profile location fields. My main reaction was like “why is Blackberry automatically adding precise location info to a profile field in the first place?”

The idea of machine-readable location + a custom vernacular label is interesting, but it’s more machinery — and it’s not clear that most people who had nonstandard info actually wanted to be locatable. The reverse implication, to force folks to select from a pre-set list of places if you want to reduce appropriation, seems fine if that’s your goal. There’s something in between that combines both approaches, where the user picks a point, a geocoder suggests possible place names, and they choose from those.

All of these are a “if your goal is X” set of suggestions, though, rather than a “and this is The Right Thing To Do” kind of suggestion, and it’s good that the paper explicitly ties recommendations to design goals. Sometimes papers make their design implications sound like the voice of God, prescribing appropriate behavior in a context-free manner.

Unclear how many locations people really need; multi-location use seemed pretty rare, and it’s unclear that you want to design for the rare case (especially if it might affect the common one).

It’s often good to look for secondary analysis you can do on data you collect, and study 2 is high-level connected to study 1. Further, I tend to like rich papers that weave stories together well. But here they feel a little distant, and unless there’s some useful discussion connecting them together later I wonder if two separate papers aimed at the appropriate audiences would have increased the impact here.

p. 7

I’m not sure why CALGARI is an appropriate algorithm for picking distinguishing terms. It feels related to just choosing maximally distinguishing terms from our playing with Naive Bayes earlier, and I also wonder if measures of probability inequality (Gini, entropy, etc.) would have more info than just looking at the max probability. (Though, these are likely to be correlated.)
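(A quick made-up sketch of what I mean: given a term’s distribution over location classes, you can compare the max-probability view with entropy- and Gini-style measures of how concentrated that distribution is. The terms and counts below are invented, not from the paper.)

```python
# Hypothetical example: compare "max probability" with entropy and Gini
# as ways to score how location-distinguishing a term is. Made-up counts.
import math

def class_probs(counts):
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def max_prob(probs):
    return max(probs.values())

def entropy(probs):                       # lower = more concentrated
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def gini(probs):                          # lower = more concentrated
    return 1 - sum(p * p for p in probs.values())

# A term concentrated in one region vs. a generic term spread across regions.
for term, counts in [("cheesesteak", {"PHL": 45, "NYC": 3, "LA": 2}),
                     ("coffee", {"PHL": 18, "NYC": 17, "LA": 15})]:
    p = class_probs(counts)
    print(term, round(max_prob(p), 2), round(entropy(p), 2), round(gini(p), 2))
```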

Again, a pretty clear description of data cleaning and ML procedure, which was nice to see.

I’d probably call “RANDOM” either “PRIOR” or “PROPORTIONAL”, which feel more accurate (I’m assuming that UNIFORM also did random selection as it was picking its N users for each category).

p. 8

Also nice that they’re explaining more-or-less arbitrary-looking parameters (such as the selection of # of validation instances).

Note that they’re using validation a little differently than we did in class: their “validation” set is our “test” set, and they’re presumably building their Bayesian classifiers by splitting the training set into training and validation data.

So, what these classifiers are doing is looking at regional differences in language and referents (recall Paul talking about identifying native language by looking at linguistic style features). It looks like, based on p. 9’s table, that referring to local things is more the story here than differences in language style.

p. 9

Not sure about the claims about being able to deceive systems by using a wider portfolio of language… it really is an analog of how spammers would need to defeat Naive Bayes classifiers.

It doesn’t really go back and do the work to tie study 1 and study 2 together for me.