
Noah Smith on NLP at SoCS

Continuing with the SoCS workshop, the afternoon session is a tutorial from Noah Smith of CMU on using NLP for socio-computational work.

Framing NLP as a series of choices about models, algorithms, data collection, and cleaning, rather than a set of “right answers” (the right answers are context-dependent), feels like a nice, fruitful way to discuss using NLP for socio-computational work. We’ll start with document classification, since that’s where Noah went.

Much of the story around NLP for document classification is thoughtful annotation/labeling of your data for the categories/attributes of interest: you want good justifications, theories, and research questions leading you to categories appropriate for your texts and goals. And once you’ve created that dataset, share it; people love useful datasets and might help you with the work.
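
To make the annotation-quality point concrete, here’s a minimal sketch of measuring agreement between two coders with Cohen’s kappa via scikit-learn; the label lists are made-up stand-ins, not data from the tutorial.

```python
# A minimal sketch of checking inter-coder agreement with Cohen's kappa.
# The two label lists are hypothetical stand-ins for two coders' judgments
# on the same six documents.
from sklearn.metrics import cohen_kappa_score

coder_a = ["news", "opinion", "news", "news", "opinion", "satire"]
coder_b = ["news", "opinion", "news", "opinion", "opinion", "satire"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0 = chance-level agreement
```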

Likewise, thinking carefully about how to transform texts into features (word counts, stemming, bigrams/trigrams, word categories à la Linguistic Inquiry and Word Count, or LIWC) is important, and requires a thoughtful balance of intuition and justification.
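
Here’s a minimal sketch of that feature-construction step using scikit-learn’s CountVectorizer; the two-document corpus is a made-up stand-in, and every knob (n-gram range, stop words, stemming or not) is exactly the kind of judgment call Noah is talking about.

```python
# A sketch of turning raw text into count-based features; the tiny corpus
# is invented, and the n-gram and stop-word choices are modeling decisions
# you should be able to justify, not defaults to accept blindly.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The committee approved the bill.",
    "The bill died in committee.",
]

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)  # documents-by-features count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```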

Question: what are, for NLP folks, for CSCW folks, for social science folks, the “right” or “good” ways to justify choices of category schemes, labeling, feature construction, etc.?

One answer, around the choice of ML algorithm, is to say “SVM performs a little better, but I need to be able to talk about probabilities, so I’ll trade off a bit of performance for interpretability”. And especially if you choose a linear model from features to categories, the algorithms differ in relatively small (and predictable) ways; those differences may be more noise than signal, and not worth optimizing over compared to spending your time on stages that require more intuition/justification/art.
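
A rough sketch of that tradeoff on synthetic placeholder data: LinearSVC and logistic regression are both linear models and usually land close together in accuracy, but only the latter hands you probabilities to report.

```python
# A sketch of the performance/interpretability tradeoff on synthetic data;
# nothing here is from the tutorial, it just illustrates that two linear
# models score similarly while differing in what you can say about them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = LinearSVC(max_iter=5000).fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("LinearSVC accuracy:", svm.score(X_te, y_te))
print("LogReg accuracy:   ", logreg.score(X_te, y_te))
print("LogReg P(y=1):     ", logreg.predict_proba(X_te[:3])[:, 1])  # no SVM analogue
```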

Another answer is that you should pick methods that you can talk sensibly about and that your community gets: if you can’t explain it at all, or to your community, you are in a world of hurt. Practical issues around tool choice that fit your research pipeline and skills and budget also matter.

Performance is only one piece of the tradeoff, and you really want to measure it on held-out data. (One way to be very careful about this: make the files holding your test data unreadable until you’re ready for the final evaluation.) Likewise, you want to compare against a reasonable baseline; at the very least, against a “predict the most common class” zero-rule baseline. You might also think about the maximum performance you could expect, perhaps treating inter-coder agreement as an upper bound.
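
Here’s a minimal sketch of that evaluation discipline, again on synthetic placeholder data: hold out a test set and compare your model against scikit-learn’s zero-rule DummyClassifier.

```python
# A sketch of held-out evaluation against a zero-rule baseline, with
# synthetic placeholder data (roughly 70/30 class imbalance, so the
# baseline scores around 0.7 by always predicting the majority class).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.7], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("Zero-rule baseline:", baseline.score(X_te, y_te))
print("Actual model:      ", model.score(X_te, y_te))
```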

Performance came out bad: what went wrong? Not enough data, bad labels, meaningless features, home-grown algorithms and implementations, (perhaps) the wrong algorithm, not enough experience or insight into the domain, …

Tagging parts of speech or recognizing named entities is like ordering at a shared dinner. At dinner, the people around you influence what you decide to order; in NLP, the words nearby (and maybe some far away) influence the classification of the word you’re looking at. The Viterbi algorithm for sequence labeling is a useful way to account for some of these dependencies.
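
To give a flavor of the algorithm, here’s a toy Viterbi decoder over a hypothetical two-state (NOUN/VERB) model with hand-picked probabilities; real taggers learn these numbers from data, but the dynamic program is the same.

```python
# A toy Viterbi decoder for sequence labeling. The states, transition,
# and emission probabilities are invented for illustration; a real tagger
# would estimate them from annotated data.
import math

states = ["NOUN", "VERB"]
log_start = {"NOUN": math.log(0.6), "VERB": math.log(0.4)}
log_trans = {("NOUN", "NOUN"): math.log(0.3), ("NOUN", "VERB"): math.log(0.7),
             ("VERB", "NOUN"): math.log(0.8), ("VERB", "VERB"): math.log(0.2)}
log_emit = {("NOUN", "dogs"): math.log(0.9), ("VERB", "dogs"): math.log(0.1),
            ("NOUN", "bark"): math.log(0.2), ("VERB", "bark"): math.log(0.8)}

def viterbi(words):
    # best[t][s] = (log-prob of best path ending in state s at position t,
    #               backpointer to the previous state on that path)
    best = [{s: (log_start[s] + log_emit[(s, words[0])], None) for s in states}]
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            score, prev = max(
                (best[t - 1][p][0] + log_trans[(p, s)] + log_emit[(s, words[t])], p)
                for p in states)
            best[t][s] = (score, prev)
    # Trace back from the best final state to recover the label sequence.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))  # -> ['NOUN', 'VERB']
```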

Noah claims that this is going to be the next big idea from NLP to make it big in the world of computational social science, because lots of important text analysis tasks, including part-of-speech tagging, entity recognition, and translation, can be modeled pretty well as sequence labeling problems. Further, the algorithms for this kind of structured prediction are more or less generalizations of standard ML classification algorithms.

That said, there are a lot of really tough problems, especially around more semantic goals such as predicting framings, where there’s some in-progress work that is dangerous to rely on but perhaps fun to play with, including some of Noah’s own group’s work.

I’m not going to cover the clustering side, because I need a little break from typing and thinking, but hopefully this was useful/interesting for some folks.

Bonus note: can you predict whether a bill will make it out of committee, or whether a paper will get cited? Yes, at least better than chance, according to their paper.
