Sunday, April 21, 2013

International Conference on Computational Photography 2013 (ICCP 2013) Day 1 recap

Yesterday was the first day of ICCP 2013.  While the conference should have started Friday, it was postponed until Saturday due to the craziness in Boston.  Nevertheless, it was an excellent day of mingling with colleagues and listening to talks/posters.  Here are some noteworthy items from Saturday:


Marc Levoy (from Stanford and Google) gave a keynote about Google Glass and how it will change the game of photography and video collection.  Marc was one of three Googlers wearing Glass.  The other two were Sam Hasinoff (former MIT CSAILer) and Peyman Milanfar (from UCSC/Google).  I had the privilege of chatting with Prof. Milanfar during Saturday night's reception at the Harvard Faculty Club and got to share my personal views on what Glass means for Robotics researchers like myself.

Marc Levoy at ICCP 2013


During his presentation, Matthias Grundmann from Georgia Tech talked about his work on radiometric self-calibration of videos.  The implications of this work for visual object recognition from YouTube videos are fairly evident: why have your machine learning algorithm deal with appearance variations introduced by the imaging process if they can be removed?
Matthias Grundmann at ICCP 2013

Post-processing Approach for Radiometric Self-Calibration of Video. Matthias Grundmann (Georgia Tech), Chris McClanahan (Georgia Tech), Sing Bing Kang (Microsoft Research), Irfan Essa (Georgia Tech). ICCP 2013



Hany Farid from Dartmouth College presented an excellent keynote on Image Forensics.  Image manipulators beware!  His work is not going to make image forgery impossible, but it will take it out of the hands of amateurs.
Hany Farid at ICCP 2013



The best paper award was given to the following paper:
"3Deflicker from Motion" by Yohay Swirski (Technion), Yoav Schechner (Technion)

Good job Yohay and Yoav!


Finally, we (the MIT object detection hackers) will be setting up our own wearable computing platform, the HOGgles box, for the Demo session during lunch.  Carl Vondrick, Aditya Khosla, and I will also be there during the coffee breaks after lunch with the HOGgles demo.

Today should be as much fun as yesterday, and I will try to upload some videos of HOGgles in action later tonight.
--Tomasz

Friday, April 19, 2013

Can you pass the HOGgles test? Inverting and Visualizing Features for Object Detection

Despite more than a decade of incessant research by some of the world's top computer vision researchers, we still ask ourselves "Why is object detection such a difficult problem?"

Surely, better features, better learning algorithms, and better models of visual object categories will result in improved object detection performance.  But instead of waiting an indefinite time until the research world produces another Navneet Dalal (of HOG fame) or Pedro Felzenszwalb (of DPM fame), we (the vision researchers in Antonio Torralba's lab at MIT) felt the time was ripe to investigate object detection failures from an entirely new perspective.

When we (the researchers) look at images, the problem of object detection appears trivial; however, object detection algorithms don't typically analyze raw pixels, they analyze images in feature spaces!  The Histogram of Oriented Gradients feature (commonly known as HOG) is the de facto standard in object detection these days.  While looking at gradient distributions might make sense for machines, we felt that these features were incomprehensible to the (human) researchers who have to make sense of object detection failures.  Here is a motivating quote from Marcel Proust (a French novelist), which most accurately describes what we did:

“The real voyage of discovery consists not in seeking new landscapes but in having new eyes.” -- Marcel Proust



In short, we built new eyes.  These new "eyes" are a method for converting machine readable features into human readable RGB images.  We take a statistical machine learning approach to visualization -- we learn how to invert HOG using ideas from sparse coding and large-scale dictionary learning.  Let me briefly introduce the concept of HOGgles (i.e., HOG glasses).

Taken from Carl Vondrick's project abstract:
We present several methods to visualize the HOG feature space, a common descriptor for object detection. The tools in this paper allow humans to put on "HOG glasses" and see the visual world as a computer might see it.
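To make the phrase "gradient distributions" concrete, here is a minimal sketch of the histogram that HOG computes for a single cell.  It is a simplification under stated assumptions: the function name `cell_orientation_histogram`, the 9 unsigned orientation bins, and the toy 6x6 cell are illustrative choices, and a real HOG descriptor also interpolates between bins and normalizes over blocks of cells.

```python
import math

def cell_orientation_histogram(cell, num_bins=9):
    """Histogram of unsigned gradient orientations for one grayscale cell.

    `cell` is a list of rows of pixel intensities.  Border pixels are
    skipped so that central differences are always defined.
    """
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal gradient
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical gradient
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist

# Toy cell with a vertical edge: dark on the left, bright on the right.
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
hist = cell_orientation_histogram(vertical_edge)
```

All of the gradient energy of this toy cell lands in the first bin (horizontal gradients), and nothing about the original pixel values survives.  That loss of detail is exactly what makes the raw feature hard for a human to read.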

Here is an example of a short video (movie trailer for Terminator) which shows the manually engineered HOG visualization (commonly known as the HOG glyph), the original image, and our learned iHOG visualization.


We are presenting a real-time demo of this new and exciting line of work at the 2013 International Conference on Computational Photography (ICCP2013) which is being held at Harvard University this weekend (4/19/2013 - 4/21/2013).  If you want to try our sexy wearable platform and become a real-life object detector for a few minutes, then come check us out at this Sunday morning's demo session at ICCP2013 at Harvard University.



Also, if you thought TorralbaArt was cool, you must check out VondrickArt (a result of trying to predict color using the iHOG visualization framework)

Project-related Links:

Project website: http://mit.edu/vondrick/ihog/
Project code (MATLAB-based) on Github: https://github.com/CSAILVision/ihog
arXiv paper: http://mit.edu/vondrick/ihog/techreport.pdf

Authors' webpages:

Carl Vondrick (MIT PhD student): http://mit.edu/vondrick/
Aditya Khosla (MIT PhD student): http://people.csail.mit.edu/khosla/
Tomasz Malisiewicz (MIT Postdoctoral Fellow): http://people.csail.mit.edu/tomasz/
Antonio Torralba (MIT Professor): http://web.mit.edu/torralba/www/

We hope that with these new eyes, we (the vision community) will better understand the failures and successes of machine vision systems.  I, for one, welcome our new HOGgles-wearing overlords.

Tuesday, July 10, 2012

Machine Learning Doesn't Matter?



Bagpipes and the International Conference on Machine Learning (ICML) in Edinburgh
Two weeks ago, I attended the ICML 2012 Conference in Edinburgh, UK.  First of all, Edinburgh is a great place for a conference!  The scenery is marvelous, the weather is comfortable, and most notably, the sound of bagpipes adds an inimitable charm to the city.  I attended the conference because I was invited to give an applications talk during the invited talks session.  In case you’re wondering, I did not have a plenary session (a plenary session is a session attended by all conference members), which is reserved for titans such as Yann LeCun, David MacKay, and Andrew Ng.  My presentation was on the last day of ICML and was titled “Exemplar-SVMs for Visual Object Detection, Label Transfer and Image Retrieval,” during which I gave an overview of my ICCV 2011 paper on visual object detection as well as the SIGGRAPH Asia 2011 paper on cross-domain image retrieval.  As part of the invited talk, we submitted a two-page extended abstract which summarizes some key ideas behind the Exemplar-SVM project: you can check out the abstract as well as the presentation slides online.  I believe the talk was recorded, so I will post the video link once it becomes available.  It was a great opportunity to convey some of my ideas to a non-vision audience.  I think I got a handful of new people excited about single example SVMs (i.e., Exemplar-SVMs)!

Tomasz Malisiewicz, Abhinav Shrivastava, Abhinav Gupta, and Alexei A. Efros. Exemplar-SVMs for Visual Object Detection, Label Transfer and Image Retrieval. To be presented as an invited applications talk at ICML, 2012. PDF | Talk Slides


Getting Ready for Edinburgh with David Hume
To get ready for my first visit to Edinburgh (pronounced Ed-in-bur-ah which does not rhyme with Pittsburgh), I bought a Kindle Touch and proceeded to read David Hume’s An Enquiry Concerning Human Understanding.  David Hume is one of the great British Empiricists (together with John Locke and George Berkeley) who stood by the empiricist motto: impressions are the source of all ideas.  Empiricists can be contrasted to rationalists who appeal to reason as the source of knowledge.  [Of course, I am neither an empiricist nor a rationalist.  Such polarizing extremes are a thing of the past.  I am a pragmatist and my world-view combines elements from many different philosophies.]  I chose Hume’s treatise because he is the one whom Kant credits for awakening him from his dogmatic slumber.  I found Hume’s words rejuvenating and full of gedankenexperiments which show the limits of radical empiricism, and most notably, the book is free on the Kindle store!  In your attempts to build intelligent machines, maybe you will also find words of inspiration in the classics.  It was a great book to get into the Edinburgh mindset (although the ICML crowd is probably more familiar with a different University of Edinburgh figure, namely Reverend Bayes).

Impressions of ICML
I would first like to say that the ICML website is well-organized and serves as a great tool during the conference!  Good job ICML!  There is a great mobile version of the ICML website which is excellent for visiting on your iPhone when figuring out which talk to go to next.  The ICML website also provides a forum for discussing papers and every paper gets a presentation and a poster.  The discussion boards do not seem heavily utilized but it would be great to use a moderator-style system to have the actual after-presentation questions come from this forum.  I’m sure something like this will actually arise in the upcoming years.  ICML is much smaller than CVPR (compare ~700 attendees with ~2000 attendees) which makes for a much more intimate environment.  I was amazed by the number of people proving bounds and doing “theoretical” non-applied machine learning.  It's like some people really don't care about anything other than analysis.  However, this is not my style, and I personally prefer to build “real” systems and combine insights from disparate disciplines such as mathematics, cognitive science, philosophy, physics, and computer science.  There is a part of ICML and other Machine Learning conferences which I think of as nothing more than mathturbation.  I understand there's merit to doing analysis of this sort -- somebody’s gotta do it, but if you’re gonna do it, please at least try to understand the implications of the real-world problem your dataset and task are trying to address.

Machine Learning doesn’t Matter?
The highlight of the conference by far was Kiri Wagstaff’s plenary talk “Machine Learning that Matters.”  Kiri gave an enchanting 30 minute presentation regarding what is rotten in the state of Edinburgh (aka what is wrong with the style of machine learning conferences).  Her words were gentle, yet harsh, while simultaneously enlightening, yet morbid.  She showed us, machine learning researchers, just how useless much of machine learning research is today.  Let’s not forget that Machine Learning is one of the most revolutionary ideas of the modern computer science classroom.  Trying to get a PhD in Computer Science and avoiding Machine Learning is like avoiding Calculus while getting an undergraduate degree in Engineering.  There is nothing wrong with machine learning as a discipline, but there is something wrong with researchers making the field overly academic.  Making a discipline overly academic means creating a self-contained, overly-mathematical, self-citing, and jargon-filled discipline which doesn’t care about world-impact but only cares to propagate a small community’s citation count.  Note that many of these arguments also apply to the CVPR world. But do not take my words for granted, read Kiri’s treatise yourself.  Abstract below:


"Machine Learning that Matters" Abstract: Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society. From this perspective, there exist glaring limitations in the data sets we investigate, the metrics we employ for evaluation, and the degree to which results are communicated back to their originating domains. What changes are needed to how we conduct research to increase the impact that ML has? We present six Impact Challenges to explicitly focus the field’s energy and attention, and we discuss existing obstacles that must be addressed. We aim to inspire ongoing discussion and focus on ML that matters.

Kiri Wagstaff, "Machine Learning that Matters," ICML 2012.


If you have something to say in response to Kiri's treatise, check out her Machine Learning Impact Forum on http://mlimpact.com/.

Thursday, June 21, 2012

Predicting events in videos, before they happen. CVPR 2012 Best Paper

Intelligence is all about making inferences given observations, but somewhere in the history of Computer Vision, we (as a community) have put too much emphasis on classification tasks.  What many researchers in the field (unfortunately this includes myself) focus on is extracting semantic meaning from images, image collections, and videos.  Whether the output is a scene category label, an object identity and location, or an action category, the way we proceed is relatively straightforward:
  • Extract some measurements from the image (we call them "features", and SIFT and HOG are two very popular such features)
  • Feed those features into a machine learning algorithm which predicts the category these features belong to.  Some popular choices of algorithms are Neural Networks, SVMs, decision trees, boosted decision stumps, etc.
  • Evaluate our features on a standard dataset (such as Caltech-256, PASCAL VOC, ImageNet, LabelMe, etc)
  • Publish (or as is commonly known in academic circles: publish-or-perish)
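For the vision hackers, here is a minimal sketch of the first three steps of that pipeline on a toy problem.  Everything in it is an assumption made for illustration: the synthetic 4x4 "stripe" images, the two-number gradient-energy feature, and the perceptron standing in for the SVMs and neural networks mentioned above.

```python
def make_stripes(vertical, contrast, size=4):
    """Synthetic image: alternating dark/bright stripes of a given contrast."""
    return [[contrast if (x if vertical else y) % 2 else 0 for x in range(size)]
            for y in range(size)]

def extract_features(image):
    """Step 1: measurements, here horizontal and vertical gradient energy."""
    horizontal = sum(abs(row[x + 1] - row[x]) for row in image for x in range(len(row) - 1))
    vertical = sum(abs(image[y + 1][x] - image[y][x])
                   for y in range(len(image) - 1) for x in range(len(image[0])))
    return (horizontal, vertical)

def predict(weights, bias, features):
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(samples, epochs=10):
    """Step 2: feed features into a (very simple) learning algorithm."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = label - predict(weights, bias, features)
            weights = [w + error * f for w, f in zip(weights, features)]
            bias += error
    return weights, bias

def dataset(contrasts):
    # Label 1 = vertical stripes, label 0 = horizontal stripes.
    return [(extract_features(make_stripes(vertical, c)), int(vertical))
            for c in contrasts for vertical in (True, False)]

weights, bias = train_perceptron(dataset([1, 2, 3]))

# Step 3: evaluate on held-out data (here, unseen contrasts).
test_set = dataset([4, 5])
accuracy = sum(predict(weights, bias, f) == label for f, label in test_set) / len(test_set)
```

Notice that every step sees the complete image before a label comes out.  That assumption is the one the rest of this post questions.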
While only in the last 5 years has action recognition become popular, it still adheres to the generic machine vision pipeline.  But let's consider a scenario where adhering to this template can have disastrous consequences.  Let's ask ourselves the following question:

Q: Why did the robot cross the road?
Image courtesy of napkinville.com

A: The robot didn't cross the road -- he was obliterated by a car.  This is because in order to make decisions in the world you can't just wait until all observations have happened.  To build a robot that can cross the road, you need to be able to predict things before they happen! (Alternate answer: The robot died because he wasn't using Minh's early-event detection framework, the topic of today's blog post.)

This year's Best Student Paper winner at CVPR has given us a flavor of something more, something beyond the traditional action recognition pipeline, aka "early event detection."  Simply put, the goal is to detect an action before it completes.  Minh's research is rather exciting and opens up room for a new paradigm in recognition.  If we want intelligent machines roaming the world around us (and every CMU Robotics PhD student knows that this is really what vision is all about), then recognition after an action has happened will not enable our robots to do much beyond passive observation.  Prediction (and not classification) is the killer app of computer vision because classification assumes you are given the data and prediction assumes there is an intent to act on and interpret the future.


While Minh's work focused on simpler actions such as facial expression recognition, gesture recognition, and human activity recognition, I believe these ideas will help make machines more intelligent and more suitable for performing actions in the real world.

 Disgust detection example from CVPR 2012 paper
 


To give the vision hackers a few more details, this framework uses Structural SVMs (NOTE: trending topic at CVPR) and is able to estimate the probability of an action happening before it actually finishes.  This is something which we, humans, seem to do all the time but has been somehow neglected by machine vision researchers.
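The difference between deciding after an event completes and deciding early can be sketched in a few lines.  This is only the inference side under loud assumptions: the per-frame scores are made up, the running sum is a stand-in for a learned partial-event score, and none of this is Minh's max-margin training procedure, whose whole point is to learn a detector whose scores on partial events are trustworthy.

```python
def detect_after_completion(frame_scores, threshold):
    """Classic pipeline: look at the whole clip, then decide.

    Returns the index of the last frame if the event is detected, else None.
    """
    total = sum(frame_scores)
    return len(frame_scores) - 1 if total >= threshold else None

def detect_early(frame_scores, threshold):
    """Fire at the first frame where the partial-event score is high enough."""
    running = 0.0
    for t, score in enumerate(frame_scores):
        running += score
        if running >= threshold:
            return t
    return None

# Hypothetical per-frame evidence for a facial expression that builds up over time.
expression_evidence = [0.1, 0.4, 0.9, 1.2, 1.1, 0.8]
late_frame = detect_after_completion(expression_evidence, threshold=2.0)
early_frame = detect_early(expression_evidence, threshold=2.0)
```

With these numbers the early detector fires at frame 3, while the classic detector has to wait for frame 5.  For a robot crossing the road, those two frames are the difference that matters.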


Max-Margin Early Event Detectors.
Hoai, Minh & De la Torre, Fernando
CVPR 2012

Abstract:
The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach. To the best of our knowledge, this is the first paper in the literature of computer vision that proposes a learning formulation for early event detection.

Early Event Detector Project Page (code available on website)

Minh gave an excellent, enthusiastic, and entertaining presentation during day 3 of CVPR 2012 and was definitely one of the highlights of that day. He received his PhD from CMU's Robotics Institute (like me, yippee!) and is currently a Postdoctoral research scholar in Andrew Zisserman's group in Oxford.  Let's all congratulate Minh for all his hard work.


CVPR 2012 Day 2: optimize, optimize, optimize

Due to popular request, here is my overview of some of the coolest stuff from Day 2 of CVPR 2012 in Providence, RI.  While the Lobster dinner was the highlight for many of us, there were also some serious learning/optimization-based papers presented during Day 2 worthy of sharing.  Here are some of the papers which left me with a very positive impression.


Dennis Strelow of Google Research in Mountain View presented a general framework for Wiberg minimization.  This is a strategy for minimizing objective functions with multiple variables -- objectives which are typically tackled in an EM-style fashion.  The idea is to express one of the variables as a linear function of the other variable, effectively making the problem depend on only one set of variables.  The technique is quite general and has been shown to produce state-of-the-art results on a bundle adjustment problem.  I know Dennis from my second internship at Google where we worked on some sparse-coding problems.  If you perform lots of matrix decomposition problems, check out his paper!
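Here is a minimal sketch of the "eliminate one set of variables" idea on a toy curve-fitting problem.  To be clear about the assumptions: this is plain variable projection with a closed-form solve, not Dennis's general or nested Wiberg algorithm (which linearizes the dependence between the variables), and the model `a * exp(-b * t)`, the helper names, and the golden-section search are all illustrative choices.

```python
import math

def best_scale(decay, times, values):
    """Closed-form least-squares `a` for the model a * exp(-decay * t)."""
    basis = [math.exp(-decay * t) for t in times]
    return sum(v * e for v, e in zip(values, basis)) / sum(e * e for e in basis)

def reduced_cost(decay, times, values):
    """Squared error as a function of `decay` only; `a` has been eliminated."""
    a = best_scale(decay, times, values)
    return sum((v - a * math.exp(-decay * t)) ** 2 for t, v in zip(times, values))

def golden_section_minimize(f, lo, hi, iterations=80):
    """Derivative-free 1D minimization of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if f(left) < f(right):
            hi = right
        else:
            lo = left
    return (lo + hi) / 2

# Noise-free synthetic data generated with a = 2.0 and b = 0.7.
times = [0.5 * i for i in range(10)]
values = [2.0 * math.exp(-0.7 * t) for t in times]

decay = golden_section_minimize(lambda b: reduced_cost(b, times, values), 0.0, 5.0)
scale = best_scale(decay, times, values)
```

Instead of alternating between `a` and `b` in an EM-style loop, the search runs over `b` alone and `a` comes along for free.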


Dennis Strelow
General and Nested Wiberg Minimization
CVPR 2012


Another cool paper which is all about learning is Hossein Mobahi's algorithm for optimizing objectives by smoothing them to avoid getting stuck in local minima.  This paper is not about blurry images, but about applying Gaussians to objective functions.  In fact, for the problem of image alignment, Hossein provides closed form versions of image operators.  Now when you apply these operators to images, you efficiently smooth the underlying cross-correlation alignment objective.  You decrease the blur, while following the optimum path, and get much nicer answers than doing naive image alignment.
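The "decrease the blur while following the optimum" recipe can be sketched on a 1D toy objective.  The assumptions here are mine: the objective `f(x) = 0.1*x**2 - cos(3*x)`, the blur schedule, and plain gradient descent are illustrative, and this is not Hossein's image alignment objective.  The toy is chosen because its Gaussian-smoothed version has a closed form, in the same spirit as the closed-form operators in the paper.

```python
import math

def smoothed_gradient(x, sigma):
    """Gradient of f(x) = 0.1*x**2 - cos(3*x) after Gaussian smoothing.

    Convolving f with a Gaussian of standard deviation sigma gives
    0.1*(x**2 + sigma**2) - exp(-4.5*sigma**2) * cos(3*x) in closed form,
    so sigma = 0 recovers the original objective.
    """
    return 0.2 * x + 3.0 * math.exp(-4.5 * sigma ** 2) * math.sin(3.0 * x)

def descend(x, sigma, step=0.1, iterations=500):
    """Plain gradient descent on the objective smoothed at a fixed sigma."""
    for _ in range(iterations):
        x -= step * smoothed_gradient(x, sigma)
    return x

start = 4.0

# Naive: descend on the original objective and get stuck in a local minimum.
naive = descend(start, sigma=0.0)

# Continuation: start very blurry, then sharpen while tracking the minimizer.
x = start
for sigma in (2.0, 1.0, 0.5, 0.25, 0.0):
    x = descend(x, sigma)
continued = x
```

The naive run stalls in the local minimum near x = 4.1, while the continuation run ends at the global minimum at x = 0.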


Hossein Mobahi, C. Lawrence Zitnick, Yi Ma
Seeing through the Blur
CVPR 2012


Ira Kemelmacher-Shlizerman, of Photobios fame, showed a really cool algorithm for computing optical flow between two different faces based on learning a subspace (using a large database of faces).  The idea is quite simple and allows for flowing between two very different faces where the underlying operation produces a sequence of intermediate faces in an interpolation-like manner.  She shared this video with us during her presentation, but it is on YouTube, so now you can enjoy it for yourself.


Ira Kemelmacher-Shlizerman, Steven M. Seitz
Collection Flow
CVPR 2012



Now talk about cool ideas!  Pyry, of CMU fame, presented a recommendation engine for classifiers.  The idea is to take techniques from collaborative filtering (think Netflix!) and apply them to the classifier selection problem.  Pyry has been working on action recognition and the ideas presented in this work are not only quite general, but are also quite intuitive and likely to benefit anybody working with large collections of classifiers.
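To give a flavor of the Netflix analogy, here is a minimal nearest-neighbor sketch in which past tasks play the role of users and classifiers play the role of movies.  Everything in it is hypothetical (the task names, the classifier names, the accuracies, and the nearest-neighbor rule), and it is not the formulation from Pyry's paper.

```python
def recommend_classifier(past_scores, probe_scores):
    """Nearest-neighbor collaborative filtering over classifier accuracies.

    past_scores: {task_name: {classifier_name: accuracy}} for earlier tasks.
    probe_scores: {classifier_name: accuracy} for the few classifiers that
    were already tried on the new task.
    """
    def distance(task_scores):
        return sum((task_scores[name] - acc) ** 2 for name, acc in probe_scores.items())

    # Find the past task whose "ratings" best match the probes on the new task.
    nearest_task = min(past_scores, key=lambda task: distance(past_scores[task]))

    # Recommend the best not-yet-tried classifier from that task.
    candidates = {name: acc for name, acc in past_scores[nearest_task].items()
                  if name not in probe_scores}
    return max(candidates, key=candidates.get)

past_scores = {
    "running": {"clf_a": 0.9, "clf_b": 0.4, "clf_c": 0.8, "clf_d": 0.3},
    "waving": {"clf_a": 0.3, "clf_b": 0.9, "clf_c": 0.2, "clf_d": 0.85},
}
probe_scores = {"clf_a": 0.35, "clf_b": 0.8}  # two cheap probes on the new task
recommended = recommend_classifier(past_scores, probe_scores)
```

The new task behaves like "waving" on the two probes, so the sketch recommends `clf_d` without ever evaluating `clf_c` or `clf_d` on the new task.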

Pyry Matikainen, Rahul Sukthankar, Martial Hebert
Model Recommendation for Action Recognition
CVPR 2012


And finally, a super-easy algorithm presented for metric learning by Martin Köstinger had me intrigued!  This is a Mahalanobis distance metric learning paper which uses equivalence relationships.  This means that you are given pairs of similar items and pairs of dissimilar items.  The underlying algorithm is really not much more than fitting two covariance matrices, one to the positive equivalence relations, and another to the non-equivalence relations.  They have lots of code online, and if you don't believe that such a simple algorithm can beat LMNN (Large-Margin Nearest Neighbor from Kilian Weinberger), then get their code and hack away!
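Here is a minimal 2D sketch of the "fit two covariance matrices" recipe.  It assumes, as I understand the method, that the learned matrix is the inverse covariance of the similar-pair differences minus the inverse covariance of the dissimilar-pair differences.  The toy pairs and helper names are made up, and the sketch skips any projection of the result onto the positive semi-definite cone.

```python
def covariance_of_differences(pairs):
    """2x2 second-moment matrix of the pairwise differences a - b."""
    sxx = sxy = syy = 0.0
    for (ax, ay), (bx, by) in pairs:
        dx, dy = ax - bx, ay - by
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    n = len(pairs)
    return [[sxx / n, sxy / n], [sxy / n, syy / n]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def learn_metric(similar_pairs, dissimilar_pairs):
    """M = inv(cov of similar differences) - inv(cov of dissimilar differences)."""
    inv_s = inverse_2x2(covariance_of_differences(similar_pairs))
    inv_d = inverse_2x2(covariance_of_differences(dissimilar_pairs))
    return [[inv_s[i][j] - inv_d[i][j] for j in range(2)] for i in range(2)]

def metric_distance(m, a, b):
    """Squared Mahalanobis-style distance (a - b)^T M (a - b), M symmetric."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return m[0][0] * dx * dx + 2 * m[0][1] * dx * dy + m[1][1] * dy * dy

# Toy data: the first coordinate is informative, the second is nuisance noise.
similar_pairs = [((0.0, 0.0), (0.1, 1.0)), ((1.0, 2.0), (0.9, 1.0)),
                 ((2.0, 0.5), (2.1, -0.5)), ((3.0, 1.0), (2.9, 2.0))]
dissimilar_pairs = [((0.0, 0.0), (1.0, 1.0)), ((1.0, 2.0), (2.0, 1.0)),
                    ((3.0, 1.0), (2.0, 2.0)), ((2.0, 0.5), (1.0, -0.5))]

metric = learn_metric(similar_pairs, dissimilar_pairs)
along_informative = metric_distance(metric, (0.0, 0.0), (1.0, 0.0))
along_nuisance = metric_distance(metric, (0.0, 0.0), (0.0, 1.0))
```

On this toy data the learned matrix weights the informative coordinate heavily and all but ignores the nuisance one, with no iterative optimization at all.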

Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, Horst Bischof
Large Scale Metric Learning from Equivalence Constraints
CVPR 2012



CVPR 2012 gave us many very math-oriented papers, and while I cannot list all of them, I hope you found my short list useful.



Tuesday, June 19, 2012

CVPR 2012 Day 1: Accidental Cameras, Large Jigsaws, and Cosegmentation

Today was the first day of CVPR 2012 in Providence, RI.  Here's a quick recap:
  • On the administrative end of things, Deva Ramanan received an award for his contributions to the field as a new young CVPR researcher.  This is a new nomination-based award so be sure to vote for your favorite vision scientists next year!  Deva's work has truly influenced the field and he is well-known for being a co-author of the Felzenszwalb et al. DPM object detector, but since then he has pushed his ideas on part-based models to the next level.  Congratulations Deva, you are the type of researcher we should all strive to be.  
  • Secondly, it looks like CVPR 2015 will be in Boston.
  • Here are some noteworthy papers from the oral sessions of Day 1:


During the first oral session, Antonio Torralba gave an intriguing talk where he showed the world how accidental anti-pinhole and pin-speck cameras are "all around us."  In his presentation, he showed how a person walking in front of a window can be used to image the world outside of a window.  Additionally he showed a variant of image-based Van Eck phreaking, where his technique could be used to view what is on a person's computer screen without having to look at the screen directly.

Accidental pinhole and pinspeck cameras: revealing the scene outside the picture
Antonio Torralba and William T. Freeman
CVPR 2012


Andrew Gallagher gave a really great presentation on using computer vision to solve jigsaw puzzles, where not only are the pieces jumbled, but their orientation is unknown.  His algorithm was used to solve really really large puzzles, ones which are much larger than could be tackled by a human.

Jigsaw Puzzles with Pieces of Unknown Orientation
Andrew Gallagher
CVPR 2012


Gunhee Kim presented his newest work on co-segmentation.  He has been working on this for quite some time and if you are interested in segmentation in image collections, you should definitely check it out.

On Multiple Foreground Cosegmentation
Gunhee Kim and Eric P. Xing
CVPR 2012


Sunday, June 17, 2012

Workshop on Egocentric Vision @ CVPR 2012

Today (Sunday 6/17/2012) is the second day of CVPR 2012 workshops and I'll be going to the Egocentric Vision workshop.  The workshop kicks off at 8:50am (come earlier for some CVPR breakfast) and will start with a keynote talk by Takeo Kanade.  There will also be a talk by Hartmut Neven of Neven Vision, which is now a part of Google.  Also during the poster session, my colleague, Abhinav Shrivastava, will be presenting his work on applying ExemplarSVMs to detection from a first-person point of view --- yet another super-cool application of ExemplarSVMs.

Object detection from first person's view using exemplar SVMs

There are plenty of other cool talks during this workshop including: action recognition from a first-person point of view, experience classification, as well as a study of the obtrusiveness of wearable computing platforms by some fellow MIT vision hackers.

The accuracy-obtrusiveness tradeoff for wearable vision platforms

You might be thinking, "What is egocentric vision?" but nothing explains it better than the following video from Google about its super exciting research project codenamed Project Glass.  I'm really hoping Hartmut talks about this...


If you're looking for me, you know where I'll be tomorrow.  Happy computing.

Wednesday, May 23, 2012

Why your vision lab needs a reading group

I have a certain attitude when it comes to computer vision research -- don't do it in isolation. Reading vision papers on your own is not enough.  Learning how your peers analyze computer vision ideas will only strengthen your own understanding of the field and help you become a more critical thinker.  And that is why at places like CMU and MIT we have computer vision reading groups.  The computer vision reading group at CMU (also known as MISC-read to the CMU vision hackers) has a long tradition, and Martial Hebert has made sure it is a strong part of the CMU vision culture.  Other ex-CMU hackers such as Sanjiv Kumar have carried the vision reading group tradition on to places such as Google Research in NY (correct me if this is no longer the case).  I have brought the reading group tradition to MIT (where I'm currently a postdoc) because I was surprised there wasn't one already!  In reality, we spend so much time talking about papers in an informal setting, that I felt it was a shame to not do so in a more organized fashion.
My personal philosophy is that as a vision researcher, the way towards the goal of creating novel long-lasting ideas is learning how others think about the field.  There's a lot of value in being able to analyze, criticize, and re-synthesize other researchers' ideas.  Believe me when I say that a lot of new vision papers come out of top tier vision conferences every year.  You should be reading them!  But not just reading, also criticizing them among your peers.  Because once you learn to criticize others' ideas, you will become better at promulgating your own.  Do not equate criticism with nasty words for the sake of being nasty -- good criticism stems from a keen understanding of what must be done in science to convince a broad audience of your ideas.

In case you want to start your own computer vision reading group, I've collected some tips, tricks, and advice:

1. You don't need faculty.  If you can't find a seasoned vision veteran to help you organize the event, do not worry.  You just need 3+ people interested in vision and the motivation to maintain weekly meetings.  Who cares if you don't understand every detail of every paper!  Nobody besides the authors will ever understand every detail.

2. Be fearless.  Ask dumb questions.  Alyosha Efros taught me that if you're reading a paper or listening to a presentation and you don't understand something, then there's a good chance you're not the only one in the audience with the same questions.  Sometimes younger PhD students are afraid of "asking a dumb question" in front of an audience.  But if you love knowledge, then it is your duty to ask.  Silence will not get you far.  Be bold, be curious, and grow wise.  

3. Choose your own papers to present.  Do not present papers that others want you to present -- that is better left for a seminar course led by a faculty member.  In a reading group it is very important that you care about the problems you will be discussing with your peers.  If you keep up with this trend then when it comes to "paper writing time" you should be up to date on many relevant papers in your field and you will know about your other lab mates' research interests.

4. It is better to show a paper PDF up on a projector than cancel a meeting.  Even if everybody is busy, and the presenter didn't have time to create slides, it is important to keep the momentum going.

5. After a major conference, have all of the people who attended the conference present their "top K papers."  The week after CVPR it will be valuable to have such a massive vision brain dump onto your peers because it is unlikely that everybody got to attend. 

6. Book a room every week and try to have the meeting at the same time and place.  Have either the presenter or the reading group organizer send out an announcement with the paper they will be presenting ahead of time.  At MIT we share a Google Doc with the information about interesting papers and the upcoming presenter usually chooses the paper one week in advance so that the following week's presenter doesn't choose the same paper.  If somebody has already presented your paper, don't do it a second time!  Choose another paper.  cvpapers.com is a great resource to find upcoming papers.

At CMU, there is a long rotating schedule which includes every vision student and faculty member.  Once it is your time to present, you can only get off the hook if you swap your slot with somebody else.  Being on a schedule months in advance means you'll have lots of time to prepare your slides.  At MIT, we are currently following the object recognition / scene understanding / object detection theme where we (Prof. Torralba, his students, his postdocs, his visiting students, etc) choose a paper highly relevant to our interests.  By keeping such a focus, we can really jump into the relevant details without having to explain fundamental concepts such as SVMs, features, etc.  However, at CMU the reading group is much broader because on the queue are students/profs interested in all aspects of vision and related fields such as graphics, illumination, geometry, learning, etc.