Project 2 is posted. It is about doing camera-based photogrammetry using linear methods. By the end you’ll have a model of a cathedral spinning around in plot3—not bad for a few lines of Matlab code!
The handout has been updated to version 1.1 to fix some errant m's and n's.
Camera image files are now available in CMS and linked from the assignment section of the course website. The nd_images directory in the provided zip file should be extracted to the main directory containing the framework .m files. These images are required for the LinearCamera.plot_reprojections() function of the provided framework.
For the triangulate function, are we supposed to be creating and solving one linear least squares system, or multiple linear least squares systems? For the case where several points are seen by the same set of cameras, it is clear how to combine all of the equations into one linear least squares system, but if you have points that are not visible by the same subset of cameras, there seems to be no easy way to combine everything into one system.
You are correct that when points are seen by different subsets of cameras, it is not simple to combine everything into a single system. It’s fine to loop over the points in this case.
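For illustration, here is a minimal MATLAB sketch of that loop, assuming affine cameras y_ij = M_i*x_j + t_i. It also follows the speed-up tip given later in the thread: the stacked camera matrices and translations are built once, and each point only selects the rows belonging to the cameras that see it. The names Mall, Tall, Yall and vis, and the synthetic data, are hypothetical and do not come from the framework.

```matlab
% Synthetic affine cameras y = M_i*x + t_i (hypothetical test data)
m = 4;  n = 6;
X = [sin(1:n); cos(2*(1:n)); sin(3*(1:n) + 0.5)];  % 3 x n true points
Mall = zeros(2*m, 3);  Tall = zeros(2*m, 1);       % stacked once, reused for every point
Yall = zeros(2*m, n);                              % all 2D observations
for i = 1:m
    rows = 2*i-1 : 2*i;
    Mall(rows, :) = [cos(i) sin(i) 0.2*i; -sin(i) cos(i) 0.3];  % arbitrary 2x3 camera
    Tall(rows) = [i; -i];
    Yall(rows, :) = Mall(rows, :) * X + repmat(Tall(rows), 1, n);
end

% Hypothetical visibility: vis(i,j) is true if camera i sees point j
vis = true(m, n);
vis(1, 2) = false;  vis(3, 5) = false;  vis(2, 6) = false;

Xhat = zeros(3, n);
for j = 1:n
    c = find(vis(:, j))';                     % cameras that see point j
    rows = reshape([2*c-1; 2*c], [], 1);      % both rows of each such camera
    % Least-squares solve of M_i*x = y_ij - t_i, stacked over the visible cameras
    Xhat(:, j) = Mall(rows, :) \ (Yall(rows, j) - Tall(rows));
end
```

With noise-free data the recovered Xhat matches X to within rounding error, which is the behaviour described for noiseless generate_problem output.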
@srm
So are you saying that we should first solve a linear least squares system for the subset of points that are all visible by the same set of cameras, and then solve additional linear least squares systems for points that are visible by different subsets of cameras, meaning we solve more than one linear least squares system?
Thanks.
For the Structure from Motion method, what exactly is whichPoints for? The writeup says that Y is a matrix containing all of the observations, one for every point in every camera. If we include the observations for every point in every camera, why do we need whichPoints at all?
When you load notre_dame.mat, what are pp and pc?
Should the points returned by the triangulate function have similar values to the corresponding points in the points matrix returned by generate_problem?
@sac76: whichPoints tells you the set of points visible to all the cameras passed in the “cameras” array. Y should contain a subset of all the points such that each camera sees every point in the subset (pp and pc will tell you which points and cameras make up this full-visibility subset).
@sln27:
1) As mentioned in the previous post, pp and pc are the “popular points” and “popular cameras” (i.e. the points and cameras that make up the full-visibility subset you would want to use to construct the Y matrix).
2) If you have no noise then you should be able to recover the same coordinates to within machine precision.
I think there’s a bug in the framework.
Line 34 in LinearCamera.m should be
d= size(X,2)
not
d= size(X,1)
This is in the function project_points
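To see why size(X,2) is the right count, here is a small projection sketch in the same spirit, assuming (as the fix implies) that points are stored as the columns of a 3 x d matrix. It is not the framework's project_points; M, t and X are made-up values.

```matlab
% Points are the columns of a 3 x d matrix, so the number of points is
% size(X, 2); size(X, 1) is always 3.
M = [1 0 0; 0 1 0];                % hypothetical 2x3 camera matrix
t = [0.5; -0.5];                   % hypothetical 2x1 translation
X = [1 2 3 4; 0 1 0 1; 5 5 5 5];   % 3 x 4, i.e. four points
d = size(X, 2);                    % 4 points
Y = M * X + repmat(t, 1, d);       % 2 x d projected points
```

With d = size(X,1) = 3 instead, repmat(t, 1, d) would be 2 x 3 while M*X is 2 x 4, so the addition fails and no points can be reprojected.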
@bjh: You are right, that is indeed a bug which would prevent reprojection of the points. A new version with the fix is on CMS. Thanks for pointing this out!
When we try generate_problem, Matlab says “The class LinearCamera has no property or method named ‘empty’.” Indeed, there is no method named empty in LinearCamera. How do we run generate_problem?
…well, we commented out line 7 in generate_problem, and it seemed to work fine. It seems like calling LinearCamera.empty was just preallocating, but will this mess something up?
Using [points, cameras] = generate_problem(10, 10, 1, 10, 10, 0) should give me 10 cameras and 10 points such that all cameras see all points.
So if I call structure_from_motion(cameras, 1:10), I should get back a set of 10 points that exactly match the points from generate_problem, right?
In structure from motion we are finding two matrices M and X such that M * X = Y. What guarantee is there that the X matrix returned by using the strategy in homework 8 will match the values of the points matrix from generate_problem exactly?
Since the variables in an object are not set to private, is it ok if we access and manipulate them directly instead of using the getter and setter functions?
How do we call the set functions on LinearCamera? We have our M and t and could set them directly on a camera, but we feel we are supposed to use the functions. We’ve pored over the documentation but can’t find a valid way to call set.
@Michael Hurd: empty() is indeed a static preallocation method that Matlab provides for object arrays (so it doesn’t need to be declared in LinearCamera). If you are getting this error, you may be using an older version of Matlab. Removing the line should not have any other effect on the code.
@sac76: There is no guarantee that the points you get out of the SVD factorization will match the original points (due to the remaining ambiguity described in the relevant section of the handout). There should, however, exist a matrix A such that (Mtilde*A)*(A^-1*Xtilde) = M*X = Y, meaning you can solve for such a matrix and see whether your factored points can be aligned with the original points.
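As a rough sketch of that check: assume the original points have been centered (the factorization places the origin at the centroid of the points) and that you have a 3 x n Xtilde. Since X = A^-1*Xtilde, the 3 x 3 matrix A^-1 can be found by least squares with MATLAB's slash operator. Here Xtilde is simulated from a made-up Atrue so the result can be compared; with real factorization output you would only inspect the residual.

```matlab
n = 8;
X = [sin(1:n); cos(2*(1:n)); sin(3*(1:n) + 0.5)];
X = X - repmat(mean(X, 2), 1, n);          % centered "original" points
Atrue = [2 0.3 0; -0.1 1 0.5; 0.4 0 1.5];  % hypothetical ambiguity matrix
Xtilde = Atrue * X;                        % simulated factorization output

B = X / Xtilde;                            % least-squares B with B*Xtilde ~ X, i.e. B ~ A^-1
Ahat = inv(B);                             % estimate of A
residual = norm(B * Xtilde - X, 'fro');    % small residual => points can be aligned
```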
@fk: Actually the getter and setter functions are automatically called when you try to access those variables so you should just go ahead and use the typical “cam.M” etc. notation.
@kaf: See previous post. You would do cam.M = newM , for example.
For the bonus question, I think that it is supposed to be a 3*m by 6 linear system, not 3*n by 6.
Running bootstrap on a generated problem doesn’t give me a solution close to the original, even after several iterations. I’ve checked my triangulate and calibrate functions. However, can the rank of Y in MX = Y in structure from motion be as large as 4? If so, there is some loss when I factor Y into 2*m x 3 and 3 x n matrices. I don’t see subsequent iterations bringing my values closer to the original generated values.
In addition, triangulate is very slow because it loops over all the cameras and all the points. Are there any tips to speed up the code?
@Daniel: Yes, you’re right, that is 3m, since m is the number of cameras.
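For reference, one common way to build such a 3m x 6 system (the metric-upgrade step used in Tomasi-Kanade style factorization) is sketched below. It assumes the six unknowns are the entries of the symmetric matrix Q = A*A'; check the handout for its exact parameterization. Each camera contributes three equations (two unit-length rows and one orthogonality condition), giving 3m rows. A is then recovered from Q by a Cholesky factorization, which requires the least-squares Q to be positive definite: true for this synthetic noise-free data, not guaranteed for noisy data.

```matlab
% Synthetic cameras with orthonormal rows, hidden behind a made-up Atrue
m = 5;
Rz = @(a) [cos(a) -sin(a) 0; sin(a) cos(a) 0; 0 0 1];
Rx = @(b) [1 0 0; 0 cos(b) -sin(b); 0 sin(b) cos(b)];
Atrue = [2 0.3 0; -0.1 1 0.5; 0.4 0 1.5];
Mtilde = zeros(2*m, 3);
for i = 1:m
    R = Rz(0.7*i) * Rx(0.4*i);
    Mtilde(2*i-1:2*i, :) = R(1:2, :) / Atrue;   % so Mtilde_i*Atrue has orthonormal rows
end

% u*Q*v' is linear in q = [Q11 Q12 Q13 Q22 Q23 Q33] for symmetric Q
coef = @(u, v) [u(1)*v(1), u(1)*v(2)+u(2)*v(1), u(1)*v(3)+u(3)*v(1), ...
                u(2)*v(2), u(2)*v(3)+u(3)*v(2), u(3)*v(3)];
G = zeros(3*m, 6);  c = zeros(3*m, 1);
for i = 1:m
    a = Mtilde(2*i-1, :);  b = Mtilde(2*i, :);
    G(3*i-2, :) = coef(a, a);  c(3*i-2) = 1;    % first row has unit length
    G(3*i-1, :) = coef(b, b);  c(3*i-1) = 1;    % second row has unit length
    G(3*i,   :) = coef(a, b);  c(3*i)   = 0;    % rows are orthogonal
end
q = G \ c;                                      % 3m x 6 least-squares solve
Q = [q(1) q(2) q(3); q(2) q(4) q(5); q(3) q(5) q(6)];
A = chol(Q)';                                   % lower triangular, A*A' = Q
M = Mtilde * A;                                 % rows (approximately) orthonormal
```

A is only determined up to a rotation, so A need not equal Atrue even though A*A' does.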
@tl: The SVD structure from motion method will not in general return points which align with your original points (for that you need to resolve the ambiguity as described in the handout and then also figure out an arbitrary rotation to align the entire system). To speed up triangulate() you may want to construct the entire linear system with camera matrices and projections once at the beginning and then pick out relevant parts when solving for each point.
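The final alignment by an arbitrary rotation can be sketched as an orthogonal Procrustes fit: given centered original points X and centered recovered points Xr that differ only by a rotation, the SVD of X*Xr' gives the best rotation. This is a minimal MATLAB sketch on synthetic data. If a reconstruction came out mirrored, the determinant guard keeps R a proper rotation and the leftover residual will be large.

```matlab
n = 10;
X = [sin(1:n); cos(2*(1:n)); sin(3*(1:n) + 0.5)];
X = X - repmat(mean(X, 2), 1, n);          % centered "original" points
Rz = @(a) [cos(a) -sin(a) 0; sin(a) cos(a) 0; 0 0 1];
Rx = @(b) [1 0 0; 0 cos(b) -sin(b); 0 sin(b) cos(b)];
Xr = Rz(1.1) * Rx(-0.6) * X;               % recovered points, off by an unknown rotation

[U, ~, V] = svd(X * Xr');                  % orthogonal Procrustes
D = diag([1, 1, sign(det(U * V'))]);       % guard against a reflection
R = U * D * V';                            % rotation minimizing ||R*Xr - X||
Xaligned = R * Xr;
```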
For the extra credit part, the system we are solving is overdetermined, so we will get an LLS solution for the entries of A. Doesn’t this mean that when we go back and get M and X using A, the rows of M will not be exactly orthonormal, just close to it? The writeup says they will be exactly orthonormal; which is correct?
@kslice: You’re right that A will not enforce complete orthonormality for the camera matrices but only get us as close as possible, which is what the writeup itself also says: “The idea is to minimize a least squares measure that encourages the rows of M to be unit length and orthonormal.”
A few people have been asking about the expected performance of the bootstrap process so here are a few clarifications:
- The entire bootstrapping process shouldn’t take longer than a few minutes (definitely not more than 15 minutes on reasonable hardware).
- If it is taking too long, you may be adding too few cameras and points per iteration. The system is quite resilient to adding a large number of new cameras and points per iteration, so try that as a quick test (in fact, one could triangulate all the unknown points and calibrate all the uncalibrated cameras in a single iteration and get rough results that are quite acceptable).
- If you need a stopping criterion for your triangulation-calibration iterations, a reasonable choice is to stop when the change in total reconstruction error of the system falls below a small threshold (you can add up all the LinearCamera.reprojection_error() results to get the total error).
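A minimal sketch of that stopping rule, on a synthetic problem where every camera sees every point, alternating a triangulation step and a calibration step. The per-camera error is assumed here to be the sum of squared reprojection residuals; the framework's LinearCamera.reprojection_error() may be defined differently. Stacking each [M_i t_i] into P and the small deterministic "noise" are illustrative choices, not part of the assignment.

```matlab
% Synthetic data: m cameras all seeing n points, plus small deterministic noise
m = 4;  n = 12;  tol = 1e-10;  maxIter = 200;
X = [sin(1:n); cos(2*(1:n)); sin(3*(1:n) + 0.5)];
Y = zeros(2*m, n);  P = zeros(2*m, 4);        % P stacks [M_i t_i] for each camera
for i = 1:m
    rows = 2*i-1 : 2*i;
    Pi = [cos(i) sin(i) 0.2*i i; -sin(i) cos(i) 0.3 -i];
    Y(rows, :) = Pi * [X; ones(1, n)] + 0.01 * [sin(7*i + (1:n)); cos(5*i + (1:n))];
    P(rows, :) = Pi + 0.05;                   % perturbed starting calibration
end

errHist = [];  prevErr = Inf;
for iter = 1:maxIter
    % Triangulate all points given the current cameras
    Xh = P(:, 1:3) \ (Y - repmat(P(:, 4), 1, n));
    % Calibrate each camera given the current points: solve [M t]*[x; 1] = y
    Xh1 = [Xh; ones(1, n)];
    total = 0;
    for i = 1:m
        rows = 2*i-1 : 2*i;
        P(rows, :) = Y(rows, :) / Xh1;
        r = P(rows, :) * Xh1 - Y(rows, :);
        total = total + sum(r(:).^2);         % add up per-camera errors
    end
    errHist(end+1) = total;
    if prevErr - total < tol                  % change in total error is tiny: stop
        break;
    end
    prevErr = total;
end
```

Because each step is an exact least-squares solve, the total error does not increase from one iteration to the next, so the change eventually drops below tol; maxIter is only a safety cap.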
Could someone post a screenshot of what the final 3d image is supposed to look like? That would be extremely helpful.
Thanks!
Also, how 3d is the image supposed to look? Although I can rotate my final image in plot3, it doesn’t have much structural definition along the z-axis. It just looks like a thicker picture.
Thanks.
@fk: The point cloud will be mostly two dimensional because the majority of the features are on the front face of the cathedral. You can qualitatively see what a front view of the point cloud should look like here: http://blogs.cornell.edu/cs3220/files/2010/04/nd.jpg
(this result is definitely not a gold standard by which to judge your own but it should give you an idea of what to expect)
I am confused as to what exactly we are supposed to do for structure_from_motion. After we form the Y matrix do we need to subtract the mean of each row from that particular row and then find the svd in order to get the Mtilde and Xtilde matrices? Or am I reading that wrong and we are supposed to do something else?
Hi. I am confused about how to get started on bootstrap. Would I first call structure_from_motion by passing in parameters pp and pc?
@noob: Yes, you have the right idea. We are free to place the origin at the centroid of the 3D points, so their average is zero, and the average of the 2D points for camera i is M_i(0) + t_i = t_i. Subtracting t_i from each equation M_i x_j + t_i = y_ij leaves M_i x_j = y_ij, and when we stack all of these together into M X = Y, we have something we can factor using SVD.
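Here is a minimal MATLAB sketch of those steps on synthetic, noise-free data. It also shows why the uncentered Y can have rank 4 (Y = [M t]*[X; ones(1,n)]), while the centered matrix has rank 3. How S is split between Mtilde and Xtilde is a free choice; putting all of it on Mtilde is just one option, and the remaining ambiguity is the matrix A discussed earlier in the thread.

```matlab
m = 4;  n = 10;
X = [sin(1:n); cos(2*(1:n)); sin(3*(1:n) + 0.5)];
X = X - repmat(mean(X, 2), 1, n);          % origin at the centroid of the points
M = zeros(2*m, 3);  t = zeros(2*m, 1);
for i = 1:m
    rows = 2*i-1 : 2*i;
    M(rows, :) = [cos(i) sin(i) 0.2*i; -sin(i) cos(i) 0.3];  % hypothetical cameras
    t(rows) = [i; -i];
end
Y = M * X + repmat(t, 1, n);               % 2m x n observations, rank up to 4

tHat = mean(Y, 2);                         % row means recover t since mean(X,2) = 0
Yc = Y - repmat(tHat, 1, n);               % centered: Yc = M*X, rank 3
[U, S, V] = svd(Yc);
Mtilde = U(:, 1:3) * S(1:3, 1:3);          % 2m x 3
Xtilde = V(:, 1:3)';                       % 3 x n
```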
@student: Yes, you use pc and pp to select a set of cameras and points to use in structure_from_motion. The function arguments are an array of cameras (so you’ll use indexing with pc to select the cameras to pass in) and an array of indices of points (so you’ll just pass in pp). Inside the function, since each camera contains an array holding all the 2D points it saw, you’ll use indexing with pp to pull out the relevant data to assemble Y.
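A minimal sketch of that indexing, assuming each camera stores a 2 x N array of all its 2D observations. The struct field obs and the toy numbers are hypothetical stand-ins for whatever the LinearCamera class actually uses.

```matlab
N = 6;                                     % total number of points
cams = struct('obs', {});                  % hypothetical stand-in for LinearCamera objects
for i = 1:3
    cams(i).obs = [10*i + (1:N); -(10*i + (1:N))];   % 2 x N toy observations
end
pc = [1 3];  pp = [2 4 5];                 % toy popular cameras / popular points

sel = cams(pc);                            % the cameras you would pass in
Y = zeros(2*numel(sel), numel(pp));
for i = 1:numel(sel)
    Y(2*i-1:2*i, :) = sel(i).obs(:, pp);   % rows for camera i, columns for points pp
end
```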