When humans grasp and manipulate objects, they almost invariably do so with the aid of vision. Visual information is used to locate and identify things, and to decide how (and if) they should be grasped. Visual feedback helps us guide our hands around obstacles and align them accurately with their goal. Hand-eye coordination gives us a flexibility and dexterity of movement that no machine can match.
Most vision systems for robotics need to be calibrated. Camera geometry - the focal length, principal point and aspect ratio of each camera [17], the relative position and orientation of the cameras (epipolar geometry) [15] and their relation to the robot coordinate system [16] - must be measured to a high degree of accuracy. A well-calibrated stereo rig can accurately determine the position and shape of objects to be grasped in all three dimensions [14]. However, if the calibration is erroneous or the cameras are disturbed, the system will usually fail gracelessly.
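For reference (the notation here is ours, not the paper's), these intrinsic parameters can be collected in the standard pinhole calibration matrix, and full calibration additionally requires the pose (R, t) of each camera relative to the robot:

\[
\mathbf{x} \simeq K\,[R \mid \mathbf{t}]\,\mathbf{X}, \qquad
K = \begin{pmatrix} f & 0 & u_0 \\ 0 & \alpha f & v_0 \\ 0 & 0 & 1 \end{pmatrix},
\]

where f is the focal length, \(\alpha\) the aspect ratio and \((u_0, v_0)\) the principal point. Errors in any of these parameters propagate directly into triangulated 3D positions, which is why disturbing the cameras defeats a conventionally calibrated system.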
An alternative approach in hand-eye applications, where a manipulator moves to a visually specified target, is to use visual feedback to match manipulator and target positions in the image. Exact spatial coordinates are not required, and a well-chosen feedback architecture can correct for quite serious inaccuracies in camera calibration (as well as inaccurate kinematic modelling) [19].
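As a minimal sketch of this idea (a generic proportional image-based feedback step in Python/NumPy; the function and the matrix J_pinv are our illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def feedback_step(gripper_uv, target_uv, J_pinv, gain=0.25):
    """One iteration of image-based feedback.

    gripper_uv, target_uv: (2, 2) arrays holding the (u, v) image
    positions of gripper and target in the left and right views.
    J_pinv: (3, 4) matrix (hypothetical) mapping the stacked
    image-plane error to an incremental Cartesian robot motion,
    e.g. a pseudo-inverse of an approximate image Jacobian.
    """
    error = (target_uv - gripper_uv).ravel()  # 4-vector of pixel errors
    return gain * (J_pinv @ error)            # small corrective robot motion
```

Because each step acts on the measured image error, a merely approximate J_pinv typically slows convergence rather than producing a systematic miss, which is why exact spatial calibration is not required.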
In this paper we describe a system that combines stereo vision with a robotic manipulator, enabling it to locate and reach simple unmodelled objects efficiently in an unstructured environment. The system is initially uncalibrated: it `calibrates' itself automatically by tracking the gripper during four deliberate exploratory movements in its workspace. It can then operate successfully in the presence of errors in the kinematics of the robot manipulator, and of unknown changes in the position, orientation and intrinsic parameters of the stereo cameras during operation.
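For concreteness, the following sketch (our own illustration, assuming each view is modelled as an affine camera; names and shapes are hypothetical) shows how such a self-calibration could be computed by linear least squares from the tracked gripper positions:

```python
import numpy as np

def fit_affine_camera(world_pts, image_pts):
    """Fit a 2x4 affine projection P such that image ~= P @ [X; 1].

    world_pts: (N, 3) gripper positions in robot coordinates, from the
    exploratory movements (N >= 4, non-coplanar).
    image_pts: (N, 2) corresponding tracked gripper positions in one view.
    With exactly four points the system is determined; more points give
    a least-squares fit. One such P is estimated per camera.
    """
    N = world_pts.shape[0]
    H = np.hstack([world_pts, np.ones((N, 1))])        # homogeneous (N, 4)
    P_T, *_ = np.linalg.lstsq(H, image_pts, rcond=None)
    return P_T.T                                       # (2, 4)
```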
The system exploits an affine stereo algorithm (described in section 2) - a simple but robust approximation to the geometry of stereo vision - which, though of modest accuracy, requires minimal calibration and can tolerate small camera movements. We show that, in some circumstances, this simplified camera model is less sensitive to image measurement error, since it avoids computing parameters of full perspective stereo that are inherently ill-conditioned [4]. Closed-loop control is achieved by tracking the gripper's movements across the two images to estimate its position and orientation relative to the target object. This is done with a form of active contour model that resembles a B-spline snake [3] but is constrained to deform only affinely (described in section 3), yielding a more reliable tracker that is less easily confused by background contours or partial occlusion. Inevitable errors in aligning the position and orientation of the gripper with those of the target object are corrected by an image-based feedback mechanism (section 4). Preliminary results of a real-time implementation are presented (section 5); they show the system to be remarkably immune to unexpected changes in camera position and focal length, even after the initial self-calibration.
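To illustrate the reconstruction step of affine stereo (again a hedged sketch under our assumed notation, reusing the per-view projections P fitted in the earlier sketch), 3D position follows from a small linear system:

```python
import numpy as np

def affine_triangulate(P_left, P_right, uv_left, uv_right):
    """Recover a 3D point from its projections in two affine views.

    P_left, P_right: (2, 4) affine projections [M | t] for each camera.
    uv_left, uv_right: (2,) image measurements of the same point.
    Stacks the four linear equations M @ X = uv - t from both views and
    solves the overdetermined 4x3 system in least squares.
    """
    A = np.vstack([P_left[:, :3], P_right[:, :3]])
    b = np.concatenate([uv_left - P_left[:, 3], uv_right - P_right[:, 3]])
    X, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X   # 3D position in the coordinate frame of the calibration
```

Positions and orientations recovered this way are only approximate, which is why the image-based feedback mechanism of section 4 is still needed for the final alignment of gripper and target.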