Rethink Robotics Lab

Our experiments are performed on the Baxter humanoid robot from Rethink Robotics [cite?]. Each arm has 7 DoF with a gripper at the end, and can be controlled by position, velocity, or effort at the endpoint at 100 Hz. In the Gazebo simulator, the world is initialized with a table, plates, and a wooden board for the tasks, and the tools (e.g., fork, knife, brush) are initialized in the left-hand gripper to perform the action sequence.
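For concreteness, a world of this kind can be populated programmatically; the minimal sketch below uses the standard gazebo_ros spawn service, with model names and SDF paths that are purely illustrative placeholders rather than the exact assets used here.

```python
# Minimal sketch: spawning task objects into a running Gazebo world via the
# gazebo_ros spawn service. Model names and SDF paths are illustrative only.
import rospy
from geometry_msgs.msg import Pose, Point, Quaternion
from gazebo_msgs.srv import SpawnModel

def spawn_sdf(name, sdf_path, x, y, z):
    """Spawn one SDF model at (x, y, z) in the Gazebo world frame."""
    rospy.wait_for_service('/gazebo/spawn_sdf_model')
    spawn = rospy.ServiceProxy('/gazebo/spawn_sdf_model', SpawnModel)
    with open(sdf_path, 'r') as f:
        model_xml = f.read()
    pose = Pose(position=Point(x, y, z), orientation=Quaternion(0, 0, 0, 1))
    spawn(model_name=name, model_xml=model_xml, robot_namespace='/',
          initial_pose=pose, reference_frame='world')

if __name__ == '__main__':
    rospy.init_node('init_task_world')
    # Hypothetical layout: a table with a plate and a wooden board on top.
    spawn_sdf('table', 'models/table/model.sdf', 0.8, 0.0, 0.0)
    spawn_sdf('plate', 'models/plate/model.sdf', 0.8, 0.1, 0.78)
    spawn_sdf('board', 'models/board/model.sdf', 0.8, -0.2, 0.78)
```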

In the execution cycle of our experiments, Baxter can execute 5 action primitives: Stay (S), Left (L), Right (R), Forward (F), and Backward (B), each of which moves the end effector by one step unit \delta (\delta = 0.08 m). We control Baxter's arm by sending the desired end-effector position to an inverse kinematics (IK) solver to obtain the joint angles (7 joints: left_s0, s1, e0, e1, w0, w1, w2) for joint position control. The end-effector position may be slightly unstable even though the IK solver is seeded with the neutral arm position. Actions are executed at 2 Hz and image frames are refreshed at the same frequency. At each time step, the cropped RGB image from the head camera is resized to 80\times80 pixels.
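A single control step can be sketched as follows using the Baxter Python SDK; the axis conventions, the crop region, and the ik_joint_angles helper (standing in for the call to Baxter's IK service, seeded with the neutral pose) are assumptions made for illustration, not the exact code used in the experiments.

```python
# Sketch of one control step (assumes baxter_interface is available and that
# ik_joint_angles() wraps Baxter's IK service; axis conventions are illustrative).
import cv2
import baxter_interface

DELTA = 0.08  # step unit in meters
# Action primitive -> end-effector displacement (dx, dy, dz) in the base frame.
PRIMITIVES = {
    'S': (0.0, 0.0, 0.0),     # Stay
    'L': (0.0, DELTA, 0.0),   # Left
    'R': (0.0, -DELTA, 0.0),  # Right
    'F': (DELTA, 0.0, 0.0),   # Forward
    'B': (-DELTA, 0.0, 0.0),  # Backward
}

def execute_primitive(limb, action, ik_joint_angles):
    """Move the end effector by one step unit using joint position control."""
    dx, dy, dz = PRIMITIVES[action]
    pose = limb.endpoint_pose()              # current Cartesian pose
    target = dict(pose)
    target['position'] = (pose['position'].x + dx,
                          pose['position'].y + dy,
                          pose['position'].z + dz)
    # ik_joint_angles: hypothetical wrapper around the IK service, seeded with
    # the neutral arm configuration; returns {joint_name: angle} for the 7 joints.
    angles = ik_joint_angles(target, seed='neutral')
    if angles is not None:
        limb.move_to_joint_positions(angles, timeout=0.5)  # ~2 Hz execution

def preprocess(frame):
    """Crop the head-camera RGB frame and resize it to the 80x80 network input."""
    cropped = frame[:, 80:560]               # illustrative crop region
    return cv2.resize(cropped, (80, 80))

# Usage: limb = baxter_interface.Limb('left'); execute_primitive(limb, 'F', ik_fn)
```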

The DQN then takes this image as input and outputs the action that maximizes the long-term expected reward. Baxter follows an \epsilon-greedy policy (\epsilon\in[0.05,0.5], decreasing with episodes as in [mnih2013playing]) until an action-length time-out, which takes roughly 10 seconds. The robot performs a "test" episode every 50 episodes, following the learned policy exactly without random exploration. Once an episode finishes, the cached image frames are fed into the CNN-LSTM to obtain the terminal reward, and the robot is reset to the neutral position. All joint and end-effector states, values, and reward tuples (s,a,r,s',a') are stored in the replay memory for training, and the "test" episodes are recorded for later evaluation. The replay memory has size 200k due to machine memory constraints. We then randomly sample 50k tuples from the replay memory to train the DQN in each round of training.
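The exploration and replay logic might look roughly like the sketch below; the linear decay schedule, the network interface, and the batch handling are assumed details for illustration, not the exact settings used here.

```python
# Sketch of epsilon-greedy selection with episode-based decay and replay sampling.
# Decay schedule and sampling details are assumptions, not the original settings.
import random
from collections import deque

import numpy as np

ACTIONS = ['S', 'L', 'R', 'F', 'B']
EPS_START, EPS_END, DECAY_EPISODES = 0.5, 0.05, 1000  # assumed anneal horizon

def epsilon(episode):
    """Decay epsilon linearly from 0.5 to 0.05 over DECAY_EPISODES episodes."""
    frac = min(1.0, episode / float(DECAY_EPISODES))
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, episode, test=False):
    """Greedy during 'test' episodes; epsilon-greedy otherwise."""
    if (not test) and random.random() < epsilon(episode):
        return random.randrange(len(ACTIONS))
    return int(np.argmax(q_values))

# Replay memory capped at 200k transitions; 50k are sampled per training round.
replay = deque(maxlen=200000)

def store(s, a, r, s_next, a_next):
    replay.append((s, a, r, s_next, a_next))

def sample_training_set(k=50000):
    """Randomly draw up to k stored tuples (without replacement) for DQN training."""
    return random.sample(replay, min(k, len(replay)))
```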
