To address the problem of effectively learning multi-granularity spatio-temporal representations for multi-event detection in video, a fusion mechanism combining ...