- [ ] noise cleanup?
- [ ] de-duplication?
- [ ] bibliographical data detection?
- [ ] topic classification?