Hi, if we are crawling web pages that contain images and videos, how should we extend this system to store the images and videos of the crawled websites (storage type, etc.)? Can we use AWS S3 (or some other object store)?
Since each element in an HTML page is also identified by a URL that can be fetched over HTTP, we can configure the crawler to read the HTTP headers first, check the Content-Type, and, based on the type of content, store it in a different place or run a different kind of processing. Of course, downloading videos and other media content would increase the storage and processing time.
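As a minimal sketch of that idea (the route names are illustrative assumptions, not part of any particular crawler), a HEAD request lets the crawler inspect the Content-Type before committing to a full download:

```python
import urllib.request


def probe_content_type(url, timeout=10):
    """Issue a HEAD request and return the raw Content-Type header,
    so we can decide how to handle the resource before downloading it."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Type", "")


def route_by_content_type(ctype):
    """Map a Content-Type value to a (hypothetical) processing route."""
    main = ctype.split(";")[0].strip().lower()  # drop charset etc.
    if main == "text/html":
        return "parse"         # extract links and text as usual
    if main.startswith(("image/", "video/")):
        return "object-store"  # hand off to blob storage (e.g. S3)
    return "skip"              # ignore everything else
```

One nice property of the HEAD-first approach is that the crawler never pulls a multi-gigabyte video body just to learn it doesn't want it.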
To store this kind of data, I don’t see any problem with using AWS S3: it caches its contents well behind CDNs, and if the crawler feeds some other application that consumes the videos you downloaded, that helps too. Alternatively, you can use HDFS, the distributed file system used by Hadoop, which is fast and redundant and lets you run batch processing to extract metadata, thumbnails, etc.
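If S3 is the choice, the upload path could look like the sketch below, using boto3 (the AWS SDK for Python). The bucket name and the URL-to-key scheme are illustrative assumptions; any stable mapping works:

```python
def media_key(url):
    """Derive a deterministic S3 key from the source URL
    (hypothetical scheme: just flatten the scheme separator)."""
    return url.replace("://", "/")


def store_media(url, body, content_type, bucket="crawler-media",
                s3_client=None):
    """Upload a downloaded media object to S3 under a key derived
    from its URL, and return its s3:// location."""
    import boto3  # imported lazily so the key logic works without AWS deps
    s3 = s3_client or boto3.client("s3")
    key = media_key(url)
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  ContentType=content_type)
    return f"s3://{bucket}/{key}"
```

Keeping the key derived from the URL makes re-crawls idempotent: fetching the same image twice overwrites the same object instead of duplicating it.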
Hope this helps!