There is still no clear and obvious solution to this problem. I'm currently leaning toward either using an Avro file to record each file name and its MD5 hash, or adding an MD5 hash column to the existing Parquet file and updating it.
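Either option needs the same core step: compute an MD5 hash per file and compare it with what was recorded last time. Below is a rough sketch of that step only, using the standard library. The function names (`md5_of_file`, `build_manifest`, `changed_files`) and the in-memory dict are my own placeholders, not part of any existing code; writing the records out to Avro or to a Parquet column is deliberately left out, since that choice is still open.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file name in `directory` (non-recursive) to its MD5 hash."""
    return {
        p.name: md5_of_file(p)
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Return names that are new or whose hash differs from the old manifest."""
    return sorted(
        name
        for name, digest in new_manifest.items()
        if old_manifest.get(name) != digest
    )
```

The manifest here is just a `{file name: md5}` mapping, so it could be serialized as Avro records or joined onto the existing Parquet data as an extra column without changing the comparison logic.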