Description
When trying to upload the datasheet of an LCSC component, the download fails with [INFO] Warning: PDF download returned the wrong file type. Inspecting the URL with curl shows a redirect: HTTP/1.1 301 Moved Permanently. The requests library as used in tools.py doesn't follow redirects, so I enabled them and added a check of the final URL. It then kept failing: it turns out that one of the redirects leads to an intermediate .html file that contains the final URL of the PDF. The code below handles extracting that URL from the intermediate page. It could be prettified a bit; it's quite blunt right now.
if filetype == 'Image' or filetype == 'PDF':
    # Enable use of requests library for downloading files (some URLs do NOT work with urllib)
    if requests_lib:
        # Follow redirects automatically
        response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
        # Check final URL after redirects
        cprint(f'[INFO]\tFinal URL after redirects: {response.url}', silent=silent)
        # Separate handling for lcsc.com URLs
        if 'lcsc.com' in url:
            if 'text/html' in response.headers.get('Content-Type', '').lower():
                # Look for an embedded element whose src attribute points at the PDF
                match = re.search(r'<[^>]+src="([^"]+\.pdf)"', response.text)
                if match:
                    pdf_url = match.group(1)
                    cprint(f"[INFO]\tExtracted PDF URL: {pdf_url}", silent=silent)
                    response = requests.get(pdf_url, headers=headers, timeout=timeout, allow_redirects=True)
                else:
                    cprint("[INFO]\tWarning: No PDF URL found in the HTML content.", silent=silent)
                    return None
        # Check if file is pdf
        if 'application/pdf' not in response.headers.get('Content-Type', '').lower():
            # Keep the unexpected response on disk for inspection
            with open('debug_output.html', 'wb') as debug_file:
                debug_file.write(response.content)
            cprint(f"[INFO]\tWarning: The downloaded file is not a PDF. Content-Type: {response.headers.get('Content-Type')}", silent=silent)
            cprint("[INFO]\tDebug: Saved response content to debug_output.html", silent=silent)
            return None
        # Save content to file
        with open(fileoutput, 'wb') as file:
            file.write(response.content)
        cprint(f'[INFO]\tDownload success: {fileoutput}', silent=silent)
        return fileoutput
    elif try_cloudscraper:
        response = get_image_with_retries(url, headers=headers)
        if filetype.lower() not in response.headers.get('Content-Type', '').lower():
            cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
            return None
        with open(fileoutput, 'wb') as file:
            file.write(response.content)
    else:
        (file, headers) = urllib.request.urlretrieve(url, filename=fileoutput)
        if filetype.lower() not in headers['Content-Type'].lower():
            cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
            return None
        return file
It works as intended, but let me know if you prefer a different approach. I could check for lcsc.com in the URL sooner (as is done for www.ti.com), but I'm not sure all datasheets have been ported to the new platform yet, so I just added conditions to avoid breaking older setups.
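One possible way to tidy this up would be to move the host check and the URL extraction into two small helpers that can be tested without a network call. The following is only a minimal sketch under assumptions: the names is_lcsc_url, extract_pdf_url and PDF_SRC_PATTERN are hypothetical and not part of tools.py, the example URLs are made up, and the pattern assumes the intermediate page references the PDF through a src attribute of some tag.

```python
import re
from typing import Optional
from urllib.parse import urljoin, urlparse

# Assumption: the intermediate page embeds the PDF via a src="...pdf" attribute
PDF_SRC_PATTERN = re.compile(r'<[^>]+\bsrc="([^"]+\.pdf)"', re.IGNORECASE)


def is_lcsc_url(url: str) -> bool:
    """Return True if the URL's host is lcsc.com or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    return host == 'lcsc.com' or host.endswith('.lcsc.com')


def extract_pdf_url(html: str, base_url: str) -> Optional[str]:
    """Return the first PDF link found in a src attribute, or None."""
    match = PDF_SRC_PATTERN.search(html)
    if not match:
        return None
    # Resolve relative links against the page that served the HTML
    return urljoin(base_url, match.group(1))
```

Matching on the parsed hostname instead of the substring 'lcsc.com' avoids false positives when the string only appears in a path or query parameter, and urljoin keeps the helper working if the page uses a relative link. In the download function, is_lcsc_url(response.url) and extract_pdf_url(response.text, response.url) could then replace the inline check and regex.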