LCSC datasheet fetch fails due to redirect #280

@Keybored02

When trying to upload the datasheet of an LCSC component, the download fails with [INFO] Warning: PDF download returned the wrong file type. Looking at the URL with curl shows a redirect: HTTP/1.1 301 Moved Permanently. Redirects did not seem to be followed by the requests call in tools.py, so I set allow_redirects=True explicitly and added a final URL check. It still kept failing: it turns out that one of the redirects lands on an intermediate .html page that contains the final URL of the PDF. The code below handles extracting the URL from that intermediate page. It could be prettified a bit; it's quite blunt right now.
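Before the patch itself, here is a quick way to look at the same redirect chain from Python instead of curl. It is only a sketch: `describe_redirects` is a made-up helper name, and the stand-in objects and example.com URLs are placeholders so it runs without network access. It relies only on the `history`, `url` and `status_code` attributes that a `requests.Response` exposes.

```python
from types import SimpleNamespace


def describe_redirects(response):
    """Return one 'status url' string per hop, ending with the final response.

    Intended for a requests.Response: `history` holds the intermediate
    (redirect) responses, `url` and `status_code` describe the final one.
    """
    hops = [f"{r.status_code} {r.url}" for r in response.history]
    hops.append(f"{response.status_code} {response.url}")
    return hops


# Stand-in objects with placeholder URLs (not real LCSC links).
first_hop = SimpleNamespace(
    status_code=301, url="https://example.com/old.pdf", history=[]
)
final = SimpleNamespace(
    status_code=200, url="https://example.com/viewer.html", history=[first_hop]
)

for line in describe_redirects(final):
    print(line)
```

With a real download, the same function could be called on the object returned by `requests.get(...)` to see whether the last hop is the PDF or an HTML page.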

        if filetype == 'Image' or filetype == 'PDF':
            # Enable use of requests library for downloading files (some URLs do NOT work with urllib)
            if requests_lib:
                # Follow redirects automatically
                response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
                
                # Check final URL after redirects
                cprint(f'[INFO]\tFinal URL after redirects: {response.url}', silent=silent)
                
                # Separate handling for lcsc.com URLs
                if 'lcsc.com' in url:
                    if 'text/html' in response.headers.get('Content-Type', '').lower():
                        # Look for any tag whose src attribute points to a .pdf file
                        match = re.search(r'<[^>]+src="([^"]+\.pdf)"', response.text)
                        if match:
                            pdf_url = match.group(1)
                            cprint(f"[INFO]\tExtracted PDF URL: {pdf_url}", silent=silent)
                            response = requests.get(pdf_url, headers=headers, timeout=timeout, allow_redirects=True)
                        else:
                            cprint("[INFO]\tWarning: No PDF URL found in the HTML content.", silent=silent)
                            return None

                # Check that the response matches the requested file type (Image or PDF)
                if filetype.lower() not in response.headers.get('Content-Type', '').lower():
                    with open('debug_output.html', 'wb') as debug_file:
                        debug_file.write(response.content)
                    cprint(f"[INFO]\tWarning: {filetype} download returned the wrong file type. Content-Type: {response.headers.get('Content-Type')}", silent=silent)
                    cprint("[INFO]\tDebug: Saved response content to debug_output.html", silent=silent)
                    return None
                
                # Save content to file
                with open(fileoutput, 'wb') as file:
                    file.write(response.content)
                
                cprint(f'[INFO]\tDownload success: {fileoutput}', silent=silent)
                return fileoutput
            elif try_cloudscraper:
                response = get_image_with_retries(url, headers=headers)
                if filetype.lower() not in response.headers.get('Content-Type', '').lower():
                    cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
                    return None
                with open(fileoutput, 'wb') as file:
                    file.write(response.content)
            else:
                (file, headers) = urllib.request.urlretrieve(url, filename=fileoutput)
                if filetype.lower() not in headers['Content-Type'].lower():
                    cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
                    return None
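            # Return the output path (the cloudscraper branch would otherwise return a closed file object)
            return fileoutput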

It works as intended, but let me know if you prefer a different approach. I could check for lcsc.com in the URL sooner (as it's done for www.ti.com), but I'm not sure all datasheets have been ported to the new platform yet, so I just added conditions to avoid breaking older setups.
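On the prettifying side, one possible direction is sketched here. None of this exists in tools.py: `extract_pdf_url`, `is_lcsc_url` and `_PdfLinkParser` are hypothetical names, and the sample HTML and example.com URL are made up. The idea is to use the standard library's HTML parser instead of a regex, resolve relative links against the page URL, and match the hostname rather than a substring of the whole URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PdfLinkParser(HTMLParser):
    """Collect src/href attribute values whose path ends in .pdf."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Check the path only, so a query string does not hide the extension
                if urlparse(value).path.lower().endswith(".pdf"):
                    self.pdf_links.append(value)


def extract_pdf_url(html_text, base_url):
    """Return the first PDF link in html_text as an absolute URL, or None."""
    parser = _PdfLinkParser()
    parser.feed(html_text)
    if not parser.pdf_links:
        return None
    # urljoin leaves absolute links untouched and resolves relative ones
    return urljoin(base_url, parser.pdf_links[0])


def is_lcsc_url(url):
    """True if the URL's host is lcsc.com or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "lcsc.com" or host.endswith(".lcsc.com")


# Made-up intermediate page, only to exercise the helpers.
sample_html = '<html><body><iframe src="/files/demo.pdf?x=1"></iframe></body></html>'
print(extract_pdf_url(sample_html, "https://example.com/viewer/page.html"))
print(is_lcsc_url("https://www.lcsc.com/datasheet/part.pdf"))
```

The hostname check avoids false positives such as a URL on another site that merely carries lcsc.com in its query string, which the `'lcsc.com' in url` test would accept.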

Labels: enhancement
