LCSC datasheet fetch fails due to redirect

When trying to upload the datasheet of an LCSC component, the download fails with `[INFO]  Warning: PDF download returned the wrong file type`. Inspecting the URL with curl shows a redirect: `HTTP/1.1 301 Moved Permanently`. The `requests` call in tools.py did not appear to follow redirects, so I enabled them explicitly (`allow_redirects=True`) and added a check of the final URL. It then kept failing: it turns out that one of the redirects lands on an intermediate .html page that contains the final URL of the PDF. The code below handles extracting the URL from that intermediate page. It could be prettified a bit; it's quite blunt right now.
<pre>if filetype == 'Image' or filetype == 'PDF':
            # Enable use of requests library for downloading files (some URLs do NOT work with urllib)
            if requests_lib:
                # Follow redirects automatically
                response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
                
                # Check final URL after redirects
                cprint(f'[INFO]\tFinal URL after redirects: {response.url}', silent=silent)
                
                # Separate handling for lcsc.com URLs
                if 'lcsc.com' in url:
                    if 'text/html' in response.headers.get('Content-Type', '').lower():
                        match = re.search(r'<iframe[^>]+src="([^"]+\.pdf)"', response.text)
                        if match:
                            pdf_url = match.group(1)
                            cprint(f"[INFO]\tExtracted PDF URL: {pdf_url}", silent=silent)
                            response = requests.get(pdf_url, headers=headers, timeout=timeout, allow_redirects=True)
                        else:
                            cprint("[INFO]\tWarning: No PDF URL found in the HTML content.", silent=silent)
                            return None

                # Check if file is pdf
                if 'application/pdf' not in response.headers.get('Content-Type', '').lower():
                    with open('debug_output.html', 'wb') as debug_file:
                        debug_file.write(response.content)
                    cprint(f"[INFO]\tWarning: The downloaded file is not a PDF. Content-Type: {response.headers.get('Content-Type')}", silent=silent)
                    cprint("[INFO]\tDebug: Saved response content to debug_output.html", silent=silent)
                    return None
                
                # Save content to file
                with open(fileoutput, 'wb') as file:
                    file.write(response.content)
                
                cprint(f'[INFO]\tDownload success: {fileoutput}', silent=silent)
                return fileoutput
            elif try_cloudscraper:
                response = get_image_with_retries(url, headers=headers)
                if filetype.lower() not in response.headers.get('Content-Type', '').lower():
                    cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
                    return None
                with open(fileoutput, 'wb') as file:
                    file.write(response.content)
            else:
                (file, headers) = urllib.request.urlretrieve(url, filename=fileoutput)
                if filetype.lower() not in headers['Content-Type'].lower():
                    cprint(f'[INFO]\tWarning: {filetype} download returned the wrong file type', silent=silent)
                    return None
            return fileoutput</pre>


It works as intended, but let me know if you prefer a different approach. I could check for lcsc.com in the URL sooner (as is done for www.ti.com), but I'm not sure all datasheets have been ported to the new platform yet, so I just added conditions to avoid breaking older setups.
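
As one possible way to tidy this up, the iframe extraction could be pulled into a small helper. This is only a rough sketch: `extract_pdf_url` and `_IFRAME_PDF_RE` are hypothetical names that do not exist in tools.py, the `example.com` URLs are placeholders, and it assumes the intermediate page keeps embedding the PDF in an `<iframe src="...">`. Compared with the inline version, it also accepts single-quoted attributes and resolves relative `src` values against the page URL.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Hypothetical pattern: an <iframe> whose src ends in .pdf,
# optionally followed by a query string.
_IFRAME_PDF_RE = re.compile(
    r'<iframe[^>]+src=["\']([^"\']+\.pdf(?:\?[^"\']*)?)["\']',
    re.IGNORECASE,
)


def extract_pdf_url(html: str, base_url: str) -> Optional[str]:
    """Return the absolute PDF URL embedded in an intermediate HTML page, or None."""
    match = _IFRAME_PDF_RE.search(html)
    if not match:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL of the page that contained the iframe.
    return urljoin(base_url, match.group(1))


if __name__ == '__main__':
    # Placeholder HTML and URLs, for illustration only
    page = '<html><body><iframe id="viewer" src="/files/part.pdf"></iframe></body></html>'
    print(extract_pdf_url(page, 'https://example.com/datasheet/page.html'))
```

In the download function, the lcsc.com branch could then call `extract_pdf_url(response.text, response.url)` and return `None` when the helper finds nothing.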



LCSC datasheet fetch fails due to redirect #280
