Make built-in adapters' identifiers configurable #247
lafrenierejm wants to merge 4 commits into phiresky:master
|
@phiresky This is ready for your review whenever you get the chance. |
|
@phiresky Bumping the request for review. |
|
In general this seems like a good idea, but I'm not sure about the approach.
Maybe it would be better and simpler to use a syntax like As in, you specify pairs of extensions [a,b] and every file with extension a is treated as if it had extension b. That way you also don't need the additional mapping of how the adapter should treat the file internally (only relevant for the The only change that would be needed is that the "fake" extension needs to be given to the adapter. Since some files given to an adapter already don't actually exist on the FS (e.g. within zips), this can potentially be done by just changing
    /// file path. May not be an actual file on the file system (e.g. in an archive). Used for matching file extensions.
    pub filepath_hint: PathBuf,
Then no other changes are required per adapter, and the override also works to temporarily override extensions of custom adapters a user has configured. |
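As a rough illustration of the remapping idea in this comment, the sketch below rewrites the extension of a path hint before it would reach an adapter. It is not taken from the rga code base: the function name `remap_filepath_hint` and the `HashMap` representation of the override pairs are assumptions made for the example, and only `filepath_hint` comes from the comment above.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of rga): returns the path hint an adapter
/// would see after applying extension overrides such as "jar" -> "zip".
/// The hint may not exist on disk, so only the path string is manipulated.
fn remap_filepath_hint(hint: &Path, overrides: &HashMap<String, String>) -> PathBuf {
    // Take the last extension (if any) and lowercase it, so "APP.JAR"
    // is looked up under the key "jar".
    let ext = hint
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_lowercase());

    match ext.and_then(|e| overrides.get(&e)) {
        // Swap the real extension for the "fake" one the adapter should see.
        Some(target) => hint.with_extension(target),
        // No override configured: pass the hint through unchanged.
        None => hint.to_path_buf(),
    }
}
```

With an override map containing `jar -> zip`, a hint of `libs/app.jar` would be handed to the adapters as `libs/app.zip`, while paths without a matching override stay untouched.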
|
Going the complete other way: This problem seems to really only have appeared for But probably the general solution above is better |
Settings like Anyway, sorry for my delay in reviewing, but I'm always happy to see things growing :) |
|
sqlite databases can also be named custom things. vscode names them things like "state.vscdb" -- I'm trying to extract my Cursor LLM conversations for example, but I don't know which workspace has what uuid. rga seemed like a great fit, but the extension issue cropped up. |
|
Great example of extension remapping also being useful for other purposes. @lafrenierejm would you be willing to update/rewrite your implementation to use `rga --rga-additional-extensions jar=zip,xlsx=zip,vscdb=sqlite3` and in the config file: or instead of modifying individual adapters? @perplexes: note that `--rga-accurate` should work for your case though |
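For illustration only, a value in the `from=to,from=to` shape proposed above could be parsed roughly as follows. The function name `parse_additional_extensions`, the error messages, and the choice to lowercase both sides are assumptions of this sketch, not behavior of rga.

```rust
use std::collections::HashMap;

/// Hypothetical parser (not part of rga) for a value such as
/// "jar=zip,xlsx=zip,vscdb=sqlite3". Returns a map from the extension
/// found on disk to the extension it should be treated as.
fn parse_additional_extensions(spec: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    // Split on commas, ignoring surrounding whitespace and empty entries.
    for pair in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        // Each entry must have the form `from=to`.
        let (from, to) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected `from=to`, got `{pair}`"))?;
        let (from, to) = (from.trim(), to.trim());
        if from.is_empty() || to.is_empty() {
            return Err(format!("empty extension in `{pair}`"));
        }
        // Assumption: extensions are compared case-insensitively.
        map.insert(from.to_lowercase(), to.to_lowercase());
    }
    Ok(map)
}
```

Returning a `Result` keeps malformed entries such as a bare `jar` visible to the user instead of silently dropping them.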
Certainly! I think I will have time to do so within the next week or so, but I can't promise that. |
This will allow end users to provide their own lists of extensions and/or mimetypes for each of the built-in adapters.
@phiresky The initial refactor for this is done. I went ahead and exposed mimetypes in addition to extensions. I named the options I haven't implemented thorough tests yet. That should be done before this PR is considered ready for merge. |
|
Ha, I was just making a feature request for this, and found the PR while looking for similar issues. Thank you for working on this! However, looking at the diff, I could not figure out whether more complex extensions would be supported by this change, such as
Original feature request
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
Describe alternatives you've considered
Additional context |
This will allow end users to provide their own lists of extensions and/or mimetypes for each of the built-in adapters.
This feature would remove the need for feature requests such as:
- .als file support #185

The functionality proposed here is a superset of that in #244. That PR makes only the Zip adapter's extensions configurable, whereas this exposes the extensions and mimetypes of all built-in adapters for end-user configurability.
Output of `cargo run --bin=rga -- --rga-print-config-schema` from this branch.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "rga configuration",
  "description": "this is kind of a \"polyglot\" struct, since it serves three functions\n\n1. describing the command line arguments using structopt+clap and for man page / readme generation 2. describing the config file format (output as JSON schema via schemars)",
  "type": "object",
  "properties": {
    "accurate": {
      "description": "Use more accurate but slower matching by mime type\n\nBy default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).",
      "type": "boolean"
    },
    "adapters": {
      "description": "Change which adapters to use and in which priority order (descending)\n\n\"foo,bar\" means use only adapters foo and bar. \"-bar,baz\" means use all default adapters except for bar and baz. \"+bar,baz\" means use all default adapters and also bar and baz.",
      "type": "array",
      "items": { "type": "string" }
    },
    "cache": { "$ref": "#/definitions/CacheConfig" },
    "max_archive_recursion": {
      "description": "Maximum nestedness of archives to recurse into\n\nWhen searching in archives, rga will recurse into archives inside archives. This option limits the depth.",
      "allOf": [ { "$ref": "#/definitions/MaxArchiveRecursion" } ]
    },
    "no_prefix_filenames": {
      "description": "Don't prefix lines of files within archive with the path inside the archive.\n\nInside archives, by default rga prefixes the content of each file with the file path within the archive. This is usually useful, but can cause problems because then the inner path is also searched for the pattern.",
      "type": "boolean"
    },
    "custom_adapters": {
      "type": [ "array", "null" ],
      "items": { "$ref": "#/definitions/CustomAdapterConfig" }
    },
    "custom_identifiers": {
      "anyOf": [ { "$ref": "#/definitions/CustomIdentifiers" }, { "type": "null" } ]
    }
  },
  "definitions": {
    "CacheConfig": {
      "type": "object",
      "properties": {
        "disabled": {
          "description": "Disable caching of results\n\nBy default, rga caches the extracted text, if it is small enough, to a database in ${XDG_CACHE_DIR-~/.cache}/ripgrep-all on Linux, ~/Library/Caches/ripgrep-all on macOS, or C:\\Users\\username\\AppData\\Local\\ripgrep-all on Windows. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.",
          "type": "boolean"
        },
        "max_blob_len": {
          "description": "Max compressed size to cache\n\nLongest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and recomputed every time.\n\nAllowed suffixes on command line: k M G",
          "allOf": [ { "$ref": "#/definitions/CacheMaxBlobLen" } ]
        },
        "compression_level": {
          "description": "ZSTD compression level to apply to adapter outputs before storing in cache db\n\nRanges from 1 - 22",
          "allOf": [ { "$ref": "#/definitions/CacheCompressionLevel" } ]
        },
        "path": {
          "description": "Path to store cache db",
          "allOf": [ { "$ref": "#/definitions/CachePath" } ]
        }
      }
    },
    "CacheMaxBlobLen": { "type": "integer", "format": "uint", "minimum": 0.0 },
    "CacheCompressionLevel": { "type": "integer", "format": "int32" },
    "CachePath": { "type": "string" },
    "MaxArchiveRecursion": { "type": "integer", "format": "int32" },
    "CustomAdapterConfig": {
      "type": "object",
      "required": [ "args", "binary", "description", "extensions", "mimetypes", "name", "version" ],
      "properties": {
        "name": {
          "description": "the unique identifier and name of this adapter. Must only include a-z, 0-9, _",
          "type": "string"
        },
        "description": {
          "description": "a description of this adapter. shown in help",
          "type": "string"
        },
        "disabled_by_default": {
          "description": "if true, the adapter will be disabled by default",
          "type": [ "boolean", "null" ]
        },
        "version": {
          "description": "version identifier. used to key cache entries, change if the configuration or program changes",
          "type": "integer",
          "format": "int32"
        },
        "extensions": {
          "description": "the file extensions this adapter supports. For example [\"epub\", \"mobi\"]",
          "type": "array",
          "items": { "type": "string" }
        },
        "mimetypes": {
          "description": "if not null and --rga-accurate is enabled, mime type matching is used instead of file name matching",
          "type": "array",
          "items": { "type": "string" }
        },
        "match_only_by_mime": {
          "description": "if --rga-accurate, only match by mime types, ignore extensions completely",
          "type": [ "boolean", "null" ]
        },
        "binary": {
          "description": "the name or path of the binary to run",
          "type": "string"
        },
        "args": {
          "description": "The arguments to run the program with. Placeholders: - $input_file_extension: the file extension (without dot). e.g. foo.tar.gz -> gz - $input_file_stem, the file name without the last extension. e.g. foo.tar.gz -> foo.tar - $input_virtual_path: the full input file path. Note that this path may not actually exist on disk because it is the result of another adapter\n\nstdin of the program will be connected to the input file, and stdout is assumed to be the converted file",
          "type": "array",
          "items": { "type": "string" }
        },
        "output_path_hint": {
          "description": "The output path hint. The placeholders are the same as for `.args`\n\nIf not set, defaults to \"${input_virtual_path}.txt\"\n\nSetting this is useful if the output format is not plain text (.txt) but instead some other format that should be passed to another adapter",
          "type": [ "string", "null" ]
        }
      }
    },
    "CustomIdentifiers": {
      "type": "object",
      "properties": {
        "bz2": {
          "description": "The identifiers to process as bz2 archives",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "ffmpeg": {
          "description": "The identifiers to process via ffmpeg",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "gz": {
          "description": "The identifiers to process as gz archives",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "xz": {
          "description": "The identifiers to process as xz archives",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "zip": {
          "description": "The identifiers to process as zip archives",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "zst": {
          "description": "The identifiers to process as zst archives",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        },
        "mbox": {
          "description": "The identifiers to process as mbox files",
          "anyOf": [ { "$ref": "#/definitions/CustomIdentifier" }, { "type": "null" } ]
        }
      }
    },
    "CustomIdentifier": {
      "type": "object",
      "properties": {
        "extensions": {
          "description": "The file extensions this adapter supports, for example `[\"gz\", \"tgz\"]`.",
          "type": [ "array", "null" ],
          "items": { "type": "string" }
        },
        "mimetypes": {
          "description": "If not null and --rga-accurate is enabled, mimetype matching is used instead of file name matching.",
          "type": [ "array", "null" ],
          "items": { "type": "string" }
        }
      }
    }
  }
}
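To make the `CustomIdentifier` entry in the schema easier to picture, here is a minimal sketch of how such a value might be consulted when choosing an adapter. The struct only mirrors the two schema fields; the function `adapter_matches`, its parameters, and the exact precedence between mimetypes and extensions are assumptions based on the schema descriptions, not the PR's actual implementation.

```rust
/// Mirrors the two optional fields of `CustomIdentifier` in the schema.
#[derive(Debug, Default, Clone)]
struct CustomIdentifier {
    extensions: Option<Vec<String>>,
    mimetypes: Option<Vec<String>>,
}

/// Hypothetical matching rule (an assumption, not rga's code): when
/// `accurate` is set and mimetypes are configured, only the detected
/// mimetype is considered; otherwise the file extension is compared.
fn adapter_matches(
    id: &CustomIdentifier,
    accurate: bool,
    extension: Option<&str>,
    detected_mime: Option<&str>,
) -> bool {
    if accurate {
        if let Some(mimes) = &id.mimetypes {
            // Mimetype matching replaces file name matching in this mode.
            return detected_mime.map_or(false, |m| mimes.iter().any(|x| x.as_str() == m));
        }
    }
    match (&id.extensions, extension) {
        // Assumption: extensions are compared case-insensitively.
        (Some(exts), Some(ext)) => exts.iter().any(|x| x.eq_ignore_ascii_case(ext)),
        _ => false,
    }
}
```

Under these assumptions, a `zip` identifier configured with the extensions `zip` and `jar` would accept `app.jar` in the default mode, and with `--rga-accurate` it would decide purely on the detected mimetype.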