Index error in docker-compose article-relevance-prediction. #103

@SimonGoring

Running docker-compose from the root directory, I am now hitting a new error: a KeyError on column labels.

simon@partyLaptop:~/Documents/Neotoma/MetaExtractor$ docker-compose up article-relevance-prediction
Starting metaextractor_article-relevance-prediction_1 ... done
Attaching to metaextractor_article-relevance-prediction_1
article-relevance-prediction_1  | 2023-07-24 18:04:08,781 - gdd_api_query.py:113 - get_new_gdd_articles - INFO - Querying by n_recent = 1000
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:151 - get_new_gdd_articles - INFO - 1000 articles queried from GeoDeepDive (page 1).
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:174 - get_new_gdd_articles - INFO - GeoDeepDive query completed.
article-relevance-prediction_1  | 2023-07-24 18:04:09,854 - gdd_api_query.py:197 - get_new_gdd_articles - INFO - 1000 articles returned from GeoDeepDive.
article-relevance-prediction_1  | 2023-07-24 18:04:12,763 - relevance_prediction_parquet.py:57 - crossref_extract - INFO - Running crossref_extract function.
article-relevance-prediction_1  | 2023-07-24 18:04:12,766 - relevance_prediction_parquet.py:77 - crossref_extract - INFO - Querying CrossRef API for article metadata.
article-relevance-prediction_1  | 2023-07-24 18:10:54,843 - relevance_prediction_parquet.py:98 - crossref_extract - INFO - CrossRef API query completed for 1000 articles.
article-relevance-prediction_1  | 2023-07-24 18:10:54,877 - relevance_prediction_parquet.py:164 - data_preprocessing - INFO - Prediction data preprocessing begin.
article-relevance-prediction_1  | /app/src/article_relevance/relevance_prediction_parquet.py:178: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
article-relevance-prediction_1  |   metadata_df.loc[valid_condition, 'has_abstract'] = metadata_df.loc[valid_condition, "abstract"].isnull()
article-relevance-prediction_1  | 2023-07-24 18:10:54,896 - relevance_prediction_parquet.py:189 - data_preprocessing - INFO - Running article language imputation.
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:201 - data_preprocessing - INFO - 81 articles require language imputation
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:203 - data_preprocessing - INFO - 81 cannot be imputed due to too short text metadata(title, subtitle and abstract less than 5 character).
article-relevance-prediction_1  | 2023-07-24 18:10:54,905 - relevance_prediction_parquet.py:213 - data_preprocessing - INFO - Missing language imputation completed
article-relevance-prediction_1  | 2023-07-24 18:10:54,906 - relevance_prediction_parquet.py:214 - data_preprocessing - INFO - After imputation, there are 1000 non-English articles in total excluded from the prediction pipeline.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:238 - data_preprocessing - INFO - 0 articles has missing feature and its relevance cannot be predicted.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:239 - data_preprocessing - INFO - Data preprocessing completed.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:257 - add_embeddings - INFO - Sentence embedding start.
Downloading (…)2c72f/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 3.53MB/s]
Downloading (…)be7662c72f/README.md: 100%|██████████| 8.09k/8.09k [00:00<00:00, 22.2MB/s]
Downloading (…)7662c72f/config.json: 100%|██████████| 754/754 [00:00<00:00, 2.62MB/s]
Downloading pytorch_model.bin: 100%|██████████| 440M/440M [00:10<00:00, 40.7MB/s] 
Downloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 810kB/s]
Downloading (…)2c72f/tokenizer.json: 100%|██████████| 717k/717k [00:00<00:00, 5.29MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 453/453 [00:00<00:00, 1.59MB/s]
Downloading (…)be7662c72f/vocab.txt: 100%|██████████| 228k/228k [00:00<00:00, 3.56MB/s]
article-relevance-prediction_1  | No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/allenai_specter2. Creating a new one with MEAN pooling.
article-relevance-prediction_1  | 2023-07-24 18:11:09,041 - relevance_prediction_parquet.py:275 - add_embeddings - INFO - Sentence embedding completed.
article-relevance-prediction_1  | 2023-07-24 18:11:09,050 - relevance_prediction_parquet.py:294 - relevance_prediction - INFO - Prediction start.
article-relevance-prediction_1  | 2023-07-24 18:11:09,064 - relevance_prediction_parquet.py:307 - relevance_prediction - INFO - Running prediction for 0 articles.
article-relevance-prediction_1  | Traceback (most recent call last):
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 456, in <module>
article-relevance-prediction_1  |     main()
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 445, in main
article-relevance-prediction_1  |     predicted = relevance_prediction(embedded, model_path, predict_thld = 0.5)
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 311, in relevance_prediction
article-relevance-prediction_1  |     nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
article-relevance-prediction_1  |     return self._getitem_tuple(key)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1256, in _getitem_tuple
article-relevance-prediction_1  |     return self._getitem_tuple_same_dim(tup)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 924, in _getitem_tuple_same_dim
article-relevance-prediction_1  |     retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
article-relevance-prediction_1  |     return self._getitem_iterable(key, axis=axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
article-relevance-prediction_1  |     keyarr, indexer = self._get_listlike_indexer(key, axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
article-relevance-prediction_1  |     keyarr, indexer = ax._get_indexer_strict(key, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
article-relevance-prediction_1  |     self._raise_if_missing(keyarr, indexer, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6133, in _raise_if_missing
article-relevance-prediction_1  |     raise KeyError(f"{not_found} not in index")
article-relevance-prediction_1  | KeyError: "['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', 
'294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', '517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578', 
'579', '580', '581', '582', '583', '584', '585', '586', '587', '588', '589', '590', '591', '592', '593', '594', '595', '596', '597', '598', '599', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', '610', '611', '612', '613', '614', '615', '616', '617', '618', '619', '620', '621', '622', '623', '624', '625', '626', '627', '628', '629', '630', '631', '632', '633', '634', '635', '636', '637', '638', '639', '640', '641', '642', '643', '644', '645', '646', '647', '648', '649', '650', '651', '652', '653', '654', '655', '656', '657', '658', '659', '660', '661', '662', '663', '664', '665', '666', '667', '668', '669', '670', '671', '672', '673', '674', '675', '676', '677', '678', '679', '680', '681', '682', '683', '684', '685', '686', '687', '688', '689', '690', '691', '692', '693', '694', '695', '696', '697', '698', '699', '700', '701', '702', '703', '704', '705', '706', '707', '708', '709', '710', '711', '712', '713', '714', '715', '716', '717', '718', '719', '720', '721', '722', '723', '724', '725', '726', '727', '728', '729', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '740', '741', '742', '743', '744', '745', '746', '747', '748', '749', '750', '751', '752', '753', '754', '755', '756', '757', '758', '759', '760', '761', '762', '763', '764', '765', '766', '767'] not in index"
metaextractor_article-relevance-prediction_1 exited with code 1

The error seems to come from this line in the relevance_prediction() function:

nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)

I'll try a bit of debugging to see why it's popping up.
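
As a starting point for that debugging, here is a minimal sketch of how this kind of KeyError arises and how the failure could be made easier to read. Two details in the log may be relevant: preprocessing reports 1000 non-English articles excluded, prediction then runs for 0 articles, and the labels reported missing are the 768 strings '0' through '767'. The frame below is a hypothetical stand-in for valid_df (its real columns are not shown in this issue), and missing_columns and nan_rows are illustrative helpers, not functions from the repository.

```python
import pandas as pd

# Labels reported missing in the KeyError: the strings '0' through '767'.
embedding_cols = [str(i) for i in range(768)]

# Hypothetical stand-in for valid_df: zero rows and metadata columns only.
# The real column layout is an assumption.
valid_df = pd.DataFrame({
    "DOI": pd.Series(dtype="object"),
    "has_abstract": pd.Series(dtype="bool"),
})


def missing_columns(df, cols):
    """Return the requested column labels that df does not have."""
    return [c for c in cols if c not in df.columns]


def nan_rows(df, cols):
    """Boolean mask of rows with a null in any of cols.

    Raises a short ValueError instead of a 768-item KeyError when
    some of the requested columns are absent.
    """
    absent = missing_columns(df, cols)
    if absent:
        raise ValueError(
            f"{len(absent)} feature columns missing "
            f"(first: {absent[0]!r}); frame has {len(df)} rows"
        )
    return df.loc[:, cols].isnull().any(axis=1)


# Selecting absent labels with .loc raises KeyError, as in the traceback.
try:
    valid_df.loc[:, embedding_cols]
    raised = False
except KeyError:
    raised = True
```

Calling missing_columns(valid_df, feature_col) just before line 311 would show whether the embedding columns are absent from the frame at that point, and len(valid_df) would confirm whether it is empty.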
