I want to remove a list of words from a string.
static string[] BAD_WORDS = {
"hdtv", "exm", "RMT", "DD5", "YTS", "TURKISH", "VIDEOFLIX", "Gisaengchung", "KOREAN", "8CH",
"BluRay", "Hdcam", "-", "XviD", "AC3", "EVO", "WEBRip", "FGT", "MP3", "CMRG", "Pahe", "webdl",
"10bit", "720p", "1080p", "480p", "WEB-DL", "H264", "H265", "x264", "x265", "800MB", "900MB",
"HEVC", "PSA", "RARBG", "6CH", "2CH", "CAMRip", "Rip", "AVS", "RMX", "RMTeam", "mSD", ".",
"SVA", "MkvCage", "MeGusta", "TBS", "AMZN", "DDP5.1", "DDP5", "SHITBOX", "NITRO", "WEB", "DL",
"1080", "720", "480", "MrMovie", "BWBP", "NTG", "HMAX", "Atmos", "MZABI", "2018", "2019", "2020",
"2021", "2022", "MRCS", "/", "GalaxyRG", "HDR", "YTS.LT", "1400MB", "H.264", "H.265", "YTS.MX",
"DV", "PSiG", "ION10", "NTb", "SYNCOPY", "PHOENIX", "MinX", "300MB", "150MB", "AFG", "Cakes",
"2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "@Gemovies", "M3", "DD5.1"
};
I tried two approaches: one uses LINQ, the other uses a regex.
public static string RemoveBadWords(string stringToClean)
{
    //var cleaned = string.Join(" ", stringToClean.Split(new string[] { " ", ".", "-" }, StringSplitOptions.None).Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
    var cleaned = Regex.Replace(stringToClean, "\\b" + string.Join("\\b|\\b", BAD_WORDS) + "\\b", " ", RegexOptions.IgnoreCase);
    return cleaned.Trim();
}
So far the regex version seems to do better. For example, the output of this code is:
Console.WriteLine(RemoveBadWords("Mortal.Kombat.2021.1080p.WEB-DL.DD5.1.H.264.EVO.M3"));
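LINQ: Mortal Kombat 1 H 264
Regex: Mortal Kombat 264
My questions are:
Why is H.264 not removed by the regex approach? (Only the "H." is removed.)
Which approach is faster?
Is the regex approach correct? Can it be improved so that it makes fewer mistakes?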
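Part of your problem is that some of your words are substrings of other words. By the time you try to replace the longer variant, it no longer exists, because part of it has already been removed. The fix is to process the words in reverse sorted order, so the longer variants are handled first. In the regex version the entries are also not escaped, so the "." entry becomes \b.\b, which matches any single character standing between two word boundaries. That entry comes before "H.264" in the alternation, so it consumes the lone "H" first and "264" is left behind. The other problem, in the LINQ version, is that you split on ".", which is itself part of some of your bad words.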
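To make the escaping point concrete, here is a minimal sketch that reduces the pattern to two alternatives. The class name DotDemo and its method names are made up for illustration. The only library calls used are Regex.Replace and Regex.Escape.

```csharp
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Unescaped: "." is a wildcard, so \b.\b matches any single character
    // that sits between two word boundaries, such as the "H" in ".H.".
    // That alternative comes first, so "H.264" never gets a chance to match.
    public static string Unescaped(string input) =>
        Regex.Replace(input, @"\b.\b|\bH.264\b", " ", RegexOptions.IgnoreCase);

    // Escaped, with the longer alternative first: "H.264" is matched as a
    // literal unit, and only real periods are matched by the second branch.
    public static string Escaped(string input) =>
        Regex.Replace(input,
            @"\b" + Regex.Escape("H.264") + @"\b|" + Regex.Escape("."),
            " ", RegexOptions.IgnoreCase);
}
```

For the input "Kombat.H.264.EVO", Unescaped leaves "264" in the result, while Escaped removes the whole "H.264" token.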
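There are several ways to do this. Regular expressions are usually not the answer, but one works here (although its performance may not be great, if that matters). First we sort the bad words in descending order, then we use Aggregate to replace each bad word in turn. We need Regex.Escape to make sure the embedded "." characters are not interpreted as special characters. Finally, we make one last pass to remove any remaining periods and whitespace. You also have to remove the "." entry from your original bad-word list.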
var words = BAD_WORDS.OrderByDescending(c => c);
var result = words.Aggregate(stringToClean, (p,c) => Regex.Replace(p, "\\b" + Regex.Escape(c) + "\\b", "", RegexOptions.IgnoreCase));
result = Regex.Replace(result, "[\\.\\s]+", " ").Trim();
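I wouldn't recommend this code for a production scenario, but you are clearly cleaning up torrent movie file names, so I'm sure you won't really mind.
For fun, here is a mostly regex-based solution using the current word list: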
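As a variation on the same idea (not the code above), the sort-and-escape steps can also be folded into one alternation that is built once and applied in a single pass. This is only a sketch under some assumptions: TitleCleaner and its members are hypothetical names, the word list is a shortened subset chosen for illustration, and the entries are sorted by length rather than alphabetically so that "DD5.1" is tried before "DD5".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // Shortened, hypothetical subset of BAD_WORDS, for illustration only.
    // The bare "." and "-" entries are left out; separators are handled below.
    static readonly string[] BadWords =
    {
        "DD5", "DD5.1", "H264", "H.264", "WEB", "WEB-DL", "DL",
        "1080p", "1080", "EVO", "M3", "2021"
    };

    // Longest entries first, each one escaped, joined into a single alternation.
    static readonly Regex BadWordRegex = new Regex(
        @"\b(?:" + string.Join("|", BadWords
            .OrderByDescending(w => w.Length)
            .Select(w => Regex.Escape(w))) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Assumption: periods, hyphens, slashes and whitespace are all separators.
    static readonly Regex Separators = new Regex(@"[.\s\-/]+", RegexOptions.Compiled);

    public static string RemoveBadWords(string input) =>
        Separators.Replace(BadWordRegex.Replace(input, " "), " ").Trim();
}
```

With this subset, TitleCleaner.RemoveBadWords("Mortal.Kombat.2021.1080p.WEB-DL.DD5.1.H.264.EVO.M3") returns "Mortal Kombat". Treating "-" as a separator also splits hyphenated titles, which matches the original list's "-" entry but may not be what you want.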
var pattern = @"\b(?:hd(?:tv|cam|r)|e(?:xm|vo)|RMT|DDP?5(?:\.1)?|Turkish|VideoFlix|Gisaengchung|Korean|8CH|BluRay|-|XVid|A(?:c3|VS)|web(?:-?(?:rip|dl))?|DL|fgt|mp3|cmrg|pahe|10bit|(?:720|480|1080)[pi]?|H\.?26[45]|x26[45]|\d{3,}MB|H(?:MAX|EVC)|PS(?:A|iG)|RARBG|[26]CH|(?:CAM)?Rip|RM(?:X|Team)|msd|sva|mkvcage|megusta|tbs|amzn|shitbox|nitro|Mr(?:Movie|CS)|BWBP|NT[bG]|Atmos|MZABI|20(?:1\d|2[0-2])|\/|GalaxyRG|YTS(?:\.(?:LT|MX))?|DV|ION10|SYNCOPY|Phoenix|Minx|AFG|Cakes|@Gemovies|M3)\b";
var inner = new Regex(pattern, RegexOptions.IgnoreCase|RegexOptions.Compiled);
var outer = new Regex(@"[\.\s]+", RegexOptions.Compiled);
var result = outer.Replace(inner.Replace(stringToClean, ""), " ").Trim();
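On the speed question, the most reliable answer is to measure with your own inputs. The helper below is a rough sketch using System.Diagnostics.Stopwatch. Timing, Measure and the default iteration count are hypothetical choices, and a proper benchmarking library would give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the given cleaner repeatedly and returns the elapsed milliseconds.
    public static long Measure(Func<string, string> cleaner, string input, int iterations = 100_000)
    {
        // Warm-up call so one-time costs (JIT, regex construction) fall outside the timed loop.
        cleaner(input);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            cleaner(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Passing each variant as a delegate, for example Timing.Measure(RemoveBadWords, someFileName), lets you compare the LINQ, Aggregate and single-regex versions on the same input.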